Archive for June 30th, 2009

Thoughts From A DW Newbie

Lately at work I’ve been on a bit of a SQL kick, as I ease into a new role as “Data Process Expert”. While this won’t totally replace the other functions of my job (project / implementations manager, production support specialist, occasional code geek), I do see it becoming a large share of my time. Analysis / Reporting Services is a fairly hot topic these days, since we’re engaging in a concerted effort to centralize all reporting in the enterprise. It’s a young initiative, sure to have some bumps along the way, but I am enjoying being a part of it.

Anyway, I thought I would post my thoughts on two new skills I’ve been working on: Analysis Services design and Data Warehousing. If trends at work are indicative of anything elsewhere, BI and data analysis are on a lot of people’s lists. Being a total newbie, I think I’m coming in with a bit of a fresh perspective.

This time I’ll be talking about Data Warehousing. In no particular order, here are my thoughts:

  • Doing DW well is really tough. You have to have both a really good picture at a 30,000 foot level of where all your data is coming from, as well as a runway level view to understand the structure and format of individual fields.
  • Depending on your sources, you may spend a good amount of effort just cleaning up invalid data. In my case, I was aggregating metadata from several document management systems, and when forced into a system with real referential integrity, I was amazed as how many holes there were. Assume your data has gaps and design ways to handle them, whether it’s ignoring them or adding them to some “unknown” category.
  • If you’re aggregating data from different instances of the same software, think from the start about how your going to do so. For instance, are you going to want to separate the data for analysis, or will you want to see it all at once (or both)? Will you load it all in one shot, or separately?
  • This is blurring the lines a bit between this post and a future one (Analysis Services), but determining how granular you want your data to be in terms of time aggregation should be an early task. I had to basically redesign my extract process about halfway through after figuring out one of my data points would be better seen at a minute by minute basis.
  • Once you make that decision, put specific steps in your ETL process to ensure your data fits within those boundaries. You can’t assume because you see the first thousand rows in neat 30 minute intervals that all the data meets that standard. Again, spoken from late game experience.

Overall I’ve gained a great deal of respect for the folks who do good work in data warehousing, and lost respect for those that try and do it with a shoehorn mentality.

Share and Enjoy:
  • Digg
  • StumbleUpon
  • del.icio.us
  • Facebook
  • LinkedIn
  • Google Bookmarks
  • Technorati
  • FriendFeed
  • Slashdot
  • Twitter