Archive for June, 2009

Thoughts From A DW Newbie

Lately at work I’ve been on a bit of a SQL kick, as I ease into a new role as “Data Process Expert”. While this won’t totally replace the other functions of my job (project / implementations manager, production support specialist, occasional code geek), I do see it becoming a large share of my time. Analysis / Reporting Services is a fairly hot topic these days, since we’re engaging in a concerted effort to centralize all reporting in the enterprise. It’s a young initiative, sure to have some bumps along the way, but I am enjoying being a part of it.

Anyway, I thought I would post my thoughts on two new skills I’ve been working on: Analysis Services design and Data Warehousing. If trends at work are indicative of anything elsewhere, BI and data analysis are on a lot of people’s lists. Being a total newbie, I think I’m coming in with a bit of a fresh perspective.

This time I’ll be talking about Data Warehousing. In no particular order, here are my thoughts:

  • Doing DW well is really tough. You have to have both a really good picture at a 30,000 foot level of where all your data is coming from, as well as a runway level view to understand the structure and format of individual fields.
  • Depending on your sources, you may spend a good amount of effort just cleaning up invalid data. In my case, I was aggregating metadata from several document management systems, and when forced into a system with real referential integrity, I was amazed as how many holes there were. Assume your data has gaps and design ways to handle them, whether it’s ignoring them or adding them to some “unknown” category.
  • If you’re aggregating data from different instances of the same software, think from the start about how your going to do so. For instance, are you going to want to separate the data for analysis, or will you want to see it all at once (or both)? Will you load it all in one shot, or separately?
  • This is blurring the lines a bit between this post and a future one (Analysis Services), but determining how granular you want your data to be in terms of time aggregation should be an early task. I had to basically redesign my extract process about halfway through after figuring out one of my data points would be better seen at a minute by minute basis.
  • Once you make that decision, put specific steps in your ETL process to ensure your data fits within those boundaries. You can’t assume because you see the first thousand rows in neat 30 minute intervals that all the data meets that standard. Again, spoken from late game experience.

Overall I’ve gained a great deal of respect for the folks who do good work in data warehousing, and lost respect for those that try and do it with a shoehorn mentality.

Share and Enjoy:
  • Digg
  • StumbleUpon
  • del.icio.us
  • Facebook
  • LinkedIn
  • Google Bookmarks
  • Technorati
  • FriendFeed
  • Slashdot
  • Twitter

RTFM Josh!

One of my latest projects has been to get a VPN link up between my parent’s house and ours, so that I can help them out with computer issues without having to make the trek down to West Chester. Not that I don’t love visiting them (shamless parental plug, they do read this after all), but what with Taylor’s arrival and all we don’t get out much these days. Anyway, after thinking about how best to accomplish this I decided I should use a DD-WRT (open source alternative firmware) flashed Linksys router for the job, for a number of reasons:

  • Thanks to the generousity of a local gentlemen I know from Twitter I acquired one of these devices for free. Don’t ever say Twitter can’t get you stuff!
  • It’s low power and unobtrusive. While I love my big ol’ quad core monster, I didn’t really want to take up space at my parents house.
  • With DD-WRT, it functions great as an OpenVPN client.

Now, the setup was to look something like this:

VPN Diagram

(Please excuse the rather awful drawing, I’m afraid I’m used to Visio)

Once I had everything setup, everything appeared to work, except I couldn’t ping anything at my parent’s end of the tunnel from my home network (vice versa worked fine). Well, normally I might try and sniff the traffic at the two endpoints, but this was a little more complicated than you might think, since the router at my parents doesn’t support such an operation.

In then end, I was able to collect some data after hacking a solution (<geekery>I used an already compiled version of tcpdump and manually moved it to the router</geekry>), and figured out that for some reason the VPN router at my paren’t house was performing what’s known as “masquerading”, or “NATing”. What this means is that any traffic that passed out of it was translated such as to appear that it came directly from the router itself, and not some machine behind it. This novel concept is actually the basis for how pretty much every home router functions; but in this case it was bad, in that it made the tunnel between our networks essentially one way.

So what did I do? Well get royally annoyed of course, proceed to spend the next several hours pouring over the setup of both ends, trying various hacks, googling like crazy, and perhaps questioning my intellect (or that of the programmers of OpenVPN, DD-WRT, and Linux in general) a few times. Finally after giving up for a day, I decided I’d look at a tutorial on the OpenVPN site itself I remembered. It took awhile to find, but there was the answer, right in front of me.

Apparently OpenVPN needs a little special tweaking to allow for traffic from both the “client” (my parent’s house) and “server” (my house) networks to fully talk to each other:

<geekery>

Next, we will deal with the necessary configuration changes on the server side. If the server configuration file does not currently reference a client configuration directory, add one now:

client-config-dir ccd

In the above directive, ccd should be the name of a directory which has been pre-created in the default directory where the OpenVPN server daemon runs. On Linux this tends to be /etc/openvpn and on Windows it is usually \Program Files\OpenVPN\config. When a new client connects to the OpenVPN server, the daemon will check this directory for a file which matches the common name of the connecting client. If a matching file is found, it will be read and processed for additional configuration file directives to be applied to the named client.

The next step is to create a file called client2 in the ccd directory. This file should contain the line:

iroute 192.168.4.0 255.255.255.0

This will tell the OpenVPN server that the 192.168.4.0/24 subnet should be routed to client2.

Next, add the following line to the main server config file (not the ccd/client2 file):

route 192.168.4.0 255.255.255.0

Why the redundant route and iroute statements, you might ask? The reason is that route controls the routing from the kernel to the OpenVPN server (via the TUN interface) while iroute controls the routing from the OpenVPN server to the remote clients. Both are necessary.

Yep, you read correctly. “Both are necessary.” Long story short, I only had one of the two necessary configurations in place. Basically, while the normal

route 192.168.a.b 255.255.255.0

command got the packets as far as the OpenVPN tunnel, the software required that extra

iroute 192.168.a.b 255.255.255.0

configuration line to correctly send the traffic on its merry way from there. Without it, the traffic would just appear to go out the proper interface, but actually go to that awful place where packets disappear and are never seen again.

</geekery>

Hi everyone, my name is Josh, and yes, I admit it, I should have read the… oh hell you know what the rest means.

Share and Enjoy:
  • Digg
  • StumbleUpon
  • del.icio.us
  • Facebook
  • LinkedIn
  • Google Bookmarks
  • Technorati
  • FriendFeed
  • Slashdot
  • Twitter