HA? We don’t need no stinkin’ HA (Part 1)

Posted by Josh | Posted in SQL Server, The Rookie DBA | Posted on 03-21-2011

Tags: ,

3

As a DBA, one of our primary responsibilities is always to ensure that we have a solid availability plan in place for the databases we support. Naturally the plan itself could vary greatly depending upon the product using the database in question. An internal-only reporting database might not require a double redundant, cluster and mirroring backed setup, but a time critical trading application or a patient tracking system for a hospital might well. The real point is, as long as the strategy meets the needs of our business and has been thoroughly vetted and tested, we’re on solid ground.

While I’ve found that it rarely takes much convincing to get support behind availability planning with regard to production systems, I usually find the opposite attitude present with regards to the development environment. When I took over in my new position as the development DBA, about 90% of all my servers had no backups (SQL or OS wise), no build documentation (i.e. who to call if it goes down, what customization such as linked servers or custom assemblies are present, etc), not even so much as a maintenance plan doing regular integrity checks. I was horrified; what happens if one of the databases gets corrupted? What about if a server goes down? Hell, what about if the data center gets flooded or burns down, and we’re forced into a company wide disaster scenario?

The almost universal answers was “It’s development, it doesn’t matter.”

To some degree, I agree with this. If a development server goes down (or two, or three), we’re likely not going to face immediate financial penalties, miss trading deadlines, or cause general mayhem and chaos. We don’t have 24 hour support for development systems, and even when things do happen during normal business hours, our response is usually along the lines of “OK, we’ll look into it when we have a chance”, unless there’s some deliverable in jeopardy.

Sidenote: this isn’t always true. I’ve seen so-called “Development” systems turn into mission-critical ones because the funding to productionize the product dried up, and before you know it, you’ve got developers supporting an enterprise platform running on a dusty server sitting in the data center floor with a note saying “Do not unplug without calling ext xxxx, MISSION CRITICAL” stuck to it. Don’t laugh too hard, it’s probably more common than you think.

On the other hand, let’s consider this scenario:

A development team has been working on a massive, multi-man-month effort to launch a new flagship product for the company. It’s coming down to the wire, and people are working long hours to get the deliverable out. Multiple clients have signed contracts based on the availability of this product, with millions of dollars in revenue streams on the line.

Suddenly, for whatever reason, the main development database gets corrupted or goes offline. How isn’t terribly important, other than to say in my (somewhat short) experience, I’ve concluded that the shoddy setup of Dev environments lends itself more to corruption and loss of data, so it’s eminently possible. Regardless, the database is gone, along with all the code, logic, and data within it.

Let’s assume there are no backups available. What are our options to recover the system?

  1. Rebuild from scratch, using script in source control.

    This could work, but it assumes that developers have been incredibly disciplined about keeping all their changes and work checked in. This is something that I’ve grown to be very skeptical about in my time. Also, any sample data (whether generated or loaded from actual future sources) would not be recovered, and would have to be re-created or re-loaded.

  2. Simply refresh all development environments from Production.

    This assumes that a) there is a production copy available, which if this is a new product, is not likely; b) there are no auditory or security concerns around letting developers have full access to production data (assuming no scrub-scripts are available, again something I’ve grown skeptical about); c) there is some way to determine the difference between the production system and the state of the codebase at the time the database was lost.

Now imagine the pressure that would be coming down on the heads of the developers and their managers, as well as the downstream teams that own the development environment. Clients are breathing down the necks of the business, who is, naturally, passing the pressure right down the chain to the technology groups. Everyone is scrambling madly, trying to get things back on track. Millions are on the line, and clients are threatening to pull out of their contracts if the launch isn’t delivered on time.

Now, how’s your blood pressure? Pulse? Probably just a little elevated; I know mine sure went up when I first imagined this scenario shortly after taking over the Dev systems. I already get a lot of pressure from teams to turn around code reviews and new builds; just think of how crazy they would be going if their whole system disappeared!

I use this little visualization exercise whenever I’m talking to the development teams about the need for me to have tighter control around their systems, and to justify the effort required to keep the backups running and other maintenance going behind the scenes. Almost without fail, their eyes get about as wide a tea saucers, and their demands of me dropping what I’m doing to review their (usually horrible) T-SQL code quiet. Occasionally some folks still balk and don’t get it; that’s fine, I just tell them this strategy is mandated by management (which it is), and leave it there.

So how do we ensure that we’ve got a solid availability plan in place in our Development systems, while still balancing the classic “It’s Development, we don’t care / no one wants to spend any money / time / effort”? Since I’m still struggling with that question myself, I thought I’d ask for everyone’s opinion on the matter. In the next post in the series, I’m going to lay out a poll to see how the community responds to that question. Then, I’m going walk through some options in terms of cost, effort, and whether it meets the needs of the customer (the developers).

Be Sociable, Share!

Comments (3)

[...] part 1 of the series, we went through a little visualization exercise in order to feel what it would be [...]

[...] Part 1 – an introduction to why dismissing the need for HA in development / test systems is not necessarily the right decision. [...]

[...] Part 1 – an introduction to why dismissing the need for HA in development / test systems is not necessarily the right decision. [...]

Write a comment