It Pays To Understand Your Database

Posted by Josh | Posted in SQL Server | Posted on 21-02-2011

Tags: , ,

0

Recently I was engaged at work to do some performance tuning on an old legacy application that was being brought up to snuff. I used the wonderful RML Utility toolset to find the most costly statements for the application’s workload, hoping to find some simple, low-hanging fruit such as missing indexes. While I did find some of those items, what I also found was a whole lot of really complex, over-engineered code. Here’s a rough reproduction (obviously I can’t post the real thing) of an example:

At its base the statement seems fairly trivial: calculate an average balance for all accounts for client ‘AAA’ over their lifetime, along with some aggregate fee information. The execution plan was ugly though; some huge table spools and hash joins, with logical reads well over 2 million.

I spent several hours digging into this to try and determine any optimizations. There were no obvious changes to be made, such as index additions or statistics modification, and after some time I was ready to give up. Then, as I was examining the database closer, I noticed two important items:

  1. The AccountMaster table’s primary key was on the ClientID, AccountID, and RecordDate columns. Meaning, of course, that there could not be multiple rows with the same combination of those columns. This made the DISTINCT clause above redundant, since there could only be one occurrence of each RecordDate value per ClientID - AccountID combination (and we were selecting a ClientID value and grouping by the AccountID value).
  2. The TotalSum table was dynamically populated by an earlier step in the same stored procedure as the trouble statement. Because of how it was done, there was no way there would be more than one value of NumDays for any given FundType key. I put a unique index on it, and sure enough no key violations occurred. This means that the MAX aggregate function above was also worthless, since there would be only one NumDays value per FundType. Instead I moved the NumDays column into the GROUP BY clause.

Low and behold, those two changes made a world of difference in the statement’s execution. Logical reads went down to under 100,000 total, a reduction of over 90%. Naturally I had to verify that the results were the same, but, as expected, they were.

The two lessons I take away from this are:

Sometime SQL’s Optimization Engine Is Stupid

Ok, that’s probably putting it a little harshly. But still, you might think that SQL Server would be smart enough to see “Oh, because of the underlying database structure, those DISTINCT and MAX clauses aren’t needed, so I’m going to ignore them.” Then again, computers are only as smart as the people writing code for them, and the optimizer does pretty frakkin’ spectacularly for the most part, so I won’t complain.

Know What Your Database Structure Means

I’m guessing that this code was either written before the database constraints were in place, or written by someone who didn’t a) take the time to look at the constraints and indexes in place, or b) simply didn’t understand what they meant. In the first case, this should have been caught as part of reviewing the affected objects when making changes to the base table structure. In the case of the second… well, don’t get me started on people writing SQL code who don’t understand what it means. That’s one rant I could go all day on.

SSMS Tools Custom Script For Index Information

Posted by Josh | Posted in SQL Server | Posted on 14-01-2011

Tags: , , ,

5

The other day I was having a chat with a developer around our SQL coding standards, specifically the naming conventions. He made an interesting point: the name shouldn’t matter, because it’s the metadata ( what table its on, the index key columns, included columns, etc) that matters. My response was that while that’s true, it’s helpful as a production DBA to have the name, which is readily available via a few clicks in SSMS, be a meaningful one that gives you some inkling of what the index is. This is especially true at 3AM when you’re troubleshooting issues. His response: write a script that you can run to give the information and use that instead of the GUI.

And while I’m not switching my position on naming standards (Have you seen SSMS’s default naming for indexes? Ugh…), I’m up for any challenge, so write a script I did. I actually use this with SSMS Tool‘s “Custom Script” functionality so when I right-click a table, it is easily run.

It returns some basic metadata about the index definition, such as the name, type, and columns (both included and key). I’ll be trying this out and undoubtedly adding more columns in the coming weeks, but wanted to share it.

What useful SSMS ToolPack custom scripts do you have?

Covering Index = Epic Win, Part 3

Posted by Josh | Posted in SQL Server, The Rookie DBA | Posted on 23-06-2010

Tags: , ,

0

In part 2 of this series, we looked at the lengths SQL Server will go just to avoid the dreaded bookmark lookup, including ignoring an index altogether and choosing instead to scan the entire clustered index. But, we also showed how that decision was, in fact, a smart one, due to it resulting in fewer logical IOs.

But let’s face it, neither path (bookmark lookup or clustered index scan) is very efficient or scalable. So, what can we do to give SQL some better tools?

The answer, it seems, is surprisingly simple: create a covering index. That is, we need to create (or modify an existing one) an index that contains all the columns required for the query. But how do we determine what those columns are?

The easiest way I’ve found is to examine our execution plan, specifically the Bookmark Lookup node. Right click on the icon and choose “Properties”. Then, in the properties window, look for a line called “Output List”. Note: if your plan includes a scan, table or index, then use that operator as your root node instead.

If you click on that little “…” icon, what you’ll see is a nice list of the columns being requested by the query:

In this case, we see that the query is requesting the AccountID and Amount columns. So, how do we add these to the index?

Well, there are two options: 1) add them to the index key itself, 2) add them as included columns. For reasons I’ll illustrate in a moment we’ll see that number 2 is the better option, at least in this specific case.

First, adding them to the key:

Now, let’s run the same SELECT query:

And the resulting execution plan:

…and STATISTICS IO:

Both clearly an improvement over the previous results seen in part two (lookup and 217 logical reads).

“Great!” you say, “I’ll go start adding columns to my indexes right now!”

Hold on there… this isn’t all rainbows yet. You see, by adding those columns to the index key itself, we could be incurring additional overhead in terms of index maintenance. Plus, I’m pretty sure if index gurus like Paul Randal (blog | twitter) catch you adding things like large VARCHAR columns to an index key, without them actually being used in filter operations, they just might hunt you down and… well, let’s just say I doubt it’d be pretty. Note: if anyone out there can tell me how to show the difference in behavior via an execution plan or STATISTICS IO I will update this page and give them due props, as I was not able to get any difference between the two plans.

So now let’s look at option two: add the columns using the INCLUDE functionality in SQL 2005+:

Running our SELECT statement gives an identical execution plan and number of logical IOs as when the columns are placed in the index key. But now we’ve skipped making our index key wider than necessary. Truly a great feature.

Before I close out the series, I should mention that even this second path doesn’t appear to be completely devoid of peril. By adding these INCLUDEd columns, you will be adding to the amount of storage required by the index. For example, the following information was returned by running a query against the sys.dm_db_index_physical_stats DMV, using the page_count column for indexes on our tmpA table:

For explanatory purposes, the indexes are named as follows:

  • NCI_tempA_TransDate_inc – key = TransDate, included = AccountID, Amount
  • NCI_tempA_TransDate_inckey – key = TransDate, AccountID, Amount
  • NCI_tmpA_TransDate – key = TransDate, no included columns

So as you can see, adding INCLUDED columns to the key does indeed increase the storage required by the index. So do keep that in mind when you perform this kind of tuning work, especially on larger databases.

And so we’ve come to the end of our little series. Hopefully you will find this information useful the next time you are tuning a query and notice our little friend, the bookmark lookup (or even a index / tables scans). If anyone has any great stories to share on how this functionality helped them solve a production issue (or prevent one in development), I’d love to hear it in the comments. Props to anyone who can beat my best of reducing a query’s run time from 27 minutes to 44 seconds!

INSERT dbo.tmpA (AccountID, Amount, TransDate)
SELECT  TOP 100 name, CAST(object_id as MONEY), ’1/5/2010′
FROM sys.columns

Covering Index = Epic Win, Part 2

Posted by Josh | Posted in SQL Server, The Rookie DBA | Posted on 15-06-2010

Tags: , , , ,

1

In part one of this series, we looked at how a non-clustered index can be made into a real performance bottleneck because of a nefarious little fellow called a “bookmark lookup”. We showed how this operation, which is how SQL Server retrieves the requested columns not part of an index, can result in excess IOs and generally poorly scaling performance.

In this part, we’ll look at how SQL Server will even go so far as to ignore an index completely, just to avoid our little friend the lookup.

First off, let’s look again at our query execution plan from our SELECT query shown in part one:

We see how the most expensive part of the query is the “Key Lookup” operation, by a margin on 99 to 1. Pretty obvious that it would be the bottleneck in this case.

Now this particular query returned exactly one-hundred rows. Now let’s see what happens if we increase the number of returned rows a little bit.

This query returns around one thousand results. Let’s look at the execution plan here:


Whoa, wait a second. Why is SQL Server using a clustered index scan operation? Don’t we have an index on the column being filtered on?

Well, yes we do. But see, SQL is generally pretty smart about choosing how to execute it’s queries (though at times I might debate that statement). And while I can’t speak for how it’s making these decisions, I would hedge a bet that one of the deciding factors in constructing an execution plan is IO cost. So, let’s look at the output of STATISTICS IO for this query.

549 logical reads total. Now, let’s force SQL to use our index by putting an index hint in the query:

Now you’ll have to take my word for it, but the execution plan now looks identical to the one we saw in part one. However, let’s look at the STATISTICS IO output:

Well that looks pretty bad. The number of logical IOs increased from 549 to over two-thousand. Now we see why SQL might choose to simply scan the clustered index.

So, why would it choose to use the index in the first case? Well, let’s look at STATISTICS IO for our original query:

Ok, so a total of 217 logical reads, and the index is being used. Now, let’s force it to use the clustered index and see what we get.

See? In the case of our first query, SQL again made the right choice, choosing the operation which resulted in only 249 logical IOs, versus 549 for the clustered index scan.

Somewhere, a SQL developer is probably yelling “See?! Why do you doubt me?!”

OK SQL, we’ve now shown that you make the right call on what operation will be the least expensive to service the query we gave you. But, that doesn’t mean we can’t give you better tools to do the job. That’s what we’ll be covering in part three of our series. I promise we’ll get to the “epic win” part there too.