Posted by Josh | Posted in SQL Server | Posted on 21-02-2011
Tags: indexes, SQL Server, t-sql tuning
Recently I was engaged at work to do some performance tuning on an old legacy application that was being brought up to snuff. I used the wonderful RML Utility toolset to find the most costly statements for the application’s workload, hoping to find some simple, low-hanging fruit such as missing indexes. While I did find some of those items, what I also found was a whole lot of really complex, over-engineered code. Here’s a rough reproduction (obviously I can’t post the real thing) of an example:
SELECT am.AccountID, am.FundType,
    SUM(am.AccountValue) / MAX(ts.NumDays) AvgAccountValue,
    SUM(bp.FeeAmt) / COUNT(DISTINCT am.RecordDate) AvgFeeAmt
FROM AccountMaster am JOIN TotalSum ts
    ON am.FundType = ts.FundType
JOIN BasisPoint bp
    ON am.FundType = bp.FundType
    AND am.RecordDate = bp.RecordDate
WHERE ClientID = 'AAA'
GROUP BY am.AccountID, am.FundType
At its core the statement seems fairly trivial: calculate an average balance for all accounts for client ‘AAA’ over their lifetime, along with some aggregate fee information. The execution plan was ugly, though: huge table spools and hash joins, with logical reads well over 2 million.
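For context, figures like the 2 million logical reads above come from SQL Server’s per-statement I/O statistics, which you can capture in any session like so:

```sql
-- Report logical/physical reads per table for each statement in this session
SET STATISTICS IO ON;

-- ... run the statement under investigation here ...

SET STATISTICS IO OFF;
```

The reads show up per table in the Messages tab, which makes before/after comparisons of a rewrite straightforward.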
I spent several hours digging into this, trying to find any possible optimizations. There were no obvious changes to be made, such as index additions or statistics modifications, and after some time I was ready to give up. Then, as I examined the database more closely, I noticed two important things:
- The AccountMaster table’s primary key was on the ClientID, AccountID, and RecordDate columns. Meaning, of course, that there could not be multiple rows with the same combination of those columns. This made the DISTINCT clause above redundant, since there could only be one occurrence of each RecordDate within a given ClientID - AccountID combination (and we were selecting a single ClientID value and grouping by the AccountID column).
- The TotalSum table was dynamically populated by an earlier step in the same stored procedure as the trouble statement. Because of how it was done, there was no way there could be more than one value of NumDays for any given FundType key. I put a unique index on it, and sure enough no key violations occurred. This meant that the MAX aggregate function above was also worthless, since there would be only one NumDays value for each FundType. Instead I moved the NumDays column into the GROUP BY clause.
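Putting both changes together, the rewritten statement looked roughly like this. This is a sketch based on the reproduction above, not the real production code, and the index name is one I’ve made up for illustration:

```sql
-- Hypothetical unique index proving there is one NumDays per FundType
CREATE UNIQUE INDEX IX_TotalSum_FundType ON TotalSum (FundType);

-- Rewritten statement: DISTINCT and MAX dropped,
-- NumDays moved into the GROUP BY so it can be referenced directly
SELECT am.AccountID, am.FundType,
    SUM(am.AccountValue) / ts.NumDays AvgAccountValue,
    SUM(bp.FeeAmt) / COUNT(am.RecordDate) AvgFeeAmt
FROM AccountMaster am JOIN TotalSum ts
    ON am.FundType = ts.FundType
JOIN BasisPoint bp
    ON am.FundType = bp.FundType
    AND am.RecordDate = bp.RecordDate
WHERE am.ClientID = 'AAA'
GROUP BY am.AccountID, am.FundType, ts.NumDays
```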
Lo and behold, those two changes made a world of difference in the statement’s execution. Logical reads went down to under 100,000 total, a reduction of over 90%. Naturally I had to verify that the results were the same, but, as expected, they were.
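One mechanical way to do that verification, if you capture each version’s output into a table, is a pair of EXCEPT queries; the temp-table names here are hypothetical stand-ins for wherever you stash the two result sets:

```sql
-- Rows produced by the original query but not the rewrite
SELECT * FROM #OldResults
EXCEPT
SELECT * FROM #NewResults;

-- Rows produced by the rewrite but not the original
SELECT * FROM #NewResults
EXCEPT
SELECT * FROM #OldResults;
```

If both queries return zero rows, the two statements produced identical result sets (EXCEPT is set-based, so exact duplicate row counts would need a separate check).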
The two lessons I take away from this are:
Sometimes SQL’s Optimization Engine Is Stupid
Ok, that’s probably putting it a little harshly. But still, you might think that SQL Server would be smart enough to see “Oh, because of the underlying database structure, those DISTINCT and MAX clauses aren’t needed, so I’m going to ignore them.” Then again, computers are only as smart as the people writing code for them, and the optimizer does pretty frakkin’ spectacularly for the most part, so I won’t complain.
Know What Your Database Structure Means
I’m guessing that this code was either written before the database constraints were in place, or written by someone who either a) didn’t take the time to look at the constraints and indexes in place, or b) simply didn’t understand what they meant. In the first case, this should have been caught as part of reviewing the affected objects when making changes to the base table structure. As for the second… well, don’t get me started on people writing SQL code who don’t understand what it means. That’s one rant I could go all day on.