It Pays To Understand Your Database
Posted by Josh | Posted in SQL Server | Posted on 21-02-2011
Tags: indexes, SQL Server, t-sql tuning
0
Recently I was engaged at work to do some performance tuning on an old legacy application that was being brought up to snuff. I used the wonderful RML Utility toolset to find the most costly statements for the application’s workload, hoping to find some simple, low-hanging fruit such as missing indexes. While I did find some of those items, what I also found was a whole lot of really complex, over-engineered code. Here’s a rough reproduction (obviously I can’t post the real thing) of an example:
|
1 2 3 4 5 6 7 8 9 10 |
SELECT am.AccountID, SUM(am.AccountValue) / MAX(ts.NumDays) AvgAccountValue, SUM(bp.FeeAmt) / COUNT(DISTINCT am.RecordDate) AvgFeeAmt FROM AccountMaster am JOIN TotalSum ts ON am.FundType = ts.FundType JOIN BasisPoint bp ON am.FundType = bp.FundType AND am.RecordDate = bp.RecordDate WHERE ClientID = 'AAA' GROUP BY am.AccountID, am.FundType |
At its base the statement seems fairly trivial: calculate an average balance for all accounts for client ‘AAA’ over their lifetime, along with some aggregate fee information. The execution plan was ugly though; some huge table spools and hash joins, with logical reads well over 2 million.
I spent several hours digging into this to try and determine any optimizations. There were no obvious changes to be made, such as index additions or statistics modification, and after some time I was ready to give up. Then, as I was examining the database closer, I noticed two important items:
- The AccountMaster table’s primary key was on the ClientID, AccountID, and RecordDate columns. Meaning, of course, that there could not be multiple rows with the same combination of those columns. This made the
DISTINCTclause above redundant, since there could only be one occurrence of eachRecordDatevalue perClientID - AccountIDcombination (and we were selecting aClientIDvalue and grouping by theAccountIDvalue). - The TotalSum table was dynamically populated by an earlier step in the same stored procedure as the trouble statement. Because of how it was done, there was no way there would be more than one value of
NumDaysfor any givenFundTypekey. I put a unique index on it, and sure enough no key violations occurred. This means that theMAXaggregate function above was also worthless, since there would be only oneNumDaysvalue perFundType. Instead I moved theNumDayscolumn into theGROUP BYclause.
Low and behold, those two changes made a world of difference in the statement’s execution. Logical reads went down to under 100,000 total, a reduction of over 90%. Naturally I had to verify that the results were the same, but, as expected, they were.
The two lessons I take away from this are:
Sometime SQL’s Optimization Engine Is Stupid
Ok, that’s probably putting it a little harshly. But still, you might think that SQL Server would be smart enough to see “Oh, because of the underlying database structure, those DISTINCT and MAX clauses aren’t needed, so I’m going to ignore them.” Then again, computers are only as smart as the people writing code for them, and the optimizer does pretty frakkin’ spectacularly for the most part, so I won’t complain.
Know What Your Database Structure Means
I’m guessing that this code was either written before the database constraints were in place, or written by someone who didn’t a) take the time to look at the constraints and indexes in place, or b) simply didn’t understand what they meant. In the first case, this should have been caught as part of reviewing the affected objects when making changes to the base table structure. In the case of the second… well, don’t get me started on people writing SQL code who don’t understand what it means. That’s one rant I could go all day on.






