Archive for the 'SQL / Programming' Category

Covering Index = Epic Win, Part 3

In part 2 of this series, we looked at the lengths SQL Server will go just to avoid the dreaded bookmark lookup, including ignoring an index altogether and choosing instead to scan the entire clustered index. But, we also showed how that decision was, in fact, a smart one, due to it resulting in fewer logical IOs.

But let’s face it, neither path (bookmark lookup or clustered index scan) is very efficient or scalable. So, what can we do to give SQL some better tools?

The answer, it seems, is surprisingly simple: create a covering index. That is, we need to create (or modify an existing one) an index that contains all the columns required for the query. But how do we determine what those columns are?

The easiest way I’ve found is to examine our execution plan, specifically the Bookmark Lookup node. Right click on the icon and choose “Properties”. Then, in the properties window, look for a line called “Output List”. Note: if your plan includes a scan, table or index, then use that operator as your root node instead.

If you click on that little “…” icon, what you’ll see is a nice list of the columns being requested by the query:

In this case, we see that the query is requesting the AccountID and Amount columns. So, how do we add these to the index?

Well, there are two options: 1) add them to the index key itself, 2) add them as included columns. For reasons I’ll illustrate in a moment we’ll see that number 2 is the better option, at least in this specific case.

First, adding them to the key:


DROP INDEX NCI_tempA_TransDate ON dbo.tmpA;
GO
CREATE INDEX NCI_tempA_TransDate ON dbo.tmpA (TransDate, AccountID, Amount);

Now, let’s run the same SELECT query:


SELECT AccountID, Amount FROM dbo.tmpA WHERE TransDate = '1/1/2010';

And the resulting execution plan:

…and STATISTICS IO:

Table 'tmpA'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Both clearly an improvement over the previous results seen in part two (lookup and 217 logical reads).

“Great!” you say, “I’ll go start adding columns to my indexes right now!”

Hold on there… this isn’t all rainbows yet. You see, by adding those columns to the index key itself, we could be incurring additional overhead in terms of index maintenance. Plus, I’m pretty sure if index gurus like Paul Randal (blog | twitter) catch you adding things like large VARCHAR columns to an index key, without them actually being used in filter operations, they just might hunt you down and… well, let’s just say I doubt it’d be pretty. Note: if anyone out there can tell me how to show the difference in behavior via an execution plan or STATISTICS IO I will update this page and give them due props, as I was not able to get any difference between the two plans.

So now let’s look at option two: add the columns using the INCLUDE functionality in SQL 2005+:


DROP INDEX NCI_tempA_TransDate ON dbo.tmpA;
GO
CREATE INDEX NCI_tempA_TransDate_inc ON dbo.tmpA (TransDate)
INCLUDE (AccountID, Amount);

Running our SELECT statement gives an identical execution plan and number of logical IOs as when the columns are placed in the index key. But now we’ve skipped making our index key wider than necessary. Truly a great feature.

Before I close out the series, I should mention that even this second path doesn’t appear to be completely devoid of peril. By adding these INCLUDEd columns, you will be adding to the amount of storage required by the index. For example, the following information was returned by running a query against the sys.dm_db_index_physical_stats DMV, using the page_count column for indexes on our tmpA table:

For explanatory purposes, the indexes are named as follows:

  • NCI_tempA_TransDate_inc – key = TransDate, included = AccountID, Amount
  • NCI_tempA_TransDate_inckey – key = TransDate, AccountID, Amount
  • NCI_tmpA_TransDate – key = TransDate, no included columns

So as you can see, adding INCLUDED columns to the key does indeed increase the storage required by the index. So do keep that in mind when you perform this kind of tuning work, especially on larger databases.

And so we’ve come to the end of our little series. Hopefully you will find this information useful the next time you are tuning a query and notice our little friend, the bookmark lookup (or even a index / tables scans). If anyone has any great stories to share on how this functionality helped them solve a production issue (or prevent one in development), I’d love to hear it in the comments. Props to anyone who can beat my best of reducing a query’s run time from 27 minutes to 44 seconds!

INSERT dbo.tmpA (AccountID, Amount, TransDate)
SELECT  TOP 100 name, CAST(object_id as MONEY), ‘1/5/2010′
FROM sys.columns
Share and Enjoy:
  • Digg
  • StumbleUpon
  • del.icio.us
  • Facebook
  • LinkedIn
  • Google Bookmarks
  • Technorati
  • FriendFeed
  • Slashdot
  • Twitter

Covering Index = Epic Win, Part 2

In part one of this series, we looked at how a non-clustered index can be made into a real performance bottleneck because of a nefarious little fellow called a “bookmark lookup”. We showed how this operation, which is how SQL Server retrieves the requested columns not part of an index, can result in excess IOs and generally poorly scaling performance.

In this part, we’ll look at how SQL Server will even go so far as to ignore an index completely, just to avoid our little friend the lookup.

First off, let’s look again at our query execution plan from our SELECT query shown in part one:

We see how the most expensive part of the query is the “Key Lookup” operation, by a margin on 99 to 1. Pretty obvious that it would be the bottleneck in this case.

Now this particular query returned exactly one-hundred rows. Now let’s see what happens if we increase the number of returned rows a little bit.


SELECT AccountID, Amount FROM dbo.tmpA WHERE TransDate = '1/2/2010';

This query returns around one thousand results. Let’s look at the execution plan here:


Whoa, wait a second. Why is SQL Server using a clustered index scan operation? Don’t we have an index on the column being filtered on?

Well, yes we do. But see, SQL is generally pretty smart about choosing how to execute it’s queries (though at times I might debate that statement). And while I can’t speak for how it’s making these decisions, I would hedge a bet that one of the deciding factors in constructing an execution plan is IO cost. So, let’s look at the output of STATISTICS IO for this query.

Table 'tmpA'. Scan count 1, logical reads 549, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

549 logical reads total. Now, let’s force SQL to use our index by putting an index hint in the query:


SELECT AccountID, Amount FROM dbo.tmpA WITH (INDEX = NCI_tmpA_TransDate) WHERE TransDate = '1/2/2010';

Now you’ll have to take my word for it, but the execution plan now looks identical to the one we saw in part one. However, let’s look at the STATISTICS IO output:

Table 'tmpA'. Scan count 1, logical reads 2076, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Well that looks pretty bad. The number of logical IOs increased from 549 to over two-thousand. Now we see why SQL might choose to simply scan the clustered index.

So, why would it choose to use the index in the first case? Well, let’s look at STATISTICS IO for our original query:

SELECT AccountID, Amount FROM dbo.tmpA WHERE TransDate = '1/1/2010';
Table 'tmpA'. Scan count 1, logical reads 217, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Ok, so a total of 217 logical reads, and the index is being used. Now, let’s force it to use the clustered index and see what we get.

SELECT AccountID, Amount FROM dbo.tmpA WITH (INDEX = PK_CI_tmpA) WHERE TransDate = '1/1/2010';
Table 'tmpA'. Scan count 1, logical reads 549, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

See? In the case of our first query, SQL again made the right choice, choosing the operation which resulted in only 249 logical IOs, versus 549 for the clustered index scan.

Somewhere, a SQL developer is probably yelling “See?! Why do you doubt me?!”

OK SQL, we’ve now shown that you make the right call on what operation will be the least expensive to service the query we gave you. But, that doesn’t mean we can’t give you better tools to do the job. That’s what we’ll be covering in part three of our series. I promise we’ll get to the “epic win” part there too.

Share and Enjoy:
  • Digg
  • StumbleUpon
  • del.icio.us
  • Facebook
  • LinkedIn
  • Google Bookmarks
  • Technorati
  • FriendFeed
  • Slashdot
  • Twitter

Covering Index = Epic Win, Part I

As if I needed yet another reason to push all of the products I support on to SQL 2005 and up, but in the last few days I’ve been saved at least three times by the wonder that is the INCLUDE columns feature.

From BOL:

INCLUDE (column [ ,... n ] )
Specifies the non-key columns to be added to the leaf level of the nonclustered index. The nonclustered index can be unique or non-unique.

Column names cannot be repeated in the INCLUDE list and cannot be used simultaneously as both key and non-key columns. Nonclustered indexes always contain the clustered index columns if a clustered index is defined on the table. For more information, see Index with Included Columns.

All data types are allowed except text, ntext, and image. The index must be created or rebuilt offline (ONLINE = OFF) if any one of the specified non-key columns are varchar(max), nvarchar(max), or varbinary(max) data types.

Computed columns that are deterministic and either precise or imprecise can be included columns. Computed columns derived from image, ntext, text, varchar(max), nvarchar(max), varbinary(max), and xml data types can be included in non-key columns as long as the computed column data types is allowable as an included column.

In all three cases that I worked, the issue at hand was that some older, rather dusty stored procedures were performing badly. In one case, the procedure took almost thirty minutes to return a mere six thousand rows to be used in a report. Needless to say, because we all know how important reports are to those who use them, this was suddenly deemed totally unacceptable and had to be investigated at once. Luckily, I do love performance tuning.

So upon examining the execution plan of the query in question, the main points of note were:

  1. The majority of the total query cost was taken up by clustered index scan and bookmark lookup operations against large tables.
  2. In all cases the predicate of those operations was a date range (these are tables that hold various kinds of transactions).
  3. In one case, the query executed using a loop join with the transactional table being both the lower (multiple execution) and upper part of the join, courtesy of an index scan and clustered index scan. Notably, the estimated number of rows from the top portion was only one, whereas the actual number was something in the thousands.

In this post (and probably the next one or two), I’m going to go about reproducing the issues described above and show how with some simple index tuning, problems were all but eliminated. That’s right, no code changes. While I am certainly a capable T-SQL coder, I am just as loathe as anyone to crack open these ancient stored procs, whose owners are long gone and yet still are highly critical in daily operations. I don’t know about other DBAs, but my work is in production support, not development and testing code.

Setup Work

First we’re going to create a test table, put an index on it, and put some data in it:


IF OBJECT_ID('dbo.tmpA') IS NOT NULL
 DROP TABLE dbo.tmpA;
GO

CREATE TABLE tmpA
(
 ID INT IDENTITY(1,1),
 TransDate DATETIME,
 AccountID VARCHAR(255),
 Amount MONEY,
 CONSTRAINT PK_CI_tmpA PRIMARY KEY CLUSTERED (ID)
);
GO

CREATE NONCLUSTERED INDEX NCI_tmpA_TransDate ON dbo.tmpA (TransDate);

-- Insert some records --
INSERT  tmpA
(
 TransDate,
 AccountID,
 Amount
)
SELECT  TOP 100
 '1-1-2010',
 b.name,
 CAST(a.object_id AS MONEY)
FROM    sys.columns a, sys.columns b
ORDER BY b.object_id;

INSERT  tmpA
(
 TransDate,
 AccountID,
 Amount
)
SELECT  TOP 1000
 '1-2-2010',
 b.name,
 CAST(a.object_id AS MONEY)
FROM    sys.columns a, sys.columns b
ORDER BY b.object_id;

INSERT  tmpA
(
 TransDate,
 AccountID,
 Amount
)
SELECT  TOP 10000
 '1-3-2010',
 b.name,
 CAST(a.object_id AS MONEY)
FROM    sys.columns a, sys.columns b
ORDER BY b.object_id;

INSERT  tmpA
(
 TransDate,
 AccountID,
 Amount
)
SELECT  TOP 100000
 '1-4-2010',
 b.name,
 CAST(a.object_id AS MONEY)
FROM    sys.columns a, sys.columns b
ORDER BY b.object_id;
GO

Now, let’s execute a little query. Make sure you have the “Display actual execution plan” option in SSMS turned on.


SELECT AccountID, Amount FROM dbo.tmpA WHERE TransDate = '1/1/2010';

Now, take a look at the execution plan. What you’ll see is probably something similar to this:

See that little “Key Lookup” operator? Notice how it is consuming a whopping 99% of the total query cost? That my friends, is a bad little son of a…. well, you get my drift.

From Grant Fritchey (twitter | blog) and Sajal Dam’s (anyone know if Sajal is on Twitter or has a blog? I couldn’t find one but will gladly put it here) outstanding book SQL 2008 Query Performance Tuning Distilled (p. 163 to be exact and give proper credit):

A major overhead associated with nonclustered indexes is the cost of excessive key lookups, commonly known as bookmark lookups, which are a mechanism to navigate from a nonclustered index row to the corresponding data row in the clustered index or the base table.

When a SQL query returns a small number of rows, the optimizer can use the nonclustered index, if available, on the column(s) in the WHERE or JOIN clause to retrieve the data. If the query refers to columns that are not part of the nonclustered index used to retrieve the data, then navigation is required from the index row to the data row in the table to access these columns.

Now this behavior was only seen in a few of the execution cases I was studying, namely those with smaller result sets (that makes sense I suppose given what was stated above about “small number of rows”). In most of the other examples, instead of this bookmark lookup, I was seeing a simple clustered index scan. Nonetheless, I did want to point out this (commonly seen in my somewhat small experience) kind of behavior, since we know it can be a major performance bottleneck. It’s also a good example of how you cannot just through an index on a oft-queried column and assume that everything will be fine from there out.

Next time we’ll see how even more useless indexes can be, but also begin talking about how we can make them worth their weight in gold.

Share and Enjoy:
  • Digg
  • StumbleUpon
  • del.icio.us
  • Facebook
  • LinkedIn
  • Google Bookmarks
  • Technorati
  • FriendFeed
  • Slashdot
  • Twitter

Narrow Focus: Blessing Or Curse?

I think pretty much everyone who knows me would say that one of my defining characteristics is my singular focus. Once I put my mind to solving some problem or accomplishing something, I damn well poke at it until it’s finished. Now, in many cases this trait has served me quite well; it’s what allowed me to learn programming and SQL Server with no formal training or even so much as a seminar (thank you Internet and amazon.com), and why I consider myself an above average problem solver.

One of my favorite teachers once told me that at times I “missed the forest for the trees.” He went on to explain that at times my work tended to find it’s conclusion too early, then ignore all other points of view. Narrow focus is often good in writing, but not when it excludes other angles of approach completely.

The same could be said of how I function as a DBA. Take a project I am working on: the design and build of an enterprise data warehouse, storing performance data for every SQL server in our environment. It’s been fascinating and a tremendous learning experience, but also at time a real challenge.

Early on one of the requirements was to collect performance data on a second-by-second basis, essentially forming a “baseline” of data to review and use for later comparison. Now being the aggressive lad I am, I expanded upon this as follows: the warehouse should now store second-by-sec0nd data for every server at all times, with a retention of 2 years. Yes, you read that correctly. Two years of data, with points at every second for fifty plus servers. That’s a lot of data folks (think 1,800 rows per minute, per server, or around 129,600,000 rows per day).

So, I set out to prove it could be done. Here’s a brief list of some of the challenges I’ve hit along the way:

  • PerfMon’s built-in direct-to-SQL functionality scales downright awfully. The structure of the tables would make any decent DBA shudder (the primary key of the main, and largest, table has a GUID as it’s first column, as an example), the use of bulk inserts (not using MS’s own optimization schemes), and statistics issues to name a few.
  • After abandoning that approach we are going with dumping the performance data into flat files, which are then picked up by a SSIS package at intervals, which required a custom script component acting as a data flow source. This is due to the fact that PerfMon records its data in .csv files with a dynamic and unpredictable number and order of columns. Custom logic had to be written to essentially perform a reverse pivot of the data into a more normalized form.
  • Even fully optimized, the package simply could not keep up with the incoming flood of files. They would eventually pile up and consume disk space, not to mention having stale data in the warehouse. This was attacked by custom coding a script task in a parent SSIS package that would dynamically spawn multiple threads running a child package to import one file each, up to ten at once (I’m shocked by the way that this isn’t either a standard part of SSIS or that a suitable aftermarket component could not be found). Naturally this promptly brought the server to its knees, being a poor little VM with only four CPUs. But the files did import faster!

Now, throughout all of this, keep in mind that I was going far beyond the original and still stated requirements given to me. Why? Because dammit, I was out to prove that such a huge amount of data could be handled by a little VM running SQL 2008 Standard Edition.

Well, today I cried Uncle. In an effort to actually get some usable data into the system, we’re (read, I’m) reducing our goals to only baselining one server at a time, with others collecting a mere four data points per minute. And damn, I am disappointed we haven’t gotten there yet. But you can believe I’m not giving up. But at the same time, I look back and consider all the time I spent chasing what, pride?

My old teacher was right: I need to stop staring at the one big tree in front of me and enjoy the woods a bit more. Luckily, there’s this little methodology I follow called GTD, and something tells me it’s going to be very useful in this upward movement of focus.

Share and Enjoy:
  • Digg
  • StumbleUpon
  • del.icio.us
  • Facebook
  • LinkedIn
  • Google Bookmarks
  • Technorati
  • FriendFeed
  • Slashdot
  • Twitter

What Is The Role Of A DBA?

Lately this question has been a bit of a hot discussion topic at work. Lots of folks have differing opinions, many of which have merit. So I’m curious, what do folks think the true primary role of a DBA is?

For those of you who might answer “it depends on what the client wants”; that’s a perfectly true answer.  If that’s your opinion, I’d ask that you pick the poll answer closest to what is true for you and your clients.

And if anyone has suggestions on other roles, please leave them in the comments and I’ll gladly look at adding them.

What is the primary role of a DBA?

View Results

Loading ... Loading ...
Share and Enjoy:
  • Digg
  • StumbleUpon
  • del.icio.us
  • Facebook
  • LinkedIn
  • Google Bookmarks
  • Technorati
  • FriendFeed
  • Slashdot
  • Twitter

The Rookie DBA

So in a bit of personal news, I was recently moved into a new role at work, namely that of a full time database administrator. In my previous role I had quite a bit of interaction with my new team, and gathered a decent amount of knowledge around the workings of SQL Server, at least from the application interacting parts.

It’s been two weeks now, and I must say the change is startling. Already I’ve learned more than I have in the last six months in my old role. There is always something new to learn about SQL, be it the inner workings of the underlying engine, the new features such as compression, encryption, and partitioning (not a new feature, but made much better from what I can see), or digging into some new problem we are facing. I am kept constantly busy, with new demands being thrown at me at a pace even higher than the one I previously faced. A month ago I would have said that wasn’t possible; now I see that I was barely seeing the tip of the amount of effort these guys put into their job when I was working with them.

I’m blessed to have a team of seasoned veterans to work with. Combined they have more than 30 years of experience with SQL Server, networking, server administration, and other areas of IT. They have welcomed me in and made sure to teach me at every possible turn. I’m tremendously grateful to them, and can’t wait to continue learning.

As I do learn and grow into this new job, I thought it might be interesting to post my thoughts and “lessons learned” as time goes on. Who knows; maybe years from now when I myself am a grizzled old pro, I’ll look back on these musing and laugh.

Share and Enjoy:
  • Digg
  • StumbleUpon
  • del.icio.us
  • Facebook
  • LinkedIn
  • Google Bookmarks
  • Technorati
  • FriendFeed
  • Slashdot
  • Twitter

SSIS Blowing Up Your Log?

If you have an SSIS package that is causing your transaction log to explode, one of the first places to look would be at the ‘FastLoadMaxInsertCommitSize’ property (this assumes you are using an OLEDB destination). As its name suggests, this property determines the size of batches that SSIS will contain within a transaction before committing. I.e., if you set this to 100,000, SSIS will attempt to insert your data in batches of 100,000 rows, with each being contained within it’s own transaction.

By default this is set to zero, which tells SSIS to attempt to contain the entire bulk insert within one transaction. In my case, that meant over ten million rows! Needless to say, my 2GB transaction log filled up in no time. By changing this value to a more manageable 1,000 rows, the package completed without issue.

Note that in some cases containing your operations within a transaction is a good thing. In this case, it was not necessary, as the package is performing a simple bulk transfer to a reporting database, no atomicity is not a concern.

Share and Enjoy:
  • Digg
  • StumbleUpon
  • del.icio.us
  • Facebook
  • LinkedIn
  • Google Bookmarks
  • Technorati
  • FriendFeed
  • Slashdot
  • Twitter

Linked Servers – Good or Bad?

From Linchi Shea’s blog:

In SQL Server, it is rather handy to retrieve data from a different SQL Server instance and use the result locally in another SQL statement for further processing. In theory and in the set purists’ fantasy land, it shouldn’t matter where you get your data or even how you get the data as long as you can use them to further compose a set-based solution.

That is all fine. After all, in a distributed environment the data you want may not be on the same instance where the processing takes place. Well, that is fine until it comes to performance and troubleshooting. If you use linked servers indiscriminately, sooner or later you’ll run into these issues, and they are not pretty.

We have used linked server functionality at various points at work. As Linchi points out, it can be quite useful when you have data distributed across multiple systems. But I have seen cases where it’s caused issues, such as excess network traffic, authentication problems (anyone actually been able to get Kerberos auth to work for a two-hop Windows auth’d linked server?), and general performance malaise.

So what are some alternatives?

  • Replication – true, it may not solve the network issue, and depending upon the type used, it might cause a delay in data transfer. But it would allow for queries to execute against a single server.
  • Static data transfer using SSIS – I would say this would be fine for offline use of data such as reporting. There have been very few cases in my experience where true real-time data has been a hard business requirement, though at times the point has been, shall we say, vigorously argued.
Share and Enjoy:
  • Digg
  • StumbleUpon
  • del.icio.us
  • Facebook
  • LinkedIn
  • Google Bookmarks
  • Technorati
  • FriendFeed
  • Slashdot
  • Twitter

Thoughts From A DW Newbie

Lately at work I’ve been on a bit of a SQL kick, as I ease into a new role as “Data Process Expert”. While this won’t totally replace the other functions of my job (project / implementations manager, production support specialist, occasional code geek), I do see it becoming a large share of my time. Analysis / Reporting Services is a fairly hot topic these days, since we’re engaging in a concerted effort to centralize all reporting in the enterprise. It’s a young initiative, sure to have some bumps along the way, but I am enjoying being a part of it.

Anyway, I thought I would post my thoughts on two new skills I’ve been working on: Analysis Services design and Data Warehousing. If trends at work are indicative of anything elsewhere, BI and data analysis are on a lot of people’s lists. Being a total newbie, I think I’m coming in with a bit of a fresh perspective.

This time I’ll be talking about Data Warehousing. In no particular order, here are my thoughts:

  • Doing DW well is really tough. You have to have both a really good picture at a 30,000 foot level of where all your data is coming from, as well as a runway level view to understand the structure and format of individual fields.
  • Depending on your sources, you may spend a good amount of effort just cleaning up invalid data. In my case, I was aggregating metadata from several document management systems, and when forced into a system with real referential integrity, I was amazed as how many holes there were. Assume your data has gaps and design ways to handle them, whether it’s ignoring them or adding them to some “unknown” category.
  • If you’re aggregating data from different instances of the same software, think from the start about how your going to do so. For instance, are you going to want to separate the data for analysis, or will you want to see it all at once (or both)? Will you load it all in one shot, or separately?
  • This is blurring the lines a bit between this post and a future one (Analysis Services), but determining how granular you want your data to be in terms of time aggregation should be an early task. I had to basically redesign my extract process about halfway through after figuring out one of my data points would be better seen at a minute by minute basis.
  • Once you make that decision, put specific steps in your ETL process to ensure your data fits within those boundaries. You can’t assume because you see the first thousand rows in neat 30 minute intervals that all the data meets that standard. Again, spoken from late game experience.

Overall I’ve gained a great deal of respect for the folks who do good work in data warehousing, and lost respect for those that try and do it with a shoehorn mentality.

Share and Enjoy:
  • Digg
  • StumbleUpon
  • del.icio.us
  • Facebook
  • LinkedIn
  • Google Bookmarks
  • Technorati
  • FriendFeed
  • Slashdot
  • Twitter