Fun With Aggregation – TSQL2SDay #16

Posted by Josh | Posted in SQL Server, T-SQL Programming | Posted on 03-08-2011

4


It’s that time again! This month’s T-SQL Tuesday is being hosted by Jes Borland (blog | twitter), and the topic is: Aggregation!

When I think of aggregation as a T-SQL programmer, my first thought is usually “Why aggregate in T-SQL at all? Why not use Analysis Services?” Now, understand I’m in no way saying that one should never use functions like SUM, AVG, COUNT, etc in T-SQL; they can be very useful and perform well when used appropriately. But when I’m in a conversation and people start talking in phrases such as “Average over time” or “Comparison with the previous period”, I will usually jump in and say “Why don’t we use SSAS for this?”

One of my biggest projects at work was the construction of a data warehouse to store collected PerfMon data across all our SQL servers. Here are two common scenarios for how that data would be used:

  1. Baseline analysis and comparison – how busy is the server compared to others in the farm? Compared to itself last month? Last year?
  2. Point-in-time troubleshooting – The users complained that the application was slow at 11:21 AM on Tuesday; was the database server slow or was it something else? Using the collected data we could easily compare one specific period to a rolling average of that same time frame, and see if there were any appreciable differences.

For each of these use cases, I constructed some MDX scripts to easily allow my fellow DBAs to drill into the details of the cubes and get meaningful information back. So, I thought I’d share some of the tricks I used in writing them!

A Little Background

So that you can best understand the scripts, I wanted to briefly summarize the structure of the PerfMon cubes:
The cube consists of one measure and three dimensions:

  • The CounterData measure, which contains all the actual PerfMon data in decimal form.
  • The Calendar dimension, used to tie a piece of counter data back to the time at which it was captured. Contains information such as Date, Day Of Week, Month, Quarter, etc.
  • The Counter dimension, used to tie a piece of counter data back to a specific counter instance across all servers. Contains information such as Server Name, Object Name (e.g. Processor, Memory, Logical Disk, etc), Instance Name (e.g. ‘M:’ for logical drives), and Counter Name (e.g. “Avg Secs / Write”).
  • The Time dimension, used to tie a counter fact back to the specific time of day it was captured at.

Now, with that out of the way, let’s write some MDX!

Compare Two Periods

In this scenario, what we’re doing is looking at averaged data across two distinct periods of time, and finding the differences. We used this both for baseline comparison and for troubleshooting (i.e. to see if a server really was performing worse than average at a given time).

First, we’re going to create two named members:


WITH MEMBER	[CurrentPeriod] AS 
		ROUND
		(
			AVG
			(
				(
					[dim Calendar].[Date Key].[2010-08-08 00:00:00],
					{[dim Time].[Hour].[17]:[dim Time].[Hour].[19]},
					[dim Calendar].[Week Day].[All],
					[Measures].[Average Counter Value]
				)
			), 3
		), FORMAT_STRING = "#,0.000"
	 MEMBER	[ComparePeriod] AS 
		ROUND
		(
			AVG
			(
				(
					{[dim Calendar].[Date Key].[2010-06-01 00:00:00]:[dim Calendar].[Date Key].[2010-08-09 00:00:00]},
					{[dim Time].[Hour].[17]:[dim Time].[Hour].[19]},
					[dim Calendar].[Week Day].[6],
					[Measures].[Average Counter Value]
				)
			), 3
		), FORMAT_STRING = "#,0.000"

These two members are used to find the two average counter values from the two periods desired for comparison. In this case, we’re comparing Friday August 8th from 5PM – 7PM with all Fridays from 5PM – 7PM ranging from June 1st to August 9th. This might be used to compare a perceived “trouble spot” with the average across a long time span of similar periods. You could easily modify this to compare, say, the last three months to the same three month period the year before:

Next, we construct a member that compares the two time periods:

Finally, we can use these calculated members to construct a nice clean grid showing us a bunch of common PerfMon counters over both periods:

I used this query many times to justify adding additional capacity to our farm, or to tell an application team to take a hike and stop blaming my database server for their app’s poor performance.

Comparison With The Farm

What about finding out how one server compares to all the others in the farm? We can do that too.

First, construct a series of calculated members that find the sum total of counter values across the farm, then the count of servers, then calculates the average counter value:

Then, use that calculated member to look at how a particular counter value compares to the rest of the farm:

In this case, I was comparing processor use between the hours of 9AM and 10AM.

In Summary…

So there you have it: two great ways we can use the power of MDX and SSAS to get some very useful data out of an otherwise massive collection of numbers (the main fact table contains over 4 billion rows and counting). That, I think, is the essence of a good aggregation strategy: slim down an overwhelming amount of data into meaningful chunks.

Be Sociable, Share!

Comments (4)

Hi,
I’ve been trying to do pretty much the same thing as you’ve described. However, I’m a newbie at datawarehousing (let alone the cube) and haven’t found the right SQL 2008R2 incantations to make a table(s) that analysis server will process. I’ve tried a few different things for the data view, all including using a view (or counterdata and counterdetails )and defining dimensions. I converted CounterDateTime to a smalldatetime field, etc. But I haven’t gotten anything to build. Can I ask for your scripts? I’m not handling time correctly I’m sure and I’m pretty sure I haven’t got the keys figured out.
Can you help?

Hi Carl,

From what you’re saying it sounds like you can’t get AS to process the cube from the tables? While I didn’t really talk about the underlying data structure in this post I’d be happy to see if it’s something obvious. Data warehouse table design is a pretty intense topic. Perhaps a good place to start would be with what error you get when you process the cubes?

-Josh

Thanks for the quick answer…
I couldn’t process the cube. I couldn’t process the dimensions alone! (when I right-clicked on the dim in solution explorer).
So I believe I didn’t define them properly.

Background
I started with reclogged perfmon data.
Tables: displayToID, CounterData, CounterDetails.
1) I tried to put them all in a single table (a view) and define measures and dims from there.
SELECT newid() as [ID], c.CounterID, right(c.MachineName, LEN(c.MachineName)-2) as [System], c.ObjectName, c.CounterName, c.InstanceName,
convert(datetime,right(cn.CounterDateTime,LEN(cn.CounterDateTime)-1),10) as [Date], cn.CounterValue
FROM
CounterDetails c inner Join CounterData cn ON c.CounterID=cn.CounterID
WHERE
c.ObjectName != ‘Process’ AND c.InstanceName !=’_Total’ and c.ObjectName !=’LogicalDisk’
(I didn’t really need displaytoID and filtered out some other data…)
2) I also tried starting with the tables and defining measures – define these as ‘fact’ table and define appropriate dim from in these tables.

I think these should have worked, but for my misunderstanding the dimensions (and maybe the surrogate and alternate key definitions).
I tried using newid() and counterid as id’s for dimensions, but my firm grasp seems loose on that point.

How Should I define the dimensions? I’m looking to use the same one you have..

When I saw your blog entry, specifically the MDX, I knew you had already gotten through what I’m tripping over.
Can you send (edited for your company’s sake) versions of the scripts you used to build the warehouse from which you built your cube?
Or the schema of your warehouse with some text explaining the keys used?
Even though I’m still trying, I’ve got the feeling I’m missing something fundamental. After a few AdventureWorks examples I thought I understood, but right now I don’t see what I missed…
thanks
-carl

Ok, I think I’m getting a bit better picture here. So you are using the standard schema created when logging Perfmon counters to a SQL database? I’d actually stop right there, as that structure is prety bad, design-wise. I started out that way but it scaled horribly and I abandoned it. Instead, stash the Perfmon data in .csv files, and load them to a well-designed warehouse using SSIS. Then, just build the dimensions (i.e. Counter, with attributes like Machine Name, Object Name, etc) and the fact off the base tables.

I’ll have to check with the powers at work about sharing the code, since it was mostly done on their time and is therefore their property. The build of that thing could be an entire blog series in and of itself! It was a big success though; scaled to ~50 concurrent servers being monitored with ~30M rows per day loaded in the main fact table.

Write a comment