Basics of Statistics in SQL Server 2005
When we talk about statistics in SQL Server, we're not discussing a database to store your baseball card collection. No, statistics in SQL Server refers specifically to information that the server collects about the distribution of data in columns and indexes. This data in turn is used by the query optimizer to determine the plan of attack for returning results when you run a query. While in the majority of cases SQL Server takes care of this statistical recordkeeping for you automatically, it's useful to have some understanding of what's going on, and to know when you might need to intervene.
Automatic Statistics
By default, SQL Server databases automatically create and update statistics. Although you can disable this (by setting a property of the database), you should almost certainly leave it on. SQL Server needs these statistics to do a good job with query processing, and it's likely to be smarter than you about what and when to update. The information that gets stored includes:
- The number of rows and pages occupied by a table's data
- The time that statistics were last updated
- The average length of keys in a column
- Histograms showing the distribution of data in a column
- String summaries that are used when performing LIKE queries on character data
SQL Server maintains statistics on index and key columns of all of your tables - in other words, the columns that the query processor is able to use in various ways in queries depending on the data that they contain.
Viewing Statistics
You can get an idea of the statistics that SQL Server is maintaining on
a table from either T-SQL or the SQL Server Management Studio user
interface. Let's look at the T-SQL solution first. The place to begin is
with the sp_helpstats system stored procedure, which will
list all of the statistics being maintained on a table. For example, in
the AdventureWorks sample database, you can find the statistics on the
Sales.SalesOrderDetail table this way:
sp_helpstats 'Sales.SalesOrderDetail', 'ALL'
statistics_name statistics_keys
---------------------------------------------------- ---------------------------------
_WA_Sys_00000006_245D67DE SpecialOfferID
_WA_Sys_0000000B_245D67DE ModifiedDate
AK_SalesOrderDetail_rowguid rowguid
IX_SalesOrderDetail_ProductID ProductID
PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID SalesOrderID, SalesOrderDetailID
Given an individual statistic, you can use DBCC SHOW_STATISTICS
to retrieve detailed information:
DBCC SHOW_STATISTICS('Sales.SalesOrderDetail', IX_SalesOrderDetail_ProductID)
Name Updated Rows
------------------------------ -------------------- --------------------
IX_SalesOrderDetail_ProductID Oct 14 2005 1:59AM 121317
Rows Sampled Steps Density Average key length String Index
-------------------- ------ ------------- ------------------ ------------
121317 197 0.007550886 12 NO
(1 row(s) affected)
All density Average Length Columns
------------- -------------- -------------------------------------------
0.003759399 4 ProductID
8.242868E-06 8 ProductID, SalesOrderID
8.242868E-06 12 ProductID, SalesOrderID, SalesOrderDetailID
(3 row(s) affected)
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
------------ ------------- ------------- -------------------- --------------
707 0 3083 0 1
708 0 3007 0 1
709 0 188 0 1
710 0 44 0 1
711 0 3090 0 1
(additional rows removed)
The first section of the DBCC SHOW_STATISTICS output gives
you general information on the statistic, including how many rows went
into building the statistic (SQL Server may use only a sample of the data
if the statistic is on a non-indexed column). The second section shows the
sets of columns that this statistic can be used for, and statistical
measures of how well this statistic will help a query locate data in those
columns. The final section is the histogram showing the distribution of
data. The columns in this section show the key value that demarcates the
range, the number of rows that are in the range but not equal to the
upper bound,
the number of rows in the range that are equal to the upper bound, the
number of distinct values in the range (not counting the upper bound), and
the average number of duplicate values in the table for this range (which
helps measure the selectivity of the statistic). Even without
understanding the math behind query optimization in detail, it should be
obvious that this is useful information.
Of course, these days there are many developers and administrators who prefer to work with the GUI administration tools rather than raw SQL. Fortunately, all of this information on statistics is also available in SQL Server Management Studio. Connect to a database and expand the Object Explorer tree until you find the table of interest, then drill into the Statistics folder underneath the table. There you'll find a graphical representation of all of the statistics on the table, as shown in Figure 1.
If you right-click on one of these statistics and select Properties, you'll see the general information on the statistic, as shown in Figure 2.
Selecting the Details page shows detailed information on the statistic,
as shown in Figure 3. You may recognize this format - the SQL Server
Management Studio developers seem to have simply piped the
DBCC SHOW_STATISTICS output to the user interface.
