Basics of Statistics in SQL Server 2005

When we talk about statistics in SQL Server, we’re not discussing a
database to store your baseball card collection. No, statistics in SQL
Server refers specifically to information that the server collects about
the distribution of data in columns and indexes. This data in turn is used
by the query optimizer to determine the plan of attack for returning
results when you run a query. While in the majority of cases SQL Server
takes care of this statistical recordkeeping for you automatically, it’s
useful to have some understanding of what’s going on, and to know when you
might need to intervene.

Automatic Statistics

By default, SQL Server databases automatically create and update
statistics. Although you can disable this (by setting a property of
the database), you should almost certainly leave it on. SQL Server
needs these statistics to do a good job with query processing, and
it’s likely to be smarter than you about what and when to update. The
information that gets stored includes:

  • The number of rows and pages occupied by a table’s data
  • The time that statistics were last updated
  • The average length of keys in a column
  • Histograms showing the distribution of data in a column
  • String summaries that are used when performing LIKE queries on
    character data

SQL Server maintains statistics on the index and key columns of all of
your tables – in other words, the columns that the query processor can
potentially use to filter, join, and locate data, depending on what they
contain.
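Both of these automatic behaviors are exposed as database options. A quick
way to inspect and toggle them – shown here against the AdventureWorks
sample database used throughout this article – is:


SELECT DATABASEPROPERTYEX('AdventureWorks', 'IsAutoCreateStatistics') AS AutoCreate,
       DATABASEPROPERTYEX('AdventureWorks', 'IsAutoUpdateStatistics') AS AutoUpdate

-- Both options can be changed with ALTER DATABASE; as noted above, you
-- should almost certainly leave them ON
ALTER DATABASE AdventureWorks SET AUTO_CREATE_STATISTICS ON
ALTER DATABASE AdventureWorks SET AUTO_UPDATE_STATISTICS ON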

Viewing Statistics

You can get an idea of the statistics that SQL Server is maintaining on
a table from either T-SQL or the SQL Server Management Studio user
interface. Let’s look at the T-SQL solution first. The place to begin is
with the sp_helpstats system stored procedure, which will
list all of the statistics being maintained on a table. For example, in
the AdventureWorks sample database, you can find the statistics on the
Sales.SalesOrderDetail table this way:


sp_helpstats 'Sales.SalesOrderDetail', 'ALL'

statistics_name                                      statistics_keys
---------------------------------------------------- ---------------------------------
_WA_Sys_00000006_245D67DE                            SpecialOfferID
_WA_Sys_0000000B_245D67DE                            ModifiedDate
AK_SalesOrderDetail_rowguid                          rowguid
IX_SalesOrderDetail_ProductID                        ProductID
PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID  SalesOrderID, SalesOrderDetailID

Given an individual statistic, you can use DBCC SHOW_STATISTICS
to retrieve detailed information:


DBCC SHOW_STATISTICS('Sales.SalesOrderDetail', IX_SalesOrderDetail_ProductID)

Name                           Updated              Rows                 
------------------------------ -------------------- -------------------- 
IX_SalesOrderDetail_ProductID  Oct 14 2005  1:59AM  121317               

Rows Sampled         Steps  Density       Average key length String Index
-------------------- ------ ------------- ------------------ ------------
121317               197    0.007550886   12                 NO 

(1 row(s) affected)

All density   Average Length Columns
------------- -------------- -------------------------------------------
0.003759399   4              ProductID
8.242868E-06  8              ProductID, SalesOrderID
8.242868E-06  12             ProductID, SalesOrderID, SalesOrderDetailID

(3 row(s) affected)

RANGE_HI_KEY RANGE_ROWS    EQ_ROWS       DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
------------ ------------- ------------- -------------------- --------------
707          0             3083          0                    1
708          0             3007          0                    1
709          0             188           0                    1
710          0             44            0                    1
711          0             3090          0                    1
(additional rows removed)

The first section of the DBCC SHOW_STATISTICS output gives
you general information on the statistic, including how many rows went
into building the statistic (SQL Server may use only a sample of the data
if the statistic is on a non-indexed column). The second section shows the
sets of columns that this statistic can be used for, and statistical
measures of how well this statistic will help a query locate data in those
columns. The final section is the histogram showing the distribution of
data. The columns in this section are:

  • RANGE_HI_KEY – the key value that marks the upper bound of the range
  • RANGE_ROWS – the number of rows that fall in the range, excluding the
    upper bound
  • EQ_ROWS – the number of rows equal to the upper bound
  • DISTINCT_RANGE_ROWS – the number of distinct values in the range, not
    counting the upper bound
  • AVG_RANGE_ROWS – the average number of rows per distinct value in the
    range, which helps measure the selectivity of the statistic

Even without understanding the math behind query optimization in detail,
it should be obvious that this is useful information.
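To see the histogram at work, run a query that filters on one of the
RANGE_HI_KEY values shown above. For an equality match on a histogram
step, the optimizer's row estimate comes straight from EQ_ROWS:


SELECT SalesOrderID, OrderQty
FROM Sales.SalesOrderDetail
WHERE ProductID = 707

If you display the estimated execution plan for this query, the Estimated
Number of Rows should match the EQ_ROWS value for 707 – about 3083 rows.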

Of course, these days there are many developers and administrators who
prefer to work with the GUI administration tools rather than raw SQL.
Fortunately, all of this information on statistics is also available in
SQL Server Management Studio. Connect to a database and expand the Object
Explorer tree until you find the table of interest, then drill into the
Statistics folder underneath the table. There you’ll find a graphical
representation of all of the statistics on the table, as shown in Figure
1.

Statistics on a table

If you right-click on one of these statistics and select Properties,
you’ll see the general information on the statistic, as shown in Figure
2.

General properties of a statistic

Selecting the Details page shows detailed information on the statistic,
as shown in Figure 3. You may recognize this format – the SQL Server
Management Studio developers seem to have simply piped the
DBCC SHOW_STATISTICS output to the user interface.

Detailed properties of a statistic

Updating Statistics

Just as the data in your database isn’t static, statistics aren’t
either. As you add, delete, and update rows, SQL Server needs to adjust
the statistics so that they still provide good guidance for the query
optimizer. When you run a query, SQL Server checks statistics used in
that query’s plan to
see whether they’re outdated. It uses these heuristics to determine if a
statistic is outdated:

  • If the rowcount in the table has gone from zero to any nonzero
    value, statistics are outdated.
  • If the table had 500 or fewer rows, and at least 500 rows have been
    modified (added/deleted/edited), statistics are outdated.
  • If the table had more than 500 rows, and at least 500 plus 20% of the
    rowcount rows have been modified (added/deleted/edited), statistics
    are outdated.

Normally statistics are updated as soon as it’s discovered that they
should be, which can hold up the execution of a query. You can set the
AUTO_UPDATE_STATISTICS_ASYNC database property to have the
update performed asynchronously instead, which may result in less-accurate
query plans but faster start of execution.
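Like the other statistics behaviors, this one is controlled per database:


ALTER DATABASE AdventureWorks SET AUTO_UPDATE_STATISTICS_ASYNC ON

Note that this option only matters when AUTO_UPDATE_STATISTICS is also on;
with automatic updates disabled, there is nothing to perform
asynchronously.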

Sometimes, you may find that the heuristics aren’t good enough for
telling when statistics should be updated. For example, if you’re adding a
lot of rows at the end of the range of a column (think about adding
autonumbers, or timestamps), the histogram for that column will become
outdated rapidly. You can force an immediate update from T-SQL by
executing the UPDATE STATISTICS statement:


UPDATE STATISTICS table | view 
    [ 
        { 
            {index | statistics_name}
          | ({index |statistics_name} [,...n]) 
        }
    ] 
    [WITH 
        [ 
            [FULLSCAN] 
            |SAMPLE number {PERCENT | ROWS}] 
            |RESAMPLE 
            |<update_stats_stream_option>[,...n]
        ] 
        [[,] [ALL | COLUMNS | INDEX]] 
        [[,] NORECOMPUTE] 
    ] ;

<update_stats_stream_option> ::=
    [STATS_STREAM = stats_stream]
    [ROWCOUNT = numeric_constant]
    [PAGECOUNT = numeric_constant]

As you can see, you can update all of the statistics for a table or
view, or pick a particular statistic to update by index name or statistic
name. You can also specify some update options:

  • FULLSCAN forces a complete read of every row of
    data.
  • SAMPLE lets you specify how much of the data SQL
    Server should sample in building statistics. By default, it will figure
    out its own sampling coverage.
  • RESAMPLE uses the same sampling rate as the last time
    the statistic was gathered.
  • ALL, COLUMNS, or INDEX specify whether
    the update should be to all statistics, column statistics, or index
    statistics. The default is all statistics.
  • NORECOMPUTE turns off automatic statistics gathering
    for this table or statistic.
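Here are a few examples of how these pieces fit together (the sampling
percentage is just an illustration – pick a value appropriate to your
data):


-- Update every statistic on the table, reading all of the rows
UPDATE STATISTICS Sales.SalesOrderDetail WITH FULLSCAN

-- Update only the statistic behind the ProductID index, sampling half
-- of the data
UPDATE STATISTICS Sales.SalesOrderDetail IX_SalesOrderDetail_ProductID
WITH SAMPLE 50 PERCENT

-- Update only the column (non-index) statistics, reusing the previous
-- sampling rate
UPDATE STATISTICS Sales.SalesOrderDetail WITH RESAMPLE, COLUMNS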

The Undocumented Options

If you check Books Online, you’ll find the
update_stats_stream_option documented as “This syntax is for
internal use only and is not supported. Microsoft reserves the right to
change this syntax at any time.”

The STATS_STREAM option, so far as I know, remains
undocumented. But we now know what the other two options do, thanks to a
blog posting from the SQL Server Query Optimization Team. It turns out
that ROWCOUNT and PAGECOUNT can be used to trick
the query optimizer by letting you fake the internal metadata about a
particular table – specifically the count of rows and count of pages that
the table occupies.

If you’re going to experiment with this, please do not do so in
a production database!

For example, you can vastly increase the apparent rowcount and
pagecount of the Sales.SalesOrderDetail table by issuing this command:


UPDATE STATISTICS Sales.SalesOrderDetail 
WITH ROWCOUNT=5000000, PAGECOUNT = 5000000

Why would you want to do this? SQL Server’s query optimizer does things
differently on small tables and on large tables. During development, you
likely don’t have a whole lot of data in your tables. Rather than spend
time generating huge amounts of sample data, you can use the undocumented
options of UPDATE STATISTICS to see how the query optimizer
will treat your tables after they’ve been in production for a while. This
can be a quick and dirty method to uncover potential performance issues
before they arise, and perhaps point out the need for additional indexes
or other schema work. I’d recommend confirming any such changes by
retesting with realistic amounts of sample data, though.
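To check the metadata the optimizer now sees, you can run sp_spaceused;
and one way to put the real numbers back when you're done experimenting is
DBCC UPDATEUSAGE, which recomputes row and page counts from the actual
data:


EXEC sp_spaceused 'Sales.SalesOrderDetail'

-- Restore the genuine row and page counts
DBCC UPDATEUSAGE ('AdventureWorks', 'Sales.SalesOrderDetail') WITH COUNT_ROWS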

Using Statistics Wisely

In most cases, SQL Server 2005 is so well tuned in this area that you
only need to remember a single best practice: leave the statistics set to
automatically update and forget about them. If you’re having a problem
with particular queries, there are a few things you can do to make
statistics more accurate. These include running an update with
FULLSCAN, creating multicolumn statistics if you have
multiple-column WHERE clauses that aren’t covered by existing
statistics, and updating statistics more frequently if you’re filling a
table with ascending data. Armed with the ability to evaluate the quality
of statistics, you should be able to decide whether you need to undertake
any of these steps in your own databases.
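As an example of the second suggestion, if your queries frequently filter
Sales.SalesOrderDetail on ProductID and SpecialOfferID together, you could
create a multicolumn statistic by hand (the statistic name here is my
own):


CREATE STATISTICS stats_ProductID_SpecialOfferID
ON Sales.SalesOrderDetail (ProductID, SpecialOfferID)
WITH FULLSCAN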

About the Author

Mike Gunderloy is the author of over 20 books and numerous articles on
development topics, and the Senior Technology Partner for Adaptive Strategy, a
Washington State consulting firm. When
he’s not writing code, Mike putters in the garden on his farm in eastern
Washington state.
