SQL Query Optimization (Part 2) - Optimizing Queries With Indexes
Continuing from the previous part, this post goes into detail on the role indexes play in SQL queries and how to use indexes efficiently to make our queries better.
Indexes and Page Splits
Every b-tree index must maintain its data in the correct sort order of the index keys. As inserts and updates happen, there is a chance that a certain page becomes full. Taking the telephone directory example, suppose a page can hold at most 200 telephone numbers, and that page 100 currently holds the numbers 030000 – 030500 and is full. (The range spans more than 200 values because some numbers in it are not currently assigned.) The index already records that page 100 contains that range of telephone numbers. Now if we add a number like 030006, we cannot place it on page 100 because the page is full. One option would be to insert the value there anyway, shift all the following numbers, and recalculate the index, but that is a very costly operation. What happens instead is a page split: another page is added at the end of the directory, say page 7,000, a link is kept on page 100 noting that it continues there, and half of the data stays on page 100 while the other half moves to page 7,000.
This is a costly operation, as it involves moving data around, and subsequent reads have to follow the links. Thus a field that is constantly modified is not a very good candidate for indexing (if it can be avoided).
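One common mitigation is to leave free space in each page so that new rows can be inserted without an immediate split. A minimal sketch, assuming SQL Server syntax (the table and index names here are hypothetical, and the fill factor value is purely illustrative):

```sql
-- Rebuild the index leaving 20% free space in each leaf page,
-- so inserts into an existing range can land without a page split.
-- Hypothetical names; 80 is an illustrative fill factor, not a recommendation.
ALTER INDEX IX_Directory_PhoneNumber ON Directory
REBUILD WITH (FILLFACTOR = 80);
```

The trade-off is that a lower fill factor makes the index larger and scans slower, so it only pays off on indexes that actually suffer frequent splits.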
A related problem is that natural primary keys almost never arrive in sorted order. This means a lot of page splits happen when we use natural primary keys for the clustered index. One solution is to use a surrogate key for the clustered index instead.
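A sketch of the surrogate-key approach (column names and types are assumptions for illustration): an ever-increasing identity value clusters new rows at the end of the index, so inserts append rather than split.

```sql
-- Surrogate key as the clustered index: new rows always append at the end,
-- so page splits from out-of-order inserts are avoided.
CREATE TABLE Policies
(
    Policy_ID     INT IDENTITY(1,1) NOT NULL,   -- surrogate key, monotonically increasing
    Policy_Number VARCHAR(20)       NOT NULL,   -- natural key, unordered on arrival
    CONSTRAINT PK_Policies PRIMARY KEY CLUSTERED (Policy_ID),
    CONSTRAINT UQ_Policies_Number UNIQUE NONCLUSTERED (Policy_Number)
);
```

The natural key keeps its uniqueness guarantee through a non-clustered unique constraint, where splits are cheaper because only the narrow index pages are affected.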
Some also suggest that using no clustered index at all (i.e. keeping the table as a heap) can be a better solution than clustering on a field whose values do not arrive in sorted order.
Selectivity of Indexes
If the indexed field contains many unique values, the b-tree can branch on many distinct keys, which means fewer rows share each key value on a leaf. This can improve performance, as each search narrows down to only a few records per key.
Imagine the extreme case of indexing a logical field, which can hold either a yes or a no value. The b-tree would contain just two leaf ranges, each potentially holding half of the table. Searching for a certain record with this index is no different from searching the entire table, and may be even more expensive, because we might also need to refer to the clustered index for the remaining columns.
The selectivity of a field can be checked with DBCC SHOW_STATISTICS: the lower the density, the higher the selectivity.
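For example (the index name here is hypothetical; the table name follows the Claims example used later in this post):

```sql
-- Full statistics output: header, density vector, and histogram.
DBCC SHOW_STATISTICS ('Claims', 'IX_Claims_Insurance_ID');

-- Just the density section, where a lower value means a more selective key.
DBCC SHOW_STATISTICS ('Claims', 'IX_Claims_Insurance_ID') WITH DENSITY_VECTOR;
```

Density is roughly 1 divided by the number of distinct key values, so a unique key approaches a density of 1/rowcount, while a yes/no column sits at 0.5.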
While fetching rows from a table, SQL Server may use three types of operations:
- Table Scan: reads the entire table directly and does not use any index. Searches on non-indexed fields yield this type of operation, and it is very slow.
- Index Scan: reads only the leaves of an index. A range may be specified so that the scan stops after reading some number of leaves, or a filter may be applied so that only certain leaves are returned.
- Index Seek: traverses the b-tree and returns only a selected number of rows.
Index Scans and Range Scans
Suppose we have a table Claims with four columns: Claim_ID (primary key, clustered index), Insurance_ID (non-unique index), Provider_ID (non-unique index), and Provider_Name (non-indexed).
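One possible definition of this table (the column and index layout come from the description above; the data types and constraint names are assumptions):

```sql
CREATE TABLE Claims
(
    Claim_ID      INT          NOT NULL,
    Insurance_ID  INT          NOT NULL,
    Provider_ID   INT          NOT NULL,
    Provider_Name VARCHAR(100) NOT NULL,
    -- Primary key doubles as the clustered index.
    CONSTRAINT PK_Claims PRIMARY KEY CLUSTERED (Claim_ID)
);

-- Non-unique, non-clustered indexes on the two lookup columns.
CREATE NONCLUSTERED INDEX IX_Claims_Insurance_ID ON Claims (Insurance_ID);
CREATE NONCLUSTERED INDEX IX_Claims_Provider_ID  ON Claims (Provider_ID);
```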
If we run a query like
SELECT * FROM Claims
SQL Server is forced to use an index scan over all the available rows. This means starting with the first entry in the clustered index leaf level and sequentially reading all the records in all the leaves. This is still faster than a table scan, and the clustered index delivers a very high number of rows per millisecond; however, since the whole table needs to be read, it is still slow.
But for a query like
SELECT * FROM Claims WHERE Claim_ID = '1234'
the operation is very fast. The where clause asks the engine to filter a single row by the clustered index key, so the server performs an index seek on the clustered index, and when it finds the leaf with the value 1234 it returns the entire row (which is stored with the clustered index). Because the clustered index is unique, it does not need to seek any further rows in the b-tree.
The case is similar when we do a range search on the index key. Something like
SELECT * FROM Claims WHERE Claim_ID BETWEEN '1234' AND '1240'
will find 1234 using an index seek (b-tree traversal) and then, because the index is sorted, scan the leaves until it reaches 1240, returning the rows in between. So for range finds on the index column, the query doesn't take much longer than a single index seek.
This holds for small ranges. A very big range may still need a large number of index seeks, to allow for new values being inserted: imagine that a new row is inserted while we are in the middle of a long index scan. Continuing the scan over shifting pages would be hazardous, so traversing the b-tree again and again is the only option for big ranges. This can increase the query time considerably, as repeated index seeks over a large number of records are more time consuming than an index scan over sequential leaves.
Non Clustered Indexes
Imagine another case where we need to search by a non-unique indexed key, with a query like
SELECT * FROM Claims WHERE Insurance_ID = '1234'
The operation needs an index seek on the Insurance_ID index, an index scan on the same index, and then index seeks on the clustered index. The engine traverses the Insurance_ID index to find the value 1234, and since the index is non-unique it must also scan the leaves sequentially to find all the entries with that value; fortunately, since indexes are sorted, those entries are adjacent. Each matching leaf entry gives us a clustered index key, which must then be looked up via a b-tree traversal of the clustered index to fetch all the other column values. This last operation is called a bookmark lookup.
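The lookup step disappears when the query asks only for columns already stored in the non-clustered index leaves (the index key plus the clustered key). A sketch, using the Claims example:

```sql
-- Both columns live in the leaves of IX_Claims_Insurance_ID
-- (Insurance_ID is the key; Claim_ID, the clustered key, is stored
-- alongside it), so no bookmark lookup is needed.
SELECT Insurance_ID, Claim_ID
FROM Claims
WHERE Insurance_ID = '1234';

-- SELECT * on the same filter forces one clustered-index lookup
-- per matching row to fetch the remaining columns.
SELECT *
FROM Claims
WHERE Insurance_ID = '1234';
```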
Non-Covering Indexes and Avoidance of Select *
Traversing the clustered index after the non-clustered index seek and scan is a potentially lengthy operation, because the matching rows are scattered throughout the clustered index. In fact it can be so lengthy that it performs no better than a full scan of the clustered index, which makes it no different from a plain select * from Claims query.
This means we must avoid select * when we are filtering with criteria based on non-clustered index key.
It goes without saying that select *, or any filter on a non-indexed key, means scanning the table directly, which is the most time consuming option of all.
Thus for filtering we should ideally have an index that covers all the fields used in the where, group by and other criteria. For example, if we add Provider_Name to the Provider_ID index like
CREATE INDEX IX_Claims_Provider_ID ON Claims (Provider_ID) INCLUDE (Provider_Name)
And then write a query like
SELECT Provider_ID, Provider_Name FROM Claims WHERE Provider_ID = '1234'
then Provider_Name can be found directly in the leaf of the index b-tree, so no bookmark lookup is needed. Such an index is called a covering index for the query.
Column order still matters even with a covering index, however. For example, if we run
SELECT Provider_ID, Provider_Name FROM Claims WHERE Provider_Name = 'John Doe'
then the first part of the index, Provider_ID, is useless here, and we end up essentially doing a full index scan with the results filtered by provider name.
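If this filter is common, an index whose leading column is Provider_Name would allow a seek. A sketch (the index name is hypothetical):

```sql
-- Leading on Provider_Name lets a WHERE Provider_Name = ... filter
-- seek instead of scan; Provider_ID rides along so the query is
-- still covered.
CREATE NONCLUSTERED INDEX IX_Claims_ProviderName_ProviderID
    ON Claims (Provider_Name, Provider_ID);

-- Now this becomes an index seek on the leading column:
SELECT Provider_ID, Provider_Name
FROM Claims
WHERE Provider_Name = 'John Doe';
```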
In the example above, if we filter by two criteria that are individually indexed but not part of one composite index, the database engine does something called a merge join: an index scan happens for each of the indexes and the results are then combined. This needs more operations than a single index seek, but it is definitely better than a table scan with no indexing at all.
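A query of that shape, using the two separately indexed columns of the Claims example:

```sql
-- Insurance_ID and Provider_ID each have their own index but share
-- no composite index: the engine can read each index for its filter
-- and merge the two sets of matching clustered keys.
SELECT Claim_ID
FROM Claims
WHERE Insurance_ID = '1234'
  AND Provider_ID  = '5678';
```

Whether the optimizer actually chooses this intersection depends on its cost estimates; for very selective filters it may simply seek one index and apply the other predicate during the lookup.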
Exceptions to covering indexes
When an expression, such as a negation or an arithmetic operation, is applied to an indexed key, filtering may happen by a scan of the whole index. For example, if we run
SELECT WorkOrderID FROM WorkOrder where ProductID + 2 = 800
where ProductID has been indexed, the engine has to evaluate the expression for each of the rows before they can be filtered out. For this specific case we can rewrite the query as
SELECT WorkOrderID FROM WorkOrder WHERE ProductID = 800 - 2
and the index can be used here. In the case of negation, however, it is generally required to prove that a certain value does not exist rather than that it exists, which is generally not very easy to optimize.
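A sketch of the negation case (same WorkOrder example; whether the engine scans or splits the predicate depends on the optimizer):

```sql
-- Negation is typically non-SARGable: proving a value is absent
-- means examining the whole indexed range.
SELECT WorkOrderID FROM WorkOrder WHERE ProductID <> 800;

-- An equivalent pair of open ranges can sometimes be seeked instead,
-- one seek per range:
SELECT WorkOrderID FROM WorkOrder
WHERE ProductID < 800 OR ProductID > 800;
```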
The case is similar for queries containing wildcards. For example, if we need to find all work order IDs ending in a 5, potentially the whole table needs a scan.
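Concretely (assuming the key is stored as a character type so LIKE applies):

```sql
-- Leading wildcard: the index sort order cannot help, because matching
-- values can start with anything. Every key must be examined.
SELECT WorkOrderID FROM WorkOrder WHERE ProductID LIKE '%5';

-- Trailing wildcard: all matches share the prefix '12', which is a
-- contiguous range in the sorted index, so a seek is possible.
SELECT WorkOrderID FROM WorkOrder WHERE ProductID LIKE '12%';
```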
For queries that apply a function to the data before filtering, the expression must be evaluated for every row before the filter can be applied, so they are not very fast either.
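A common instance of this, with a rewrite that keeps the column bare (the DueDate column is a hypothetical addition for illustration):

```sql
-- Non-SARGable: YEAR() must run on every row before filtering,
-- so an index on DueDate cannot be seeked.
SELECT WorkOrderID FROM WorkOrder WHERE YEAR(DueDate) = 2020;

-- SARGable rewrite: the bare column is compared against a range,
-- which an index on DueDate can seek directly.
SELECT WorkOrderID FROM WorkOrder
WHERE DueDate >= '2020-01-01' AND DueDate < '2021-01-01';
```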
These types of modifications to indexed arguments are called non-SARGable (SARG stands for Search ARGument).