As data engineers, our work time is mostly consumed with writing queries that produce the desired output or with designing systems that can properly capture the required data. However, we frequently run across situations where we also need to optimize our queries for efficiency. As entry level data engineers, (or even when we have gained much experience) we might tend to believe that all queries that produce equal output must be equivalent in resource consumption. This is a misconception most of us learn the hard way.
Without adequate of knowledge of how SQL servers internally work and tools which help us visualize it, our efforts might be confined to ad-hoc hit-and-trial experiments with the query. Obviously, there are various ways in which the same result might be produced but there are a number of difficulties with this approach
- The sheer number of possibilities might be so large that we might have to invest considerable time to find an optimal solution. Furthermore, we might not know why a certain solution works in the end even if we do.
- As the data changes our query may need to change again and we are back where we were initially.
- We will have difficulties replicating our success or to transfer what we have learnt.
We cannot call these ad-hoc efforts as engineering. Going one level up we may even learn a few proven techniques that can be documented and that work but we would still not know why. We must approach the art of query optimization in a methodological way and the base of it starts with knowing how indexes work and how to use tools to view how the SQL server engine is really translating our queries internally.
In this article I will present how indexes work and how to view query execution plans in Microsoft SQL server 2008.
To understand this article you will need to have the following installed in your system
- MSSQL server 2008
- MSSQL server 2008 management studio
- AdventureWorks or similar database to play with the data
Query Execution Plan Basics
When you present a certain query to the SQL server it translates the query into an internal representation. The SQL server does not need to literally interpret the SQL query and produce the result. The SQL server internally operates with a set of primitive operations which is typically less than the number of operations that can be specified by using a SQL statement. These primitive operations are combined in certain sequences to satisfy the desired output.
The SQL server in fact might have various alternatives in which the query might be satisfied. For example a query such as
SELECT * FROM claim_details WHERE insured_id=’1234’ AND place_of_service=’abc’
Can be satisfied by first searching all the claim details for insured_id value 1234 and then filtering the result out for only those that contain place of service abc or the opposite way. This might seem trivial but there can be situations in which these two alternatives yield drastically different query time.
The part of the SQL engine which generates different alternatives and then chooses one that is likely to be most efficient is called the query optimizer. One the SQL query is received it is parsed and then a number of alternatives are generated by the optimizer. Each alternative is a sequence of primitive operations. These are called the execution plans.
Each plan will potentially have a different execution time based on the number of read/write operations, consumption of memory and CPU and so on. The query optimizer must select a plan that does the job in least time with least resource consumption. It is obvious that to predict it the optimizer needs some statistical information about the data. To give a trivial example a very small table would need very less time to read the whole data whereas for a very big table reading only a certain number of rows is better even if it means allocating figure out which rows to read.
figure1: plan costs calculated by the query optimizer
Collecting of statistics, like the number of rows in a table and so on itself consumes some time and it is best to reuse the pre-calculated information if it exists. Based on the statistics then the SQL optimizer calculates cost for each step in the execution plan. These cost reflect the relative expense of running the step and doesn’t have units. If the cost is higher for one step than other possibility it means it will likely take lesser time/resource etc. to run. The total cost of a plan is the sum of cost of all the steps.
As mentioned the statistics used to calculate the costs might be out of date. During the actual execution the plans might be altered by the execution engine if this is the case. Hence the calculated plan and the exact plan might be different. The calculated plan is called the estimated execution plan and the actual plan (which is not known before the query runs) is called the actual execution plan.
Viewing Query Execution Plans
The query execution plan can be viewed in the following different ways
- Using SQL server management studio to view estimated and actual query execution plans in graphical format
- Using showplans to return the execution plan in message or result set
- Using SQL Profiler to capture execution plans in plain text or XML
- Query plans in cache may be viewed using dynamic management views.
figure 2: viewing execution plan in SQL server management studio
In the figure above the execution plan is shown for a query using the “Display Estimated Execution Plan Button”.
The statistics and various estimated costs for each of the steps can be viewed by right-clicking the boxes in the diagram pane and then selecting properties.
Imagine a real world problem of storing information like names, addresses, phone numbers and other information for all the people in a country. One crude way might be to list information pertaining to an individual in a random way. However, if we need to the name of a person with a certain phone number we will have a few options but to start searching from the start of the list till we find the person. This is the way a normal SQL table without any indexes would work. To find any data we will need to start from beginning and walk through the list. This is a nightmare in case we need to plough through several gigabits.
A normal telephone directory however makes this slightly easier. We can sort the data in ascending order of names as shown in figure 3. For our pupose if we imagine a telephone directory which oderers by telephone number first concepts might be easier to explain. Hence, if we need to find a certain telephone number we can do it much quickly following a dictionary lookup.
figure 3a: Our telephone directory modified to list telephone numbers first
figure 3b: A typical telephone directory
The way indexing in done in the telephone directory is a bit different than what is done in the MSSQL server. An index is maintained in a balanced tree in the case of SQL server. To explain a b-tree we can imagine a list in the beginning of the telephone directory. The first entry in the list would say something like
For telephone numbers 000000 – 555555 turn to page 2
For telephone numbers 555556 – 999999 turn to page 2,000
Then again on page 2 we have
For telephone numbers 000000 – 277777 turn to page 3
For telephone numbers 277778 – 555555 turn to page 1,000
Finally the range will get smaller and smaller until we can point to a single page with phone numbers and other information in it.
This is exactly like how the MSSQL server maintains its indexes.
figure 4: A b-tree for a non-clustered index
In SQL language we call the telephone number field in our telephone directory to be cluster indexed. This means that there is a strategy for finding telephone number quickly (i.e through a b-tree) and all other information related to the phone number (e.g. person’s first name, last name etc.) are listed right beside the phone number. Thus at the leaf of the cluster index b-tree we have all the fields of the table also listed.
The clustered index is the physical sort order of the table and since there can be only one physical sort order for any table there can be only one clustered index. When we create a primary key we are telling the SQL server to also create a clustered index on those fields.
For cases where we need to find other information for a certain telephone number a our modified telephone directory is enough but for cases where we need to find phone numbers for a given first name it is almost useless. To resolve this we might be tempted to create a separate directory which lists data on first name basis.
In SQL server similar (not exact) approach follows. The server maintains a b-tree of the first names but at the leaf instead of attaching details like phone numbers, last names etc. it will only point to telephone numbers, which is a clustered index. This is called a non-clustered index.
Thus if you have created a non-clustered index on first name and you want to find the full name of all the people whose first name is Paul you will have to first scan the index for first names and then when you have found the telephone numbers corresponding to the first name you look up in the clustered index table to find the last names.