General discussion of SQL Topics; aimed for the novice-intermediate level Microsoft SQL Server User. Currently focuses on using SQL Server 2005.

Showing posts with label I/O. Show all posts
Showing posts with label I/O. Show all posts

Thursday, May 22, 2008

Optimizing Indexes

Summary:

Optimizing Indexes is quite a complicated subject. There are many techniques; and what is the most difficult is that there are so many different scenarios as to when and how to optimize your indexes. Instead of trying to describe a method of how to optimize, I am going to discuss how to identify when optimizing indexes should be considered and where to go to find the different options.

The first step is NOT to just rebuild all indexes, or to defrag them all. There are some people that believe that rebuilding all indexes is the way to solve this; don't listen to them!! PLEASE DON"T!!! I'm rarely the person to give advice on not listening to other people; most people can help you learn about what's good or bad...but if you go the route of rebuilding and defraging all your indexes, you can in fact be causing additional problems that you didn't have to cause. In particular using system resources that don't need to be used, and causing unneeded fragmentation on the OS level! Any decent DBA is always concerned with using system resources, remember that system resources is always a precious commodity and should never be wasted.

The first step is to analyze the database. The next step is to identify the queries that can/will benefit from optimizing the index(es). The third step is to optimize the index. The final step is to periodically review these methods to identify when optimizing indexes will be required; this will be an ongoing project and will ultimately require re-iterrating through your queries to identify when an index has become warranted for optimization. There is no cut and dry method to warn you, prevent indexes from becoming out of tune, or to auto-optimize the queries and indexes.

Analyzing the Database:

So, how do you analyze a database to determine if an index needs to be rebuilt or should run a defrag on the index? Well, first you need to understand how indexes are built and what is causing them to become fragmented. Review my blog entry "Introduction To Indexes" to learn about how indexes are built. Also, take the time to review this white paper called "Microsoft SQL Server 2000 Index Defragmentation Best Practices"; even though it is written for SQL Server 2000, it will still apply the same to SQL Server 2005. It is a long read, but will help out well beyond the scope of this blog entry.

Here, in a nutshell, is how to determine when your index is fragmented. Use SQL Profiler to identify poor performing queries; in particular use the template "SQLProfilerTSQL_Duration", this will already contain the traces needed to identify the offending queries. Once the queries are identified you can then start looking into which indexes there queries are accessing; especially queries that are retrieving ranges instead of singletons. These queries are the highest risk of having fragmented indexes, remember though that this is just a method to determine the potential problem areas. Your own judgement will be best to make the final determination.

When to Consider Optimizing Indexes:

The first key in detecting that you should optimize your index is when you are observing performance degradation and have no absolute culprit to cause this. Now, keep in mind that just because you are seeing performance degradation doesn't immediately indicate that there is index fragmentation. This can only be determined by analyzing your database properly. Next, identify which queries are utilizing the most I/O; these are the next candidates. Then consider queries that have workloads that are suspected to benefit from index defrags; this can be quite difficult and should be carefully determined.

All of these are covered in the MSDN white paper called "Microsoft SQL Server 2000 Index Defragmentation Best Practices"; mentioned earlier. This paper even goes into very good detail on how to make these determinations, which will probably be enough to get your through while you gain experience.

Getting Help:

Sometimes it's nice to have someone more experienced help out, if that is how you feel then go with that feeling and seek out that someone.

If you don't have physical access to an experienced DBA then seek advice from trusted websites, forums, and/or discussion groups. Remember when seeking advice they can only provide an answer with quality that matches the information you provided. Such as stating you have a database that needs indexes rebuilt or defragged, will most likely get you answers stating to use DBCC commands or some other commonly used index rebuilding command. The purpose of seeking advice is to provide detailed information to get an answer that is specific to your scenario; so be sure to provide as much information as possible without breaking any company policies.

Conclusion:

As you can see, Indexes can be fairly simple to optimize. It's determining when to optimize and what to optimize that becomes difficult.

There are methods to fine-tune the auto-index handling of your indexes, this is covered in the white paper I mention in the next section. You'll always find different opinions and experiences, embrace all you can and mix and match what works best for you and your situation. There is no one-size fits all for optimizing indexes, just as there isn't one-size fits all for database solutions. It's all about customizing to your needs, and utilizing your available resources to make your work easier and more enjoyable.

I can't stress enough that reading the "Microsoft SQL Server 2000 Index Defragmentation Best Practices" white paper will help out tremendously. This paper covers the topic so well, that originally I had planned on providing tips on how to identify the queries that required indexes to be rebuilt, and how to find additional help on this topic. During my research I came across this white paper and it covered absolutely everything and more than I had planned to cover!

Until next time, Happy Coding!!

Additional Resources:

Microsoft SQL Server 2000 Index Defragmentation Best Practices (http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/ss2kidbp.mspx)
SQL Server Best Practices (http://msdn.microsoft.com/en-us/sqlserver/bb671432.aspx)
How To: Optimize SQL Indexes (http://msdn.microsoft.com/en-us/library/ms979195.aspx)
Database Journal: Reducing SQL Server Index Fragmentation (http://www.databasejournal.com/features/mssql/article.php/2238211)
SQLServer Performance: SQL Server Index Fragmentation and Its Resolution (http://www.sql-server-performance.com/articles/per/index_fragmentation_p1.aspx)

Friday, May 2, 2008

Introduction To Indexes

Summary:

This is a basic introduction to what Indexes are and how to determine what should be indexed. I will cover the basic concept behind indexes, how they are intended to be helpful, why you would want to use them, and how to determine what should be included in your index.

I’ll also cover the differences between Clustered and Non-Clustered indexes, and provide some tips that will help you to know how to differentiate when to use which kind of index, as well as an analogy to break down the differences between Clustered and Non-Clustered indexes in simple terms.

What indexes are and why to use them:

Indexes are intended to help you to efficiently find information within your table; they are meant help you to lower the amount of CPU resources needed to find this information, and also to minimize the amount of Input/Output (I/O) used to access this information. All of this can result in a much faster result being returned.

Indexes in SQL Server can be thought of in a similar fashion as an index in the back of a book or a table of contents in the front of the book. In the book and Index is intended to allow you to specify a word that you are interested in finding and point you to the page(s) that word has been referenced. A Table Of Contents can be used to specify a topic you are interested in, and will point you to the section covering that topic; some table of contents will even point you to sub-sections that refine the context of the topic that can help you to more accurately focus your reading on relevant information.

Indexes within SQL Server are designed to perform these same functions, and provide the same helpful information. Indexes are created either through a T-SQL command or through a form of interface application that connects directly to the database and supports T-SQL commands (such as SQL Server Management Studio, or Enterprise Management, etc). NOTE: I will cover the topics of “Creating Indexes” and “Tuning Indexes” in separate blog entries. Once indexes are created SQL Server will automatically use them to ‘help’ in returning result sets; there are no options you must specify to take advantage of the created indexes…this is all built-in to SQL Server.

There are some minor differences in SQL Server 2000 and SQL Server 2005; since I primarily use SQL Server 2005 my discussion is based on this version. However, all of this information can be confirmed with the use of Books Online (BOL) for SQL Server 2000. I will try to avoid using specific information that can’t be used in SQL Server 2000; however, I am unable to guarantee that all information applies to previous SQL Server versions. Please use BOL if you are unsure of any of this information will work with your version of SQL Server.

How SQL Server Retrieves Data without Indexes:

Before we can go into how an index works, we should really understand how SQL Server finds data without the indexes; this will illustrate the importance of using Indexes. My examples will be based on a Customer Table that holds basic customer information such as the customer’s first and last name, their street address, city, state and zip code. When I use the term “Customer” I am referring to the Customer Table; this also goes for “First Name”, “Last Name”, etc. These are referencing their respective columns within the Customer Table.

What happens when SQL Server attempts to retrieve data from a table that does not contain any indexes is called a “Table Scan”. This is where SQL Server will go through each record (column by column, then row by row) to find matching records. After it has gone through the entire table it will return any matching results it has found. As you can probably image this is no quick task. This would be similar to picking up a book and deciding to find any pages that contain the word “alligator” in it. There might be a page with that word, there might now. In either case, without an index or table of contents you’d have to flip through every page to see if it contains this search word. If the book is under 30 pages, this may not be too bad…but, what if the book is 1500+ pages; that’s a whole different story.

This brings us to Indexes to help lower the amount of time SQL Server needs to spend finding matches in results!

Different Types of Indexes (and the general structure of an index):

There are two basic types of indexes: Clustered and Non-Clustered. Before I get into the specifics of each type and the differences between the types I want to cover the basic structure of an index; and the minor structural differences between the two types of indexes.

Basically all indexes are formulated with what is called “index key(s)” for a column (or combination of columns). These basically are pointers that tell SQL Server were specific words (or data) is contained within a table. Each table is broken into pages (this is physical storage of the data; each page is 8 KB and typically is constructed based on the order of entry…not a sort order). A page can contain hundreds of records, or just a few records. Because of the size limit in a data page, it all depends on the data types and how many columns are being stored for each record. There are formulas that can pin point the number of records being stored on a data page; this is beyond the scope of this blog entry.

Each index key(s) has a definition of the page the indexed word (or data) is stored. SQL Server will use these keys to determine what page to go to; so if the search term shows up in data page # 4 of 27 pages, SQL will skip over the first 3 pages and search page # 4. It will halt its search if there are no other keys specifying that data is stored in other pages; there are often times cases where data will be stored through multiple pages (like page #s 4, 12, 13, 21, 25, 26) or a data record could span multiple pages (such as page #4 & 5). Again, this is beyond the scope of this blog entry. The important thing to know is that data isn’t naturally stored in logical order within the physical file; this is why we need to have indexes to speed up the search. Also, since searches aren’t pre-determined when data is being entered SQL cannot have the data sorted, and in many cases you may have a specific method of sorting the data…but you might not be searching the sorted data (such as in Customer table, you might sort data based on customer’s location like City/State…but want to search for customer’s with Last Name starting with R; this type of a query and sort method can’t be pre-determined…hence the difficulty in storing data logically). Data is basically stored on a First In basis, and the rows just sequentially increase as data is entered, or removed. We need searches to tell us where the data is so that we can easily find this data to match our unknown (at entry time) search criteria and sorting methods.

This brings us to how do we tell SQL Server the best method to search if we don’t know it ourselves?! The answer is Clustered and Non-Clustered indexes, some research and knowledge of the data being stored, and a lot of testing (and refined Tuning as the database is used).

Clustered indexes:

Clustered indexes are a logical sorting and storing of index keys. This can allow SQL Server to very efficiently find data. This is all based on the defined columns within the clustered index. So, why don’t we want to just have every column within every table to be included in the clustered index? Because the more indexes included in the clustered index takes away from performance for your INSERTs, UPDATEs, and DELETE statements.

Basically the idea is that Indexes, especially Clustered, are wonderful to increase the speed to obtain results for Queries and Reports. However, there is a tradeoff which is that it takes more time write and/or modify data within the table. This is because the clustered index needs to update itself.

Imaging a file cabinet holding your customer information, you originally store all data by the customer’s Last name and then the First name. Now, you decide well, for getting information it would be quicker to store all of the data within each file by the date the data was added (and incase of similar dates then by alphabetical titles). So, now you need to find a customer John Smith and a paper that hold his Personal Address information…you look under Smith, find John, then in his folder look up Address section, then Personal and you quickly get the information.

Now, what if you have to update his file to include his birthdate, so that might go under Personal Information…not personal address though! So, now you find the last name, first name, then information section, then personal information page and add the file. Wouldn’t it have been quicker to just find his file and add her personal information to the end of his folder? That’s where the balance comes in, so question is partly how well your system performs with just reading queries/reports versus entering/modifying information…but the other part is what will be done more often. Are you going to query the information more often or are you going to update/modify the information more often? If you are querying, then siding on extra columns in your cluster MIGHT be ok; but if you are inserting/modifying data then you want only to use the number of columns it is to filter the result sets into manageable sections (maybe filter only name; last name, then first name…or maybe name and location; such as last name, first name, then city/state). It all just depends on what you are doing, how much data is being stored, and how you anticipate accessing this data will occur (i.e. often, seldom, with reports, lots of updates, lots of inserts, few inserts, etc).

Now, an important note is that you can ONLY have 1 clustered index per table! So, this can bring up the point of where you want the data to be accessible, but you don’t want to bog down the system each time you need to add/modify/delete data. So, how do you do this? How do you walk this fine line of optimal performance? Here come non-clustered indexes to the rescue!

Non-Clustered indexes:

You can have up to 249 Non-Clustered indexes; however, just as with Clustered Indexes there is such as thing as too many! Non-clustered indexes have its intended usage, which will get covered in the Tips section.

Non-Clustered indexes are indexes that aren’t sorted at the physical data page layer. This means that SQL Server can be pointed to the data page containing the data matching the search page; but the index stops after that point; it then becomes up to SQL to search that page and pick out the data. Remember Clustered Indexes point to the exact location of the data, and is quick because it is sorted. Non-Clustered indexes only point to the page; the Index Keys in the index are sorted also…but not the logical locations. So, there is some performance loss between Clustered and Non-Clustered. Always try to think of non-clustered indexes as an alternative to listing every column in a clustered index, and a method to allow for data that is less accessed or returns more Exact matched results.

So what does all of this mean? Well, basically if you were to tell SQL Server to find results of last names that start with ‘RE’ and used a non-clustered index to index the last name column then the index would point you to the page(s) that contain the last name(s) starting with “RE”, but there could be names also that start with “RA” or “RI” or “RH”, it all depends on what is actually stored at the time the query is executed. This then means that SQL would go through these other results until it finds the matching result; this is still faster than an entire table scan (which a table scan would start with Last Name “A” and End with Last Name “Z”).

Tips (putting this all together):

Now, that we have an understanding of what Clustered and Non-Clustered indexes are, and how SQL uses them let’s look at how we can determine what we should index (and what type of index to use). Remember:

  • · A table can only have a single Clustered Index
  • · Up to 249 non-clustered indexes
  • · Clustered Indexes point to the exact location of matching results
  • · Non-Clustered Indexes only point to the data page containing matching results, so some scanning will occur upon searching for matching results (however, much less scanning than not having any indexes)
  • · Clustered Indexes can retrieve data much more efficiently; BUT have a cost of slower data INSERT, UPDATE, and DELETE
  • · Non-Clustered indexes will use index key(s) if available, if none are available it will then use the data page ID or Row ID to perform its search for matching data

So now let’s put this all together into a simple understandable analogy. I will use a Library as our ‘database’, each section of the library (such as Romance, History, Computers, etc) will be our ‘Tables’, and of course each book will be our ‘record, or row’.

A catalog will contain two types of information. A ‘clustered’ catalog that will utilize the Dewey Decimal System; and is stored in the index box in Alphabetical Order by Book Title, followed by Author Name. A ‘non-clustered’ catalog will simply tell us the Section (‘Table’), the Shelf Number where top shelf is # 1 and bottom shelf is # 5 (This simulates our ‘data pages’), and finally the title of the book and the author (this is our First Name and Last Name columns). This is not sorted but does contain a listing of all book titles with Author Names that we can look at on a separate sheet to quickly get the index card # it is stored on.

Now with this in mind; imagine searching for a book called “James’ Awesome T-SQL Blog Book” by James R (no such book by the way exists). Now if we did this with a ‘clustered’ catalog. We could quickly look in the Title section and find the book that matches, if there were two books with this title we could then further define the result by looking at the author name.

Now imagine this same scenario but with the ‘non-clustered’ index. Let’s say the book title is somewhere around index card # 100 of 15000. Since we can quickly review the index cards listed we can see the card is around #100. So, we open the index drawer…we find our book information and now we are off.

Since the ‘clustered’ index card includes the Dewey Decimal System we can simply find the closest matching book numbered and quickly jump up or down as need to the exact book. Now with the ‘non-clustered’ index information we only know the section, title and shelf number. Now we get to the proper shelf, we then go through each book one at time to find the proper book. This of course is much quicker than starting at the beginning of the section and looking through each book, or even worse at the beginning of the first book in the library and sequentially going through the entire collection of books until we found the information (which may not even be there!).

Choosing Type of Index:

When to choose Clustered index and when to choose Non-Clustered indexes…

When to choose Clustered Indexes:

  • · Columns that are frequently accessed
  • · Columns intended to be stored sequentially (such as Last Name, First Name, etc)
  • · Columns that will be queried in ranges (when you use WHERE clause with the BETWEEN, >, or < type of operators)

Once you have determined the columns that should be in your clustered index it then becomes ESPECIALLY CRITICAL to consider what type of queries will be ran most often and what queries MUST have optimal performance. If these columns are not within these queries then you may want to reconsider revising your clustered index NOT to include unnecessary columns (consider creating non-clustered index for these columns). If you find that you require columns within these queries that meet the above Index selection tips and they are not already included, and then consider revising your index to include these columns.

When to choose Non-Clustered Indexes:

  • · Columns frequently in the WHERE clause that return EXACT matches
  • · Columns that contain many distinct values (Such as Last Name, or Street Address; but not City, State or Zip Code).
  • · Queries that do NOT return large result sets
  • · Especially in columns that are needed for critical queries, and aren’t already covered in your Clustered Index; or may be queried in a non-sequential manner that isn’t already covered in your Clustered Index

Keep in mind that once you have created your Indexes, rather it be Clustered or Non-Clustered you can always revise them to meet your current needs. Also, remember that your needs will change and this often times will require revising your Indexes to meet those needs!

Conclusion:

Indexes are here to help you; however they are a complicated concept to master. In most cases it is easiest to figure out what columns qualify for being indexed and then to simply try them out. Don’t be afraid to try out a few different combinations of Clustered and Non-Clustered Indexes; performance is never a one size fits all. Your indexes should not be determined with that type of mind set.

I would also suggest setting a personal reminder or adding to your occasional checklist to check your indexes and how the table is accessing your indexes and to see if there is a way to improve the performance of your queries with the indexes for that table. Indexes will always be evolving, your queries will always be evolving, and your tables will always be evolving…don’t let me evolve without your intervention.

Remember that Speed = Happy Users…to me a Happy User = a Happy DBA! =)

Until next time, happy coding!