Why use the INCLUDE clause when creating an index?

Question

While studying for the 70-433 exam I noticed you can create a covering index in one of the following two ways.

CREATE INDEX idx1 ON MyTable (Col1, Col2, Col3)

-- OR --

CREATE INDEX idx1 ON MyTable (Col1) INCLUDE (Col2, Col3)

The INCLUDE clause is new to me. Why would you use it and what guidelines would you suggest in determining whether to create a covering index with or without the INCLUDE clause?

JonH · Accepted Answer · 2020-11-19 18:43:58Z

485

Use INCLUDE when the column is not in the WHERE/JOIN/GROUP BY/ORDER BY, but only in the column list of the SELECT clause.

The INCLUDE clause adds the data at the lowest/leaf level, rather than in the index tree. This makes the index smaller because the included columns are not part of the tree.

INCLUDE columns are not key columns in the index, so they are not ordered. This means they aren't really useful for predicates, sorting etc., as I mentioned above. However, they may be useful if you have a residual lookup in a few rows from the key column(s).
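A minimal sketch of that rule, assuming a hypothetical dbo.Orders table (the table, column and index names are made up for illustration and are not from the original answer):

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE dbo.Orders (
    OrderID     int IDENTITY(1,1) PRIMARY KEY,  -- clustered by default
    CustomerID  int           NOT NULL,
    OrderDate   date          NOT NULL,
    TotalAmount decimal(10,2) NOT NULL
);

-- CustomerID is filtered on, so it is a key column.
-- OrderDate and TotalAmount are only returned, so they are INCLUDEd.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID)
    INCLUDE (OrderDate, TotalAmount);

-- Covered: seek on CustomerID, remaining columns read from the leaf level.
SELECT OrderDate, TotalAmount
FROM dbo.Orders
WHERE CustomerID = 42;
```

If the query also needed ORDER BY OrderDate, the included column would not supply that ordering; in that case OrderDate would probably belong in the key instead.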

Another MSDN article with a worked example

edited Nov 19, 2020 at 18:43

JonH


answered Aug 20, 2009 at 18:31

gbn


9

So then, this would be a technique for creating a less expensive version of a covered index?
– JMarsch
Commented Sep 15, 2012 at 2:56
5

@gbn, would you mind explaining this sentence in more detail, and explain why it means that the include clause is not useful for sorting, etc: "The INCLUDE clause adds the data at the lowest/leaf level, rather than in the index tree. This makes the index smaller because it's not part of the tree"
– Tola Odejayi
Commented May 7, 2013 at 21:20
5

@JMarsch: sorry for the late reply, but yes, this is exactly what it is.
– gbn
Commented May 8, 2013 at 7:45
11

@Tola Odejayi: INCLUDE columns are not key columns in the index, so they are not ordered. This makes them not typically useful for JOINs or sorting. And because they are not key columns, they don't sit in the whole B-tree structure like key columns
– gbn
Commented May 8, 2013 at 7:51
6

While this is the most accepted answer, I think further explanation is needed: what if the column is part of the SELECT for some queries and not for others?
– Chisko
Commented Mar 8, 2017 at 3:37


xorinzor · Accepted Answer · 2023-01-11 12:45:15Z

269

You would use the INCLUDE to add one or more columns to the leaf level of a non-clustered index, if by doing so, you can "cover" your queries.

Imagine you need to query for an employee's ID, department ID, and lastname.

SELECT EmployeeID, DepartmentID, LastName
FROM Employee
WHERE DepartmentID = 5

If you happen to have a non-clustered index on (EmployeeID, DepartmentID), once you find the employees for a given department, you now have to do a "bookmark lookup" to get the actual full employee record, just to get the lastname column. That can get pretty expensive in terms of performance if you find a lot of employees.

If you had included that lastname in your index:

CREATE NONCLUSTERED INDEX NC_EmpDep 
  ON Employee(DepartmentID)
  INCLUDE (Lastname, EmployeeID)

then all the information you need is available in the leaf level of the non-clustered index. Just by seeking in the non-clustered index and finding your employees for a given department, you have all the necessary information, and the bookmark lookup for each employee found in the index is no longer necessary, so you save a lot of time.

Obviously, you cannot include every column in every non-clustered index - but if you do have queries which are missing just one or two columns to be "covered" (and that get used a lot), it can be very helpful to INCLUDE those into a suitable non-clustered index.
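One way to see the effect for yourself is to compare logical reads before and after creating the index. A rough sketch, assuming the Employee table from the example above exists in your database:

```sql
-- Report page reads per table for each statement in this session.
SET STATISTICS IO ON;

SELECT EmployeeID, DepartmentID, LastName
FROM Employee
WHERE DepartmentID = 5;

SET STATISTICS IO OFF;
```

Run it once without the covering index and once with it, and compare the "logical reads" figure reported for Employee in the messages output. The actual numbers depend entirely on your data, so treat this as a way to check rather than a prediction.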

edited Jan 11, 2023 at 12:45

xorinzor


answered Aug 20, 2009 at 19:27

marc_s


27

Are you sure you'd use this index? Why EmployeeID? You only need DepartmentID in the key columns? You have been quoted here as authoritative: stackoverflow.com/q/6187904/27535
– gbn
Commented May 31, 2011 at 13:05
4

Your explanation is good but doesn't actually line up with the use case that you outline. The key column(s) should be on the filter or JOIN keys in the query, and the INCLUDEs need to be the data you are retrieving but not sorting.
– JNK
Commented Feb 1, 2012 at 13:53
22

First of all, the index Employee(EmployeeID, DepartmentID) will not be used to filter DepartmentID = 5, because its key order does not match.
– AnandPhadke
Commented Apr 2, 2013 at 11:38
It's been a decade but this still comes up high in search results. Agree with the criticism of the example even though the description seems accurate, so downvoting because the example SELECT is so blatantly inconsistent with the suggested index.
– EGP
Commented Jan 1, 2023 at 17:05
I edited the answer to reflect the correct index that should have been created (as also pointed out by @AnandPhadke)
– xorinzor
Commented Jan 11, 2023 at 12:47


kevinbatchcom · Accepted Answer · 2016-07-07 23:08:40Z

This discussion is missing the important point: the question is not whether the "non-key columns" are better placed in the index key or in the included columns.

The question is how expensive it is to use the include mechanism for columns that are not really needed in the index (typically not part of WHERE clauses, but often part of SELECT lists). So your dilemma is always:

1. Use an index on id1, id2 ... idN alone, or
2. Use an index on id1, id2 ... idN plus include col1, col2 ... colN

Where: id1, id2 ... idN are columns often used in restrictions and col1, col2 ... colN are columns often selected, but typically not used in restrictions

(The option of putting all of these columns into the index key is always silly, unless they are also used in restrictions, because the index would be more expensive to maintain: it must be updated and re-sorted even when the actual "keys" have not changed.)

So use option 1 or 2?

Answer: If your table is rarely updated (mostly inserted into or deleted from), then it is relatively inexpensive to use the include mechanism for some "hot columns" (often used in selects, but rarely in restrictions). Inserts and deletes require the index to be updated anyway, so storing a few extra columns adds little overhead while the index is already being modified. The cost is the extra memory and CPU used to store redundant data in the index.

If the columns you are considering as included columns are updated often (without the index key columns being updated), or if there are so many of them that the index becomes close to a copy of your table, I'd suggest option 1. Also, if adding certain included columns turns out to make no performance difference, you might want to skip them. Verify that they are useful!
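To get a rough feel for the read/write balance of each index, SQL Server keeps usage counters in a DMV. A sketch, assuming the table is called dbo.MyTable (the schema name is an assumption):

```sql
-- How often each index on the table is read (seeks/scans/lookups)
-- versus how often it has to be maintained (updates).
SELECT i.name AS index_name,
       s.user_seeks,
       s.user_scans,
       s.user_lookups,
       s.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON s.object_id   = i.object_id
      AND s.index_id    = i.index_id
      AND s.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID(N'dbo.MyTable');
```

These counters are reset when the instance restarts, so they only describe the workload since then. An index with many user_updates and few reads is a candidate for having fewer included columns, or for not existing at all.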

The average number of rows per same values in keys (id1, id2 ... idN) can be of some importance as well.

Note that if an included column is used in a restriction, then as long as the index can be used at all (based on the restrictions against the index key columns), SQL Server matches that restriction against the leaf-level values in the index instead of going the expensive way around through the table itself.
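As an illustration of that last point, here is a sketch that reuses the Employee table from an earlier answer and assumes an index shaped like Employee (DepartmentID) INCLUDE (LastName, EmployeeID):

```sql
-- Assumes an index like: Employee (DepartmentID) INCLUDE (LastName, EmployeeID)
SELECT EmployeeID, LastName
FROM Employee
WHERE DepartmentID = 5        -- key column: used to seek into the index
  AND LastName LIKE 'S%';     -- included column: checked against leaf-level values
```

The LastName filter cannot narrow the seek range, but it can be evaluated on the index rows found for DepartmentID = 5, so the base table does not need to be visited. Whether the optimizer actually picks this plan depends on your data and statistics.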

I am using a column that is heavily used. When I "INCLUDE" it, my queries become more than twice as fast (yes, simply by using INCLUDE). However, the column is also being updated regularly. Am I doing the wrong thing? – Sam, Commented Jun 23, 2022 at 10:54
@Sam Check the performance difference on updates/inserts and deletes and this will answer your question – Mariusz, Commented Sep 28, 2022 at 14:53
@Mariusz But how do I measure this performance? It is OK when the data is a few thousand rows, but when it is a few hundred million, that may affect the outcome (hence the question). It is not practical to remove the index, test, and reindex. Therefore, I am asking for an expert opinion... – Sam, Commented Oct 1, 2022 at 4:10
@Sam What I was trying to say is to test the system as close to real life as possible. If you expect millions of records, you can create them programmatically and run your tests. You know best what you are building – Mariusz, Commented Oct 4, 2022 at 14:20
If you halve your query time with the include, then as long as the insert/update/delete performance is acceptable both now and for foreseeable future needs, then of course use the include and don't overthink it. Time spent on optimization that isn't reasonably needed is wasted time. That said, be sure that the halving of query time isn't merely due to some other factor, such as statistics being updated when you add the index after previously being stale. – EGP, Commented Jan 1, 2023 at 17:09

onupdatecascade · Accepted Answer · 2009-08-20 18:53:30Z

Basic index columns are sorted, but included columns are not sorted. This saves resources in maintaining the index, while still making it possible to provide the data in the included columns to cover a query. So, if you want to cover queries, you can put the search criteria to locate rows into the sorted columns of the index, but then "include" additional, unsorted columns with non-search data. It definitely helps with reducing the amount of sorting and fragmentation in index maintenance.

Markus Winand · Accepted Answer · 2019-05-30 10:53:05Z

One reason to prefer INCLUDE over key columns, if you don't need that column in the key, is documentation. That makes evolving indexes much easier in the future.

Considering your example:

CREATE INDEX idx1 ON MyTable (Col1) INCLUDE (Col2, Col3)

That index is best if your query looks like this:

SELECT col2, col3
  FROM MyTable
 WHERE col1 = ...

Of course you should not put columns in INCLUDE if you can get an additional benefit from having them in the key part. Both of the following queries would actually prefer the col2 column in the key of the index.

SELECT col2, col3
  FROM MyTable
 WHERE col1 = ...
   AND col2 = ...

SELECT TOP 1 col2, col3
  FROM MyTable
 WHERE col1 = ...
 ORDER BY col2

Let's assume this is not the case and we have col2 in the INCLUDE clause because there is just no benefit of having it in the tree part of the index.

Fast forward some years.

You need to tune this query:

SELECT TOP 1 col2
  FROM MyTable
 WHERE col1 = ...
 ORDER BY another_col

To optimize that query, the following index would be great:

CREATE INDEX idx1 ON MyTable (Col1, another_col) INCLUDE (Col2)

If you check what indexes you have on that table already, your previous index might still be there:

CREATE INDEX idx1 ON MyTable (Col1) INCLUDE (Col2, Col3)

Now you know that Col2 and Col3 are not part of the index tree and are thus used neither to narrow the read index range nor to order the rows. It is rather safe to add another_col to the end of the key part of the index (after col1). There is little risk of breaking anything:

DROP INDEX idx1 ON MyTable;
CREATE INDEX idx1 ON MyTable (Col1, another_col) INCLUDE (Col2, Col3);

That index will become bigger, which still has some risks, but it is generally better to extend existing indexes compared to introducing new ones.
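As a side note, SQL Server can also do the drop and re-create in a single statement. A sketch, assuming idx1 already exists on MyTable:

```sql
-- Rebuilds idx1 with the new definition in one statement.
-- Fails if no index named idx1 exists on MyTable.
CREATE INDEX idx1
    ON MyTable (Col1, another_col)
    INCLUDE (Col2, Col3)
    WITH (DROP_EXISTING = ON);
```

This is SQL Server syntax; other databases that support INCLUDE may not offer an equivalent option.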

If you had an index without INCLUDE, you could not know which queries you would break by adding another_col right after Col1.

CREATE INDEX idx1 ON MyTable (Col1, Col2, Col3)

What happens if you add another_col between Col1 and Col2? Will other queries suffer?

There are other "benefits" of INCLUDE vs. key columns if you add those columns just to avoid fetching them from the table. However, I consider the documentation aspect the most important one.

To answer your question:

what guidelines would you suggest in determining whether to create a covering index with or without the INCLUDE clause?

If you add a column to the index for the sole purpose of having that column available in the index without visiting the table, put it into the INCLUDE clause.

If adding the column to the index key brings additional benefits (e.g. for order by or because it can narrow the read index range) add it to the key.
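To check how the existing indexes on a table split their columns between key and INCLUDE, you can query the catalog views. A sketch for SQL Server, assuming the table is dbo.MyTable (the schema name is an assumption):

```sql
-- Lists every index column on the table and whether it is a key
-- column or an included (non-key) column.
SELECT i.name  AS index_name,
       c.name  AS column_name,
       ic.key_ordinal,          -- position in the key; 0 for included columns
       ic.is_included_column    -- 1 = INCLUDE column, 0 = key column
FROM sys.indexes AS i
JOIN sys.index_columns AS ic
  ON ic.object_id = i.object_id
 AND ic.index_id  = i.index_id
JOIN sys.columns AS c
  ON c.object_id = ic.object_id
 AND c.column_id = ic.column_id
WHERE i.object_id = OBJECT_ID(N'dbo.MyTable')
ORDER BY i.name, ic.is_included_column, ic.key_ordinal;
```

Columns with is_included_column = 1 are the ones that, by the reasoning above, were added only to avoid the table access.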

You can read a longer discussion about this here:

https://use-the-index-luke.com/blog/2019-04/include-columns-in-btree-indexes

mrdenny · Accepted Answer · 2009-08-22 05:08:40Z

7

The reasons why (including the data in the leaf level of the index) have been nicely explained. The reason you should care is this: when you run your query without the additional columns included (a feature new in SQL Server 2005), SQL Server has to go to the clustered index to get the additional columns. That takes more time and adds more load to the SQL Server service, the disks, and the memory (the buffer cache, to be specific), as new data pages are loaded into memory, potentially pushing other, more frequently needed data out of the buffer cache.

answered Aug 22, 2009 at 5:08

mrdenny


is there a way to prove that it is actually using less memory? it's what i'd expect too but i'm getting some static about this at work
– Asken
Commented Nov 16, 2012 at 15:06
Given that you have to load the page from the heap or clustered index into memory as well as the index page, you are putting duplicate data into memory, so the math becomes pretty simple. As for a way to specifically measure it: no, there's not.
– mrdenny
Commented Nov 16, 2012 at 23:50


double-beep · Accepted Answer · 2024-06-18 14:56:34Z

7

Included columns can be of data types that are not allowed as index key columns, such as varchar(max).

This allows you to include such columns in a covering index. I recently had to do this to provide a Hibernate generated query, which had a lot of columns in the SELECT, with a useful index.
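A small sketch of that restriction, using a hypothetical dbo.Documents table (all names here are made up for illustration):

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE dbo.Documents (
    DocumentID int IDENTITY(1,1) PRIMARY KEY,
    AuthorID   int          NOT NULL,
    Body       varchar(max) NULL
);

-- Not allowed: varchar(max) cannot be an index key column,
-- so this statement would be rejected.
-- CREATE INDEX IX_Documents_Body ON dbo.Documents (AuthorID, Body);

-- Allowed: the same column as a non-key (included) column.
CREATE NONCLUSTERED INDEX IX_Documents_AuthorID
    ON dbo.Documents (AuthorID)
    INCLUDE (Body);
```

Bear in mind that including a large column copies its data into the index, so the index can grow substantially.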

edited Jun 18 at 14:56

double-beep


answered Oct 21, 2013 at 11:03

Nibor



mEmENT0m0RI · Accepted Answer · 2011-03-01 02:24:59Z

There is a limit to the total size of all columns inlined into the index definition. That said, I have never had to create an index that wide. To me, the bigger advantage is that you can cover more queries with one index that has included columns, as they don't have to be defined in any particular order. Think of it as an index within the index. One example would be StoreID (where StoreID has low selectivity, meaning that each store is associated with a lot of customers) and then customer demographics data (LastName, FirstName, DOB): if you just inline those columns in this order (StoreID, LastName, FirstName, DOB), you can only efficiently search for customers for which you know StoreID and LastName.

On the other hand, defining the index on StoreID and including the LastName, FirstName and DOB columns would let you, in essence, filter in two steps: a seek predicate on StoreID and then a residual predicate on any of the included columns, evaluated on the rows found for that store. This would let you cover all possible search permutations as long as the search starts with StoreID.
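A sketch of the StoreID example, assuming a hypothetical dbo.Customers table with the columns mentioned above (the data types, index name and literal values are my assumptions):

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE dbo.Customers (
    CustomerID int IDENTITY(1,1) PRIMARY KEY,
    StoreID    int          NOT NULL,
    LastName   nvarchar(50) NOT NULL,
    FirstName  nvarchar(50) NOT NULL,
    DOB        date         NOT NULL
);

CREATE NONCLUSTERED INDEX IX_Customers_StoreID
    ON dbo.Customers (StoreID)
    INCLUDE (LastName, FirstName, DOB);

-- Both queries can be answered from the index alone: the rows for the
-- store are located via the key, and the second filter is applied to
-- the included values of those rows.
SELECT LastName, FirstName, DOB
FROM dbo.Customers
WHERE StoreID = 7 AND DOB = '1980-01-15';

SELECT LastName, FirstName, DOB
FROM dbo.Customers
WHERE StoreID = 7 AND FirstName = N'Ana';
```

Note that the filter on the included column does not narrow the range read from the index; every index row for the store is still examined. This works well when the number of rows per StoreID is manageable.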
