0

I was asked for a demonstration of how searching on an indexed column is faster than searching for a string prefix, so I created a quick test but the results were surprising and I can't see why.

The database consists of one table (Products) with productName and Brand columns and a few other columns, just to bulk up the data with an index on Brand:

CREATE TABLE [dbo].[Products]
(
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [ProductName] [nvarchar](500) NOT NULL,
    [Brand] [nvarchar](100) NOT NULL,
    [Field1] [nvarchar](50) NULL,
    [Field2] [nvarchar](50) NULL,
    [Field3] [nvarchar](50) NULL,
    [Field4] [nvarchar](50) NULL,
    [Field5] [nvarchar](50) NULL,
    Ix_Brands index(Brand),

    CONSTRAINT [PK_Products] 
        PRIMARY KEY CLUSTERED ([ID] ASC)
)

I then get the products by brand using 4 different methods and time how long each one takes.

using System;
using System.Data;
using System.Data.SqlClient;
using System.Linq;

namespace SpeedTest
{
    internal class Program
    {
        static void Main(string[] args)
        {
            var connectionString = "data source=.<Redacted>";
            string[] Brands = new string[] { "Tesco", "Asda", "Boots", "Morrisons", "Amazon", "Ebay" };

            var rnd = new Random();
            DateTime startTime;

            using (var con = new SqlConnection(connectionString))
            {
                var cmd = new SqlCommand("delete from products", con);
                con.Open();
                cmd.ExecuteNonQuery();

                Console.WriteLine("Creating 100,000 products");

                for (int i = 0; i < 100000; i++)
                {
                    var brand = Brands[rnd.Next(Brands.Length)];
                    cmd.CommandText = $"insert into products(productName, brand, field1, field2, field3, field4, field5) values ('{brand}_{Guid.NewGuid()}', '{brand}', '{Guid.NewGuid()}', '{Guid.NewGuid()}', '{Guid.NewGuid()}', '{Guid.NewGuid()}', '{Guid.NewGuid()}')";
                    cmd.ExecuteNonQuery();
                }

                Console.WriteLine("Getting products by brand via ADO and product name prefix");
                startTime = DateTime.Now;

                foreach (var brand in Brands)
                {
                    cmd.CommandText = $"select * from products where productName like '{brand}_%'";
                    var da = new SqlDataAdapter(cmd);
                    var dt = new DataTable();
                    da.Fill(dt);
                }

                Console.WriteLine($"Time taken: {(DateTime.Now - startTime).TotalMilliseconds}ms");
                Console.WriteLine("Getting products by brand via ADO and indexed brand column");
                startTime = DateTime.Now;

                foreach (var brand in Brands)
                {
                    cmd.CommandText = $"select * from products where brand='{brand}'";
                    var da = new SqlDataAdapter(cmd);
                    var dt = new DataTable();
                    da.Fill(dt);
                }

                Console.WriteLine($"Time taken: {(DateTime.Now - startTime).TotalMilliseconds}ms");
                con.Close();
            }

            var db = new SpeedTestEntities();
            Console.WriteLine("Getting products by brand via entity framework and product name prefix");
            startTime = DateTime.Now;

            foreach (var brand in Brands)
            {
                var products = db.Products.Where(p => p.ProductName.StartsWith(brand + "_")).ToList();
            }

            Console.WriteLine($"Time taken: {(DateTime.Now - startTime).TotalMilliseconds}ms");
            Console.WriteLine("Getting products by brand via entity framework and indexed brand column");
            startTime = DateTime.Now;

            foreach (var brand in Brands)
            {
                var products = db.Products.Where(p => p.Brand.Equals(brand, StringComparison.OrdinalIgnoreCase)).ToList();
            }

            Console.WriteLine($"Time taken: {(DateTime.Now - startTime).TotalMilliseconds}ms");
            Console.ReadLine();
        }
    }
}

The Entity Framework results were about what I was expecting, but the ADO results showed searching on the indexed column was slower than on the product name prefix which surely can't be right:

Creating 100,000 products
Getting products by brand via ADO and product name prefix
Time taken: 558.9306ms
Getting products by brand via ADO and indexed brand column
Time taken: 642.5258ms
Getting products by brand via entity framework and product name prefix
Time taken: 3266.8438ms
Getting products by brand via entity framework and indexed brand column
Time taken: 204.932ms

I must have messed up somewhere but I can't see where. My demonstration of why we should be adding indexed columns to our database tables rather than prefixing other strings is going badly.

Can anyone rescue me and see what's going on here?

EDIT: it turns out that both ADO queries are doing an index scan on PK_Products. Both execution plans are the same. This surprises me and I thought adding an indexed column would certainly be faster but apparently not.

Execution plan

7
  • 2
    For performance please include the execution plans in question. Commented May 24 at 14:07
  • 3
    You do select * so there's gonna be a lot of key lookups when doing brand index, especially if it's not unique, sql server might decide to use the PK for all your searches. Also, it should be N'brand', or even better you should parametrize properly if not already doing it. Commented May 24 at 14:13
  • If you want to measure the time something takes, use Stopwatch, it is much more accurate than DateTime.
    – JonasH
    Commented May 24 at 14:16
  • ^^ ... and if you can use Benchmark.Dotnet but for DB benchmarks use the DB's tools.
    – Fildor
    Commented May 24 at 14:17
  • If I am reading correctly, EF Core meets your expection. So, I'd guess it puts together a better query? You can look at what it comes up with, though and diff to your ADO sql.
    – Fildor
    Commented May 24 at 14:19

2 Answers 2

1

Look at your execution plans. You've got too many rows in each brand for the index to be useful. So you're getting table scans in both cases.

Since you're fetching all the columns, the plan using the index would have to perform a lookup for each row. And a scan is faster than a large number of clustered index lookups.

enter image description here

And removing EF (which isn't relevant for SQL Server performance), the table scan with the LIKE was slightly slower:

Creating 100,000 products
Getting products by brand via ADO and product name prefix
Time taken: 686.7718ms rows 100000
Getting products by brand via ADO and indexed brand field
Time taken: 548.8414ms rows 100000

Also you're mostly measuring how long it takes to send 100,000 rows from SQL Server to the client over the network. If you want to measure the cost of queries, look at the actual execution plan. In the XML it tells you the CPU cost and all the waits. EG here it's

   <WaitStats>
      <Wait WaitType="ASYNC_NETWORK_IO" WaitTimeMs="257" WaitCount="1565" />
    </WaitStats>
    <QueryTimeStats CpuTime="39" ElapsedTime="295" />

Which shows you that this query took 295ms of elapsed time, but only 39ms of CPU time. The rest was time spent sending data to the client (ASYNC_NETWORK_IO).

-1

You can see the index as a sort of table which maps each unique indexed value to the list of records which have that value (records are represented by their internal unique identifier).

When you select by an indexed value, the DB has to scan the list of row IDs, and access the real table by that ID. You have 6 brands, which means each brand has 17% of total rows in the table. If this 17% means 17000 rows, then the DB has to access 17000 times the table.

When you select without an index, the DB makes a full scan on the table, so it reads all records in sequence.

Reading N records in sequence is much faster than reading N records "randomly", especially if your records have variable size, because the record position cannot be guessed.

Let's make an example. Suppose that:

  • your table has 100.000 records
  • reading a record in sequence takes 1 ns
  • reading one random record takes 10 ns on average

If your query has to read 1 record by full scan, than it will take 1 x 100.000 = 100.000 ns.

If your query has to read 1 record by index, than it will take 10 ns, which is much better.

If your query has to read 17.000 records by full scan, than it will take 1 x 100.000 = 100.000 ns (same as 1 record).

If your query has to read 17.000 records by index, it will take 10 x 17.000 = 170.000 which is worst than a full scan.

The DB should be smart enough to run a full scan instead of a indexed access even if your query can use an index, but this is subjected to the implementation, and accuracy of statistics. Table maintenance can be used to improve statistics. As suggested in comments, you should look at execution plan to check how the table is accessed.

Some tips:

  • try to use fixed sized types for all fields (i.e. char instead of varchar).
  • try to increase the number of brands, so that each brand has a lower percentage of records out of total.
  • try to run each test 2 times and measure just the duration of the second run. This will ensure that the index is loaded to memory (if it can fit the cache) before running the query (this is required because the table is recreated at the beginning of your test).

If you get better results by applying any of the above tips, please report it in the comments.

Not the answer you're looking for? Browse other questions tagged or ask your own question.