
I have this query I'm running in MySQL:

SELECT
    count(*)
FROM
    library AS l
    JOIN plays AS p ON p.user_id = l.user_id AND
    l.path = p.path
WHERE
    l.user_id = 20977 AND
    p.time >= '2022-10-17';

When EXPLAIN ANALYZE is run:

| -> Aggregate: count(0)  (cost=1085653.55 rows=6692) (actual time=12576.265..12576.266 rows=1 loops=1)
    -> Nested loop inner join  (cost=1084984.37 rows=6692) (actual time=40.604..12566.569 rows=56757 loops=1)
        -> Index lookup on l using user_id_2 (user_id=20977)  (cost=116747.95 rows=106784) (actual time=13.153..3783.204 rows=59631 loops=1)
        -> Filter: ((p.user_id = 20977) and (p.`time` >= TIMESTAMP'2022-10-17 00:00:00'))  (cost=8.24 rows=0) (actual time=0.135..0.147 rows=1 loops=59631)
            -> Index lookup on p using path (path=l.`path`)  (cost=8.24 rows=8) (actual time=0.090..0.146 rows=1 loops=59631)
 |
1 row in set (12.76 sec)

I obviously want to make this faster!

Table definitions

CREATE TABLE `library` (
  `user_id` int NOT NULL,
  `name` varchar(20) COLLATE utf8mb4_general_ci NOT NULL,
  `path` varchar(512) COLLATE utf8mb4_general_ci NOT NULL,
  `title` varchar(512) COLLATE utf8mb4_general_ci NOT NULL,
  `created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `edited` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `db_id` int NOT NULL,
  `tag` varchar(64) COLLATE utf8mb4_general_ci NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;

CREATE TABLE `plays` (
  `user_id` int DEFAULT NULL,
  `name` varchar(20) CHARACTER SET utf8 DEFAULT NULL,
  `path` varchar(512) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `time` datetime DEFAULT CURRENT_TIMESTAMP,
  `play_id` int NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;

ALTER TABLE `library`
  ADD PRIMARY KEY (`db_id`),
  ADD KEY `user_id_loc` (`user_id`,`name`,`path`(191)),
  ADD KEY `edited` (`edited`),
  ADD KEY `created` (`created`),
  ADD KEY `title` (`title`),
  ADD KEY `user_id` (`user_id`),
  ADD INDEX `user_id_by_title` (`user_id`, `title`);

ALTER TABLE `plays`
  ADD PRIMARY KEY (`play_id`),
  ADD KEY `user_id` (`user_id`,`name`,`path`(255)),
  ADD KEY `user_id_2` (`user_id`,`name`),
  ADD KEY `time` (`time`),
  ADD KEY `path` (`path`),
  ADD KEY `user_id_3` (`user_id`,`name`,`path`,`time`);

It looks like the killer is the looping over 59631 rows.

Would an index on (user_id, time) make it faster?
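
For reference, the index I'm thinking of would be created with something like this (the index name is just for illustration; I haven't added it yet):

ALTER TABLE plays ADD INDEX user_id_time (user_id, time);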

Interestingly, the user_id_2 index is actually an index on (user_id, title), rather than the plain user_id index. I'm not sure why user_id_2 is chosen given title isn't used in the query.

  • Your code has some errors (ERROR 1072 (42000): Key column 'id' doesn't exist in table library). I can't run it to try to reproduce the issue. This makes me doubt if any of the code is relevant to your query. Please post the actual code you're having trouble with. Or better yet, make a dbfiddle. Commented Oct 17, 2023 at 18:49
  • @BillKarwin Sorry. Truth is the column name is id. However I posted a previous question on SO where a lot of feedback was about column naming, so I tried to find and replace the column name, but I missed one out. I'll fix that so it's all user_id now. Commented Oct 18, 2023 at 10:53
  • @TheImpaler What do you mean by malformed? I assumed you meant some sort of syntax error, but it runs ok for me (MySQL). However your mention of "silently converts" makes me think you mean it's not optimal, rather than malformed. So what do you mean by fixing it? The criteria appears as it should - the rows from the plays table should only be brought back having p.time >= '2022-10-17'. Do you mean just the fact it should not be a LEFT JOIN because of the implicit criteria? Agree - but I was just trying to use the query I have - if this is the cause of the slowdown - great! Commented Oct 18, 2023 at 10:57
  • @DanGravell Yes, the LEFT in LEFT JOIN shouldn't be there. It misguides your analysis. From the optimizer's perspective there are big differences between an outer join (less options to optimize) and an inner join (more avenues for optimization). Commented Oct 18, 2023 at 13:47
  • The join issue doesn't confuse MySQL much. It automatically optimizes it as an inner join when it detects that the condition makes the left outer join function as an inner join. Cf. dev.mysql.com/doc/refman/8.0/en/outer-join-optimization.html Commented Oct 18, 2023 at 13:55

2 Answers


I tested your query and tried a different index in each table.

ALTER TABLE library ADD KEY bk1 (user_id, path); 

ALTER TABLE plays ADD KEY bk2 (user_id, path, time); 

EXPLAIN SELECT
    COUNT(*)
FROM
    library AS l USE INDEX (bk1)
    JOIN plays AS p USE INDEX (bk2)
      ON p.user_id = l.user_id 
      AND l.path = p.path
WHERE
    l.user_id = 20977 
    AND p.time >= '2022-10-17';

+----+-------------+-------+------------+------+---------------+------+---------+-------------------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref               | rows | filtered | Extra                    |
+----+-------------+-------+------------+------+---------------+------+---------+-------------------+------+----------+--------------------------+
|  1 | SIMPLE      | l     | NULL       | ref  | bk1           | bk1  | 4       | const             |    1 |   100.00 | Using index              |
|  1 | SIMPLE      | p     | NULL       | ref  | bk2           | bk2  | 2056    | const,test.l.path |    1 |   100.00 | Using where; Using index |
+----+-------------+-------+------------+------+---------------+------+---------+-------------------+------+----------+--------------------------+

The note "Using index" in each row of the EXPLAIN report shows that it's getting the benefit of a covering index for both tables.

I didn't use prefix index syntax, because that would spoil the covering index optimization. Prefix indexes aren't necessary for this example, because modern MySQL versions default to an InnoDB row format that supports index keys up to 3072 bytes, instead of the 767-byte limit of older MySQL defaults.
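
For comparison, a prefixed version of the plays index would look something like the line below (not something I'm recommending here; the 191-character prefix is arbitrary). With a prefix, MySQL has to read the clustered row to verify the rest of the path value, so the "Using index" note would disappear for that table:

ALTER TABLE plays ADD KEY bk2_prefix (user_id, path(191), time);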

In my test, I had zero rows in the tables I tested, so I had to use an index hint to make the optimizer choose my new indexes. In a table with a substantial number of rows, the optimizer might choose the new indexes on its own.
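
If the optimizer still prefers the old indexes once the tables contain real data, it may be worth refreshing the index statistics before resorting to index hints (just a suggestion):

ANALYZE TABLE library, plays;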

  • I'm creating the indexes to test now. Does this work because the join uses an index with the same prefix between the two new indexes - (user_id, path, ...) - and then the bk2 is ordered by time so that the time comparison is very fast? Commented Oct 18, 2023 at 14:13
  • The covering index optimization is simply that the two indexes contain all the columns needed for the query, so MySQL can skip reading rows from the table at all. The order of the columns in my indexes is important. Columns used for equality conditions are first (leftmost), then one column for a range condition is next, then more columns as needed for the covering index. Commented Oct 18, 2023 at 14:21
  • You might like my presentation How to Design Indexes, Really or the video. Commented Oct 18, 2023 at 14:22
  • Thanks - the telephone book analogy describes well how I understand this, but I hadn't thought about also including the other columns needed for the query. What if there's a second column after the time that were a range query, e.g. duration? Would that mandate another index, like ALTER TABLE plays ADD KEY bk3 (user_id, path, duration); and the query would be able to easily shift from bk2 to bk3 because it already has the (user_id, path) 'prefix'? Commented Oct 18, 2023 at 14:33
  • Yes, ideally the optimizer will choose the best index for each query. You may need multiple indexes that have some columns in common, to support different queries. Unfortunately, in practice sometimes the optimizer gets confused about indexes with common leading columns, and picks the wrong one. So we need to use index hint syntax sometimes (I try to avoid it unless absolutely necessary). Commented Oct 18, 2023 at 14:42

DROP these; they are in the way and/or redundant:

l: `user_id` (`user_id`),
p: `user_id` (`user_id`,`name`,`path`(255)),
p: `user_id_2` (`user_id`,`name`),

Add these:

l:  INDEX(user_id,  path)
p:  INDEX(user_id,  path, time)
p:  INDEX(user_id,  time, path)   -- see below

Change (MySQL 5.7/8.0 no longer needs the prefix kludge):

l:  `user_id_loc` (`user_id`,`name`,`path`)  -- tossing 191
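
As a sketch, that change could be applied like this (keeping the same index name; assumes the 5.7/8.0 default DYNAMIC row format so the long key fits):

ALTER TABLE library
    DROP KEY user_id_loc,
    ADD KEY user_id_loc (user_id, name, path);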

Try to avoid testing columns from different tables in the WHERE clause.

First I saw

    WHERE  l.user_id = 20977
      AND  p.time >= '2022-10-17';

and assumed that was the crux of the problem. But then I saw that you did not have INDEX(user_id, time) on p and that the tables are joined [partially] on user_id.

Suggest (to avoid my confusion) that you make this change:

    WHERE  l.user_id = 20977   -- >
    WHERE  p.user_id = 20977

The Optimizer should be smart enough to realize that, then use

p:  INDEX(user_id,  time, path)   -- as mentioned above
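
Applied to your query, that change would look something like:

SELECT COUNT(*)
    FROM  library AS l
    JOIN  plays AS p  ON p.user_id = l.user_id
                     AND p.path = l.path
    WHERE  p.user_id = 20977
      AND  p.time >= '2022-10-17';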

But, once you have done that, the query collapses to

SELECT COUNT(DISTINCT user_id, path)
    FROM plays
    WHERE  user_id = 20977
      AND  time >= '2022-10-17';

I think that it will say "Covering index skip scan for deduplication" to indicate that it is not actually scanning all 60K rows in plays, but hopping through the index!

However, if there are "plays" that do not have a corresponding entry in "library", then the count will be high by the number of missing user-play combos.
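
A quick way to check whether any such orphan plays exist for this user is an anti-join (a sketch using the existing columns):

SELECT COUNT(*)
    FROM  plays AS p
    LEFT JOIN  library AS l  ON l.user_id = p.user_id
                            AND l.path = p.path
    WHERE  p.user_id = 20977
      AND  p.time >= '2022-10-17'
      AND  l.db_id IS NULL;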

When the table has both INDEX(a) and INDEX(a,b):

  • When a query needs just (a), then either index will work.
  • When a query needs (a,b), the Optimizer is likely to pick (a) because it is smaller, failing to realize that the bigger index would be better.

For that reason, I suggested some of the DROPs.

The other reason for the DROPs is to get rid of "prefix" indexing (path(255)), which is counterproductive and/or no longer necessary.
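
And if, even after the DROPs, the Optimizer picks a narrower index for the collapsed query, an index hint goes right after the table name. A sketch (user_id_time_path is just a placeholder for whatever you name the (user_id, time, path) index):

SELECT COUNT(DISTINCT user_id, path)
    FROM plays FORCE INDEX (user_id_time_path)
    WHERE  user_id = 20977
      AND  time >= '2022-10-17';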

  • DROP these - but these indexes are used by other queries... so I can't really just drop them. I'll go ahead and test your other suggestions though. Commented Oct 19, 2023 at 14:39
  • I tried this, but it looks like it's not using the correct indexes (I removed the USE INDEX statements suggested by @BillKarwin because of the additional index you suggested above - I wasn't sure where this should be used). As a result it takes almost 7s - explain.depesz.com/s/htQ1 Commented Oct 19, 2023 at 17:03
  • @DanGravell - I addressed both of your comments in an addition to my Answer. – Rick James Commented Oct 19, 2023 at 20:54
  • thanks. As I said, I can't drop them, because they are required by other queries. I tried with USE INDEX again but I'm not sure where you're suggesting INDEX(user_id, time, path) should be used - I can only see the FROM and JOIN statements as possible places. Commented Oct 20, 2023 at 15:59
  • I also tried the alternative query - SELECT COUNT(DISTINCT... - but this took double the time. It appeared to only use the shortest user_id index. I changed it to use the INDEX(user_id, time, path) but that still takes 4s. Commented Oct 20, 2023 at 16:01
