Common problems with SQL - Indexes

And now let's check some problems with the SQL queries and indexes. There are two basic problems related to indexes - using an index when the index should not be used (false positive) and not using an index when it should be used (false negative). If you don't know what indexes are or if you're not sure how they work, read something about it - for example my article basic principles of database indexes describes basics of indexes.

The former case (false positive) is usually caused by administrator mistakes (stale statistics or improper settings of *_cost variables, which results in inability to compute precise estimate of execution plan cost, etc.). Stale statistics were described in the part devoted to the administrator mistakes, so I won't analyze that again.

Regarding the latter case, the scale of possible causes is much wider - besides already mentioned stale statistics there are many ways how to screw the SQL query so that the index may not be used. And some of those ways will be described in the following sections ...

Missing index

The first case is quite simple - although the query literally begs for an index, no suitable index exists. There ase several possible causes - for example the table was very small originally so the index would not be very useful (for small tables a sequential scan of the whole table is usually much more effective) and noone expected the table to grow so much.

Anyway once you discover a query spending most of the time by sequential scan of a table, and the processing of a condition may be improved using an index, it's more than time to consider creating it. Let's create a table containing 1.000.000 rows:

CREATE TABLE test_table (id INTEGER, value VARCHAR(255));
INSERT INTO test_table SELECT i, REPEAT(MD5(i::text), 5)
                         FROM generate_series(1,1000000) AS s(i);
ANALYZE test_table;

Now let's try to search out a single row containing a given value in the ID column:

db=# EXPLAIN ANALYZE SELECT * FROM test_table WHERE id = 741283;

                              QUERY PLAN
----------------------------------------------------------------------
 Seq Scan on test_table  (cost=0.00..36891.39 rows=1 width=168)
              (actual time=11447.660..15196.489 rows=1 loops=1)
   Filter: (id = 741283)
 Total runtime: 15196.525 ms
(3 rows)

Well, and now let's create an index on the ID column, and check the execution plan again:

db=# CREATE INDEX test_table_idx ON test_table(id);
db=# EXPLAIN ANALYZE SELECT * FROM test_table WHERE id = 741283;

                              QUERY PLAN
----------------------------------------------------------------------
 Index Scan using test_table_idx on test_table
       (cost=0.00..8.36 rows=1 width=168)
       (actual time=0.235..0.237 rows=1 loops=1)
   Index Cond: (id = 741283)
 Total runtime: 0.265 ms
(3 rows)

As you can see, the index was used and the impact upon performance is huge, no matter if we compare the real time necessary to process the query or the abstract cost estimate involving I/O or CPU operations, etc.

Note: Although it seems that adding an index into a database is trivial and trouble free task, it may not be the case. For example in case of databases with a lot of modifications (inserting new rows, deleting old rows and updating existing ones) this may be a problem - each index brings some overhead when modifying data and maintenance. On the other hand the index may be used when performing the modifications (for example when deleting the rows using a WHERE condition). In general adding an index may change execution plans of other queries. Adding an index to a production database without proper testing is acceptable only if solving urgent performance problems - otherwise it should be tested just like all the other changes.

Low selectiveness of a condition

Well - a missing index is a really trivial problem. Lets check a more complicated situation when the index exists, but is not used for some reason. First possible case is related to the fact that using an index improves the performance only when reading a small fraction of a table - if too large portion of the table match the condition (the actual number of rows given by expression "too large portion of the table" depends on cost variables, data types, etc.), it's more effective to read the whole table sequentially and then evaluate the condition.

This is caused by differences between sequential and random access to the data stored on a hard drive, details may be found for example in basic principles of database indexes.

Consider another table test_table with 1.000.000 rows an indexe on a column ID:

CREATE TABLE test_table (id INTEGER, value VARCHAR(255));
INSERT INTO test_table SELECT i, REPEAT(MD5(i::text), 5)
                         FROM generate_series(1,1000000) AS s(i);
CREATE INDEX test_table_idx ON test_table(id);
ANALYZE test_table;

and now let's try to search out rows with ID less than 50.000:

db=# EXPLAIN ANALYZE SELECT * FROM test_table WHERE id <= 50000;

                              QUERY PLAN
----------------------------------------------------------------------
 Index Scan using test_table_idx on test_table
       (cost=0.00..2167.15 rows=42675 width=168)
       (actual time=0.084..114.564 rows=49999 loops=1)
   Index Cond: (id < 50000)
 Total runtime: 173.257 ms
(3 rows)

As expected an index scan was used - if we increase the limit enough (say to 800.000), we'll get a different execution plan:

db=# EXPLAIN ANALYZE SELECT * FROM test_table WHERE id <= 800000;

                              QUERY PLAN
----------------------------------------------------------------------
 Seq Scan on test_table
     (cost=0.00..36891.39 rows=806072 width=168)
     (actual time=0.032..1298.406 rows=799999 loops=1)
   Filter: (id < 800000)
 Total runtime: 2204.269 ms
(3 rows)

Yes, a sequential scan is chosen - we can check that the sequential scan is more effective than an index scan by "disabling" sequential scans using enable_seqscan variable (in reality it's not possible to disable sequential scans - this just means a huge penalization when computing a cost estimate).

db=# SET enable_seqscan = off;
db=# EXPLAIN ANALYZE SELECT * FROM test_table WHERE id <= 800000;

                              QUERY PLAN
----------------------------------------------------------------------
 Index Scan using test_table_idx on test_table
       (cost=0.00..40858.60 rows=806072 width=168)
       (actual time=0.093..1303.429 rows=799999 loops=1)
   Index Cond: (id < 800000)
 Total runtime: 2236.221 ms
(3 rows)

So the sequential scan is faster than the forced index scan (although the difference is quite small).

If we create a more complex (and mainly more "large") index, for example index on columns (id, value), the results are even more interesting and the sequential scan will be more effective "sooner":

CREATE TABLE test_table (id INTEGER, value VARCHAR(255));
INSERT INTO test_table SELECT i, REPEAT(MD5(i::text), 5)
                         FROM generate_series(1,1000000) AS s(i);
CREATE INDEX test_table_val_idx ON test_table(id, value);
ANALYZE test_table;

At the beginning the index scan is more effective just like before:

db=# EXPLAIN ANALYZE SELECT * FROM test_table WHERE id <= 10000;

                              QUERY PLAN
----------------------------------------------------------------------
 Index Scan using test_table_idx on test_table
       (cost=0.00..13307.57 rows=8138 width=168)
       (actual time=0.023..16.709 rows=10000 loops=1)
   Index Cond: (id <= 10000)
 Total runtime: 28.436 ms
(3 rows)

After increasing the value to 50.000 a bitmap index scan will be used instead of a plain index scan:

db=# EXPLAIN ANALYZE SELECT * FROM test_table WHERE id <= 50000;

                              QUERY PLAN
----------------------------------------------------------------------
 Bitmap Heap Scan on test_table
        (cost=5294.69..30418.82 rows=47665 width=168)
        (actual time=11.261..75.096 rows=50000 loops=1)
   Recheck Cond: (id <= 50000)
   ->  Bitmap Index Scan on test_table_idx
              (cost=0.00..5282.77 rows=47665 width=0)
              (actual time=10.956..10.956 rows=50000 loops=1)
         Index Cond: (id <= 50000)
 Total runtime: 128.998 ms
(5 rows)

And after another increase the sequential scan begins to win again:

db=# EXPLAIN ANALYZE SELECT * FROM test_table WHERE id <= 200000;

                              QUERY PLAN
----------------------------------------------------------------------
 Seq Scan on test_table
     (cost=0.00..36891.39 rows=197642 width=168)
     (actual time=0.016..645.543 rows=200000 loops=1)
   Filter: (id <= 200000)
 Total runtime: 869.106 ms
(3 rows)

Note: The border values (crossing them means a different execution plan is chosen) is closely related to cost estimates, i.e. (a) statistics of tables and (b) variables indicating cost of the basic operations. If you use different values, your database may behave differently - for the examples listed above the default values were used, i.e. namely random_page_cost = 4 and seq_page_cost =1, but there are other *_cost variables - details may be found in the documentation.

What arises from all this? It is not true that a sequential scans are evil and should be eliminated for all cost. In some cases the sequential scan is actually faster and more effective than index scan. If you believe the index scan would be faster, it's necessary to verify it and then fix the cause eventually.

Verification is not difficult - just "disable" the sequential scan in your session, and execute the query again - an index scan should be used and you will immediately see if the performance improved or not. If the index scan is (notably) faster, you should fix the cause. You may tune the *_cost variables in the postgresql.conf (but be aware that this will affect all queries and setting unsuitable values may "break" them). So you should thoroughly test this change and verify that it actually improves the precision of estimates (and thus the proper execution plans are chosen).

Secondly, you may "fix" the problematic query by setting "correct" values to *_cost variables before executing the query, disable the sequential scan etc. But I don't think this is a good solution - you won't influence the other queries (as the values from postgresql.conf will be used), but the application will be messed up with inappropriate information about hardware and so on. What happens if you migrate the application to a different server? Will you browse all the source codes and change those values?

Third option represents a side-step - if your query contains multiple conditions for a given table, try to find one (maybe involving multiple columns) with a higher selectiveness and define an index on it.

Consider a table with two independently indexed columns with values between  0..9 in random combinations:

CREATE TABLE test_table (id1 INTEGER, id2 INTEGER);
INSERT INTO test_table SELECT mod(i, 10), round(random() * 9)
                         FROM generate_series(1,1000000) AS s(i)
CREATE INDEX test_table_idx1 ON test_table(id1);
CREATE INDEX test_table_idx2 ON test_table(id2);
ANALYZE test_table;

and check this query

SELECT * FROM test_table WHERE id1 < 5 AND id2 < 5;

However, 50% selectiveness on each of the columns is not enough, so a sequential scan will be used:

db=# EXPLAIN ANALYZE SELECT * FROM test_table WHERE id1 < 5 AND id2 < 5;

                              QUERY PLAN
----------------------------------------------------------------------
 Seq Scan on test_table
     (cost=0.00..28275.00 rows=256002 width=8)
     (actual time=80.712..682.514 rows=224585 loops=1)
   Filter: ((id1 < 5) AND (id2 < 5))
 Total runtime: 963.165 ms
(3 rows)

But if we define a multi-column index on both columns, the situation changes - the selectiveness decreases from 50% to 25%, and the index scan wins:

db=# CREATE INDEX test_table_idx ON test_table (id1, id2);
db=# EXPLAIN ANALYZE SELECT * FROM test_table WHERE id1 < 5 AND id2 < 5;

                              QUERY PLAN
----------------------------------------------------------------------
 Bitmap Heap Scan on test_table
        (cost=10432.36..27547.39 rows=256002 width=8)
        (actual time=76.400..400.486 rows=224585 loops=1)
   Recheck Cond: ((id1 < 5) AND (id2 < 5))
   ->  Bitmap Index Scan on test_table_idx
              (cost=0.00..10368.36 rows=256002 width=0)
              (actual time=75.115..75.115 rows=224585 loops=1)
         Index Cond: ((id1 < 5) AND (id2 < 5))
 Total runtime: 657.373 ms
(5 rows)

Well, the difference is not huge, but in practice the results are usually much better.

Incorrectly formulated condition

Other possible cause - in this case an obvious error of the developer who wrote the SQL query - is an incorrect formulation of the condition, unwillingly preventing usage of the index. Consider a "perfect: table

CREATE TABLE test_table (id INTEGER, value VARCHAR(255));
INSERT INTO test_table SELECT i, REPEAT(MD5(i::text), 5)
                         FROM generate_series(1,1000000) AS s(i);
ALTER TABLE test_table ADD PRIMARY KEY (id);
ANALYZE test_table;

And now show several quite stupid SQL queries on a primary key:

SELECT * FROM test_table WHERE id + 1 = 233449;
SELECT * FROM test_table WHERE id * id = 54756;
SELECT * FROM test_table WHERE id = 1::double;

These are queries on a primary key, so it will be fast, right? No - all of them will be evaluated using a sequential scan of a table containing 1 million of rows, and the performance difference may be several orders of magnitude. The problem is that the left hand side of the condition contains an expression involving the column, not the column alone - in case of the first query it is an aritmetic operation "plus", in case of the second query it's a square root, in the third case it's a retype.

There are two ways how to fix this: (a) reformulate the condition so that the left hand side of the condition contains the columns alone so that an index may be used, and (b) create an index on an expression.

Reformulating a condition is much usually easier and more effective, but sometimes it's not possible. In all the examples listed above the reformulation is very simple:

SELECT * FROM test_table WHERE id = 233448;
SELECT * FROM test_table WHERE id = sqrt(54756);
SELECT * FROM test_table WHERE id = 1;

I often hit such problems when solving "clever" queries working with dates and time values in general. Consider a table

CREATE TABLE test_table (id INTEGER, value TIMESTAMP);
INSERT INTO test_table (id, value)
     SELECT i, to_timestamp(1248559618 + i * RANDOM())
       FROM generate_series(1,1000000) s(i);
CREATE INDEX test_table_idx ON test_table (value);
ANALYZE test_table;

Unfortunately I often see queries similar to these:

-- select rows for a given day
SELECT * FROM test_table WHERE value LIKE '2009-08-%';
SELECT * FROM test_table WHERE date(value) = '2009-08-01';

-- select records for the last 15 minutes
SELECT * FROM test_table WHERE value - now() <= interval '15 minutes';

that all result in a sequential scan, which may be a problem for large tables. And just like in the previous example, its not difficult to rewrite the queries to use an index:

-- first two queries are reformulated to the same query
SELECT * FROM test_table WHERE value >= '2009-08-01'
                           AND value <  '2009-08-02';

-- third query
SELECT * FROM test_table WHERE value <= now() + interval '15 minutes';

There is about million other ways to pervent usage of indexes, but I believe the examples listed above are sufficient for illustration purposes.

Comments

There are no comments for this article (or are awaiting acceptance).

New comment

All the comments have to be accepted, so there may be some delay between submitting and accepting (or rejecting) the comment. If you enter the e-mail address, you will be informed about acceptance or rejection.

Subject or body may not contain HTML tags - they will be automatically removed. Paragraphs may be separated using a newline (ENTER).

(optional)