Common problems with SQL - administrator mistakes
It may be surprising, but many performance problems with SQL statements are originally caused by administrator mistakes. Some of these problems are closely related to the particular database system, and there are infinitely many ways to misconfigure a system or a database - so let's look at two quite frequent problems.
When the database prepares an execution plan, it actually builds several possible execution plans and evaluates the "cost" of each of them, i.e. the cost of the operations (CPU usage, I/O operations, etc.) necessary to evaluate the SQL statement using that particular execution plan. Then it chooses the "cheapest" execution plan and executes the statement.
When evaluating the cost of a given execution plan, various statistics are used - information about the size of the tables, the number of rows, histograms of values in a given column, etc. As a result, if the statistics are not up-to-date, the database may not choose the cheapest execution plan (because it believes, for example, that one of the tables is much smaller than it really is).
How to find out that the statistics are not precise? Quite easily from the execution plan obtained using EXPLAIN ANALYZE, as it contains both the estimates of important parameters (e.g. the number of returned rows) computed when creating the execution plan, and the values observed when actually running the statement. If the values are considerably different, it's an unmistakable sign that the statistics are not precise - either stale or not detailed enough.
Fixing this is usually quite simple - all you need to do is update the statistics. In PostgreSQL this may be done manually using the ANALYZE command, but for continuous maintenance the autovacuum daemon is a much better choice. In Oracle this may be done using the DBMS_STATS procedures.
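In PostgreSQL, for example, the autovacuum daemon (which runs ANALYZE automatically, too) is controlled from postgresql.conf - a minimal sketch, with illustrative values rather than tuning recommendations:

```
autovacuum = on                         # run the autovacuum daemon
autovacuum_naptime = 1min               # delay between autovacuum rounds
autovacuum_analyze_threshold = 50       # min. number of modified rows before ANALYZE
autovacuum_analyze_scale_factor = 0.1   # plus this fraction of the table size
```

With these settings a table is re-analyzed once roughly 10% of it (plus 50 rows) has been modified since the last ANALYZE.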
Example: Let's create a simple table with two columns (ID and VALUE), fill it with random data (1,000,000 rows), and update the statistics right after inserting the first row - all this with the autovacuum daemon disabled, of course.
CREATE TABLE test_table (id INTEGER, value VARCHAR(255));
CREATE INDEX test_table_idx ON test_table(id);

INSERT INTO test_table
     SELECT mod(i, 100) + 1, REPEAT(MD5(i::text), 5)
       FROM generate_series(1,1) AS s(i);

ANALYZE test_table;

INSERT INTO test_table
     SELECT mod(i, 100) + 1, REPEAT(MD5(i::text), 5)
       FROM generate_series(2,1000000) AS s(i);
If we try to read all the rows with ID = 20, i.e. 1% of the table (10,000 rows), we'll get the following execution plan using the EXPLAIN ANALYZE command:
db=# EXPLAIN ANALYZE SELECT * FROM test_table WHERE id = 20;
                               QUERY PLAN
----------------------------------------------------------------------
 Index Scan using test_table_idx on test_table
   (cost=0.00..8.39 rows=1 width=168)
   (actual time=0.087..131.211 rows=10000 loops=1)
   Index Cond: (id = 20)
 Total runtime: 142.704 ms
(3 rows)
As you can see, while the estimated number of resulting rows is 1, in reality there are 10,000 rows - a huge difference. Don't be confused by the difference between the "cost" and "actual time" values - these values cannot be compared directly; the "cost" denotes an abstract price of the given step (startup and total), while "actual time" is the time actually spent evaluating the query.
Let's update the statistics and execute the query again:
db=# ANALYZE test_table;
db=# EXPLAIN ANALYZE SELECT * FROM test_table WHERE id = 20;
                               QUERY PLAN
----------------------------------------------------------------------
 Bitmap Heap Scan on test_table
   (cost=198.81..18609.48 rows=9604 width=168)
   (actual time=6.582..118.196 rows=10000 loops=1)
   Recheck Cond: (id = 20)
   ->  Bitmap Index Scan on test_table_idx
         (cost=0.00..196.41 rows=9604 width=0)
         (actual time=3.313..3.313 rows=10000 loops=1)
         Index Cond: (id = 20)
 Total runtime: 129.664 ms
(5 rows)
Obviously, the statistics are now much more precise - in general, the estimated and actual values will never match exactly, and only serious differences (orders of magnitude etc.) indicate a problem. The update of the statistics actually caused a different execution plan to be chosen, although the speed has not improved much - but in case of more complex statements (multiple joins, more indexes, more conditions, more complex conditions, etc.) precise statistics may decide whether the query executes within a second or runs for a whole day.
By the way - there is a great web tool at explain.depesz.com that you may use to analyze execution plans, highlight the most problematic parts, etc.
Whereas stale statistics are a potential problem in virtually all database systems, bloated tables and indexes are quite closely related to the MVCC architecture of PostgreSQL. This problem does not exist in Oracle, but in PostgreSQL it may cause serious performance issues.
When modifying (UPDATE) a row, PostgreSQL creates a "copy" of the row with the new values and marks the original row as "deleted" (recording the ID of the transaction). In other words, the original rows remain in the table (and the indexes), so they're still visible to other transactions (READ COMMITTED) and a ROLLBACK is possible. The problem is that the tables and indexes may bloat - if you update the whole table, its size will roughly double (and the same holds for the indexes).
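This bloat is easy to observe directly - a short sketch, assuming the test_table from the example above (the exact sizes reported will of course differ):

```sql
SELECT pg_size_pretty(pg_relation_size('test_table'));  -- size before the update

UPDATE test_table SET value = value;  -- rewrites every row, leaving a dead copy of each

SELECT pg_size_pretty(pg_relation_size('test_table'));  -- roughly twice the original size
```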
So it's necessary to clean the tables and indexes regularly - to remove the dead rows (dead tuples) that are no longer needed by any transaction. In PostgreSQL this may be done using the VACUUM [FULL] [ANALYZE] command, which performs a "single-shot" clean-out, or even better by the autovacuum daemon, which cleans continuously and automatically.
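A one-shot clean-up of the example table might look like this (note that VACUUM FULL takes an exclusive lock on the table, so it should not be run during normal operation):

```sql
-- remove the dead tuples, compact the table and update the statistics in one pass
VACUUM FULL ANALYZE test_table;

-- plain VACUUM (without FULL) only marks the dead tuples as reusable and does
-- not shrink the data files, but it does not block concurrent reads and writes
VACUUM ANALYZE test_table;
```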
How to find out whether the tables or indexes have bloated? There is no simple answer - PostgreSQL provides a lot of interesting statistical system catalogs with various information about the size of tables, the number of rows, the number of dead tuples, etc. - i.e. exactly the information you need. The trouble is that the collection of this information may be disabled - so check that the statistics collector is enabled in the configuration file (note that the configuration differs between PostgreSQL versions, especially between 8.2 and 8.3). If all the important statistics are collected, it's quite simple to write a script that reads the data from the pg_stat_user_tables catalog and, if the ratio between live and dead tuples reaches a given value (say 1:1), performs a defined action - e.g. sends an e-mail to the administrator, etc.
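Such a check may be as simple as a single query - a sketch assuming PostgreSQL 8.3+, where pg_stat_user_tables contains the n_live_tup and n_dead_tup columns (the 1:1 threshold is just the example value mentioned above):

```sql
SELECT relname, n_live_tup, n_dead_tup
  FROM pg_stat_user_tables
 WHERE n_dead_tup > n_live_tup;  -- dead:live ratio worse than 1:1
```

Any table returned by this query is a candidate for (auto)vacuuming, or for whatever notification the script is supposed to send.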
The best solution is obviously the autovacuum daemon, but even when enabled it may not be configured properly (e.g. when large amounts of data are loaded regularly, etc.). But configuring autovacuum is clearly out of the scope of this article.