Common problems with SQL - Structure

Let's check another source of problems - structure of a table. Again, this is a very broad topic, so I'll present just one very common example, I guess you'll be able to come up with many variants.

Inadequate data type of a column

Say I need to store IP addresses of clients in table (e.g. audit log) - I very often see this:

CREATE TABLE test_table (id INTEGER, ip VARCHAR(15));

Sure, it works, so where is the problem? Well, while the IP address occupies just 4 bytes, this version uses about 15 bytes, i.e. about four times the required space (and that's without network mask etc.). And yet it's not the most important disadvantage - after all the storage spage is quite cheap nowadays and there's always couple of free gigabytes on the drive, right?

The problem is that the column behaves as a text in all respects - even a comparison is not performed binary (byte by byte) as the linguistic rules have to be considered - and those may be quite complicated (e.g. for Czech language), which may cause much higher CPU usage etc.

Let's fill the table created above with example data (generated so that there are duplicated IP addresses, each of them 15 characters long)

INSERT INTO test_table SELECT i, (200 + round(random()*10)) || '.' ||
                                 (200 + round(random()*10)) || '.' ||
                                 (200 + round(random()*10)) || '.' ||
                                 (200 + round(random()*10))
                       FROM generate_series(1,1000000) s(i);
CREATE INDEX test_table_idx ON test_table(ip);
ANALYZE test_table;

and execute these three example queries:

-- search a specified IP address
SELECT * FROM test_table WHERE ip = '205.201.207.210';

-- search all IP addresses in the 205.0.0.0/8 network
SELECT * FROM test_table WHERE ip BETWEEN '205.0' AND '205.9';

-- count unique IP addresses (number of occurences for each)
SELECT ip, COUNT(*) FROM test_table GROUP BY ip;

Let's check their execution plans

db=# EXPLAIN ANALYZE SELECT * FROM test_table
                             WHERE ip = '205.201.207.210';

                              QUERY PLAN
----------------------------------------------------------------------
 Bitmap Heap Scan on test_table
        (cost=5.12..339.75 rows=92 width=20)
        (actual time=0.086..0.217 rows=47 loops=1)
   Recheck Cond: ((ip)::text = '205.201.207.210'::text)
   ->  Bitmap Index Scan on test_table_idx
              (cost=0.00..5.09 rows=92 width=0)
              (actual time=0.071..0.071 rows=47 loops=1)
         Index Cond: ((ip)::text = '205.201.207.210'::text)
 Total runtime: 0.300 ms
(5 rows)

Not bad, right? The index was used, the query does not consume a lot of resources (cost=339 might be lower but it's fine), and it executes quite fast.

db=# EXPLAIN ANALYZE SELECT * FROM test_table
                             WHERE ip BETWEEN '205.0' AND '205.9';

                              QUERY PLAN
----------------------------------------------------------------------
 Bitmap Heap Scan on test_table
        (cost=251.60..6502.15 rows=9678 width=20)
        (actual time=120.095..294.266 rows=100385 loops=1)
   Recheck Cond: (((ip)::text >= '205.0'::text) AND
                  ((ip)::text <= '205.9'::text))
   ->  Bitmap Index Scan on test_table_idx
              (cost=0.00..249.18 rows=9678 width=0)
              (actual time=118.547..118.547 rows=100385 loops=1)
         Index Cond: (((ip)::text >= '205.0'::text) AND
                      ((ip)::text <= '205.9'::text))
 Total runtime: 400.851 ms
(5 rows)

That's worse - the index was used just like before, but the time necessary to evaluate the query is not very favourable. And moreover - if you need to search for IP addresses using more complicated rules (various netmasks etc.) the results won't get better.

And now the last query:

db=# EXPLAIN ANALYZE SELECT ip, COUNT(*) FROM test_table GROUP BY ip;

                              QUERY PLAN
----------------------------------------------------------------------
 HashAggregate  (cost=20884.65..21019.03 rows=10750 width=16)
                (actual time=3127.823..3151.809 rows=14641 loops=1)
   ->  Seq Scan on test_table
           (cost=0.00..15884.10 rows=1000110 width=16)
           (actual time=0.218..1169.862 rows=1000000 loops=1)
 Total runtime: 3167.560 ms
(3 rows)

This logically results in sequential scan of the table - PostgreSQL does not store visibility info in the indexes, so such queries may not be evaluated just by reading the index (so the table hast bo be read even if the index contains all the necessary columns).

I bet you just said those execution plans are not really bad - yes, you're right. There's an index on the column, it was used whenever possible and the queries are evaluated quite fast. I'll show you a small "twist" - the relational databases usually contain specialized data types beyond those listed in SQL, and PostgreSQL contains INET data type, intended for IP addresses. Let's try to use it instead of the VARCHAR data type.

CREATE TABLE test_table (id INTEGER, ip INET);
INSERT INTO test_table SELECT i,((200 + round(random()*10)) || '.' ||
                                 (200 + round(random()*10)) || '.' ||
                                 (200 + round(random()*10)) || '.' ||
                                 (200 + round(random()*10)))::inet
                       FROM generate_series(1,1000000) s(i);
CREATE INDEX test_table_idx ON test_table(ip);
ANALYZE test_table;

And let's see how it influences the execution plans listed above:

db=# EXPLAIN ANALYZE SELECT * FROM test_table
                             WHERE ip = '205.201.207.210';

                              QUERY PLAN
----------------------------------------------------------------------
 Bitmap Heap Scan on test_table
        (cost=4.98..295.32 rows=80 width=11)
        (actual time=0.061..0.227 rows=60 loops=1)
   Recheck Cond: (ip = '205.201.207.210'::inet)
   ->  Bitmap Index Scan on test_table_idx
              (cost=0.00..4.96 rows=80 width=0)
              (actual time=0.044..0.044 rows=60 loops=1)
         Index Cond: (ip = '205.201.207.210'::inet)
 Total runtime: 0.326 ms
(5 rows)

Not a big change - actual time or cost of the query did not change (but we did not expect that anyway). The next query:

db=# EXPLAIN ANALYZE SELECT * FROM test_table
                             WHERE ip BETWEEN '205.0' AND '205.9';

                              QUERY PLAN
----------------------------------------------------------------------
 Bitmap Heap Scan on test_table
        (cost=2210.12..8353.07 rows=500004 width=11)
        (actual time=33.760..221.775 rows=99585 loops=1)
   Filter: (ip << '205.0.0.0/8'::inet)
   ->  Bitmap Index Scan on test_table_idx
              (cost=0.00..2085.12 rows=99276 width=0)
              (actual time=32.549..32.549 rows=99585 loops=1)
         Index Cond: ((ip > '205.0.0.0/8'::inet) AND
                      (ip <= '205.255.255.255'::inet))
 Total runtime: 328.314 ms
(5 rows)

That's better - the improvement of about 20% is not something world-shaking, but it's nice ;-)

And now the last query:

db=# EXPLAIN ANALYZE SELECT ip, COUNT(*) FROM test_table GROUP BY ip;

                              QUERY PLAN
----------------------------------------------------------------------
 HashAggregate  (cost=20884.65..21019.03 rows=10750 width=16)
                (actual time=3127.823..3151.809 rows=14641 loops=1)
   ->  Seq Scan on test_table
           (cost=0.00..15884.10 rows=1000110 width=16)
           (actual time=0.218..1169.862 rows=1000000 loops=1)
 Total runtime: 3167.560 ms
(3 rows)

Again, the difference is just minimal when compared to the VARCHAR column - so why do I mention that? As already noted, the big difference between VARCHAR and INET data types is when comparing the values - and the previous examples do not exploit this difference.

So let's try another example - let's fill the example table with data and then select the number of unique IP addresses (with and without an index):

CREATE TABLE test_table (id INTEGER, ip VARCHAR(15));
INSERT INTO test_table SELECT i, (1 + round(random()*254) || '.' ||
                                  1 + round(random()*254) || '.' ||
                                  1 + round(random()*254) || '.' ||
                                  1 + round(random()*254))
                       FROM generate_series(1,1000000) s(i);
CREATE INDEX test_table_idx ON test_table(ip);
ANALYZE test_table;

-- check the table and index sizes
SELECT relname, relpages, reltuples FROM pg_class
                                   WHERE relname LIKE 'test_table%';

    relname     | relpages | reltuples
----------------+----------+-----------
 test_table     |     5864 |    999918
 test_table_idx |     3819 |    999918
(2 rows)

-- select the unique IPs
EXPLAIN ANALYZE SELECT DISTINCT ip FROM test_table;

                              QUERY PLAN
----------------------------------------------------------------------
 Unique  (cost=0.00..56234.57 rows=1000129 width=14)
         (actual time=0.064..9917.689 rows=999889 loops=1)                                               
   ->  Index Scan using test_table_idx on test_table
             (cost=0.00..53734.24 rows=1000129 width=14)
             (actual time=0.059..7346.988 rows=1000000 loops=1) 
 Total runtime: 11028.266 ms                                                                                                                         
(3 rows)

-- remove the index
DROP INDEX test_table_idx;

-- select the unique IPs
EXPLAIN ANALYZE SELECT DISTINCT ip FROM test_table;

                              QUERY PLAN
----------------------------------------------------------------------
 Unique  (cost=149720.60..154721.06 rows=1000092 width=14)
         (actual time=15914.665..22675.646 rows=999889 loops=1)                   
   ->  Sort  (cost=149720.60..152220.83 rows=1000092 width=14)
             (actual time=15914.659..20260.847 rows=1000000 loops=1)              
         Sort Key: ip                                                                                                               
         Sort Method:  external merge  Disk: 25712kB                                                                                
         ->  Seq Scan on test_table
                 (cost=0.00..15864.92 rows=1000092 width=14)
                 (actual time=0.012..1331.510 rows=1000000 loops=1) 
 Total runtime: 23764.882 ms                                                                                                        
(6 rows)

And now the variant with INET column:

CREATE TABLE test_table (id INTEGER, ip INET);
INSERT INTO test_table SELECT i, (1 + round(random()*254) || '.' ||
                                  1 + round(random()*254) || '.' ||
                                  1 + round(random()*254) || '.' ||
                                  1 + round(random()*254))::inet
                       FROM generate_series(1,1000000) s(i);
CREATE INDEX test_table_idx ON test_table(ip);
ANALYZE test_table;

-- check the table and index sizes
SELECT relname, relpages, reltuples FROM pg_class
                                   WHERE relname LIKE 'test_table%';

    relname     | relpages |  reltuples
----------------+----------+-------------
 test_table     |     4902 | 1.00001e+06
 test_table_idx |     2745 | 1.00001e+06
(2 rows)

-- select the unique IPs
EXPLAIN ANALYZE SELECT DISTINCT ip FROM test_table;

                              QUERY PLAN
----------------------------------------------------------------------
 Unique  (cost=0.00..48086.15 rows=999995 width=7)
         (actual time=0.068..8977.387 rows=999892 loops=1)
   ->  Index Scan using test_table_idx on test_table
             (cost=0.00..45586.16 rows=999995 width=7)
             (actual time=0.065..6425.966 rows=1000000 loops=1)
 Total runtime: 10089.392 ms
(3 rows)

-- remove the index
DROP INDEX test_table_idx;

-- select the unique IPs
EXPLAIN ANALYZE SELECT DISTINCT ip FROM test_table;

                              QUERY PLAN
----------------------------------------------------------------------
 Unique  (cost=141909.78..146909.82 rows=1000008 width=7)
         (actual time=5965.684..10284.653 rows=999892 loops=1)
   ->  Sort  (cost=141909.78..144409.80 rows=1000008 width=7)
             (actual time=5965.679..7893.319 rows=1000000 loops=1)
         Sort Key: ip
         Sort Method:  external merge  Disk: 18560kB
         ->  Seq Scan on test_table
                 (cost=0.00..14902.08 rows=1000008 width=7)
                 (actual time=0.012..1304.278 rows=1000000 loops=1)
 Total runtime: 11370.620 ms
(6 rows)

This results in three important observations:

  • By using the proper data type, you may noticeably decrease the occupied disk space. When compared to the table with VARCHAR column, the table with INET column is for about 20% smaller, and the index on INET column occupied just about 70% of the space occupied by the index on VARCHAR column.
  • When using indexes, there's not a noticeable difference between the queries (10 seconds in case of INET column vs. 11 seconds in case of VARCHAR column), as the indexes radically decrease the number of necessary comparisons (and thus the difference does not emerge).
  • In case of queries without indexes, the INET column clearly wins (11 seconds compared to 24 seconds in case of VARCHAR column).

Yes, your're right - those differences are not as dramatic as in the first part of this series, demonstrating that a proper index definition and usage may result in several orders of magnitude differences. But 50% difference is not something to be sneezed at, right?

Comments

There are no comments for this article (or are awaiting acceptance).

New comment

All the comments have to be accepted, so there may be some delay between submitting and accepting (or rejecting) the comment. If you enter the e-mail address, you will be informed about acceptance or rejection.

Subject or body may not contain HTML tags - they will be automatically removed. Paragraphs may be separated using a newline (ENTER).

(optional)