Common problems with SQL - Too clever constructs
It is said that "A road to hell is paved with good intentions," and it's true even for SQL queries. A lot of problems is caused by developers who strive to optimize the database, frequently without necessary knowledge of the internals.
Endeavour to unload the database using LIMIT / OFFSET
Consider two systems that need exchange data periodically, and it may not be realized using a shared database - say for example two web applications on separate servers. One of the options is creating a script (e.g. using PHP) that - after verifying login and password - writes all the required data to the output in a CSV format. Let's say there is nontrivial amount of data (tens of thousands of rows, each of them occupying severals kilobytes), so that reading the data means reading tens of MB of data.
In one unnamed e-shop exactly this solution was implemented - there was one central table of products, with about the following structure (in reality there was much more columns, but it's not relevant to the demonstration):
CREATE TABLE Products (
id SERIAL PRIMARY KEY,
name VARCHAR(100),
description TEXT,
export BOOLEAN NOT NULL DEFAULT 't',
in_stock INTEGER NOT NULL DEFAULT 0,
price NUMERIC(10,2)
);
a several "associated" websites periodically - even several times a day - loaded the actual list of products. The table contained about 40.000 records and about 20.000 of them was exported. Exporting script was implemented like this:
-- counting the products to export (export AND (in_stock > 0))
$count = number_of_products_to_export();
-- compute number of blocks (each containing 100 items)
$numberOfBlocks = ceil($count / 100);
FOR i IN 0..($numberOfBlocks - 1) LOOP
-- load and write a block
SELECT * FROM Products WHERE export AND (in_stock > 0)
ORDER BY id LIMIT 100 OFFSET ($i*100);
END LOOP;
Focus especially on the partitioning of the whole list into blocks. The developer's intention was to "optimize" the memory consumption (the original practice was to load the whole resultset at once and then write it to the output) and performance.
Let's fill the table with testing data (50.000 rows)
db=# INSERT INTO Products
SELECT i AS id,
md5(i::text) AS name,
repeat(md5(i::text), 100) AS description,
(random()*100 < 95) AS export,
round(random()*10) AS in_stock,
(round(random()*10000) / 100) AS price
FROM generate_series(1,50000) s(i);
check how the first block will be processed:
EXPLAIN ANALYZE SELECT * FROM Products
WHERE export AND (in_stock > 0)
ORDER BY id LIMIT 100 OFFSET 0;
QUERY PLAN
---------------------------------------------------------------------
Limit (cost=0.00..7.97 rows=100 width=758)
(actual time=0.034..0.441 rows=100 loops=1)
-> Index Scan using products_pkey on products
(cost=0.00..3936.25 rows=49412 width=758)
(actual time=0.031..0.218 rows=100 loops=1)
Filter: (export AND (in_stock > 0))
Total runtime: 0.584 ms
(4 rows)
If all the 500 parts would be evaluated this way, the total time would be about 300ms. The problem is the time consumed by a given part depends on the offset - more precisely, it increases linearly with the offset - in this case for about 30ms each 10.000 rows.
db=# EXPLAIN ANALYZE SELECT * FROM Products
WHERE export AND (in_stock > 0)
ORDER BY id LIMIT 100 OFFSET 10000;
QUERY PLAN
---------------------------------------------------------------------
Limit (cost=796.62..804.59 rows=100 width=758)
(actual time=29.784..30.178 rows=100 loops=1)
-> Index Scan using products_pkey on products
(cost=0.00..3936.25 rows=49412 width=758)
(actual time=0.027..19.315 rows=10100 loops=1)
Filter: (export AND (in_stock > 0))
Total runtime: 30.323 ms
(4 rows)
db=# EXPLAIN ANALYZE SELECT * FROM Products
WHERE export AND (in_stock > 0)
ORDER BY id LIMIT 100 OFFSET 20000;
QUERY PLAN
---------------------------------------------------------------------
Limit (cost=1593.24..1601.20 rows=100 width=758)
(actual time=62.815..63.225 rows=100 loops=1)
-> Index Scan using products_pkey on products
(cost=0.00..3936.25 rows=49412 width=758)
(actual time=0.027..38.742 rows=20100 loops=1)
Filter: (export AND (in_stock > 0))
Total runtime: 63.378 ms
(4 rows)
db=# EXPLAIN ANALYZE SELECT * FROM Products
WHERE export AND (in_stock > 0)
ORDER BY id LIMIT 100 OFFSET 30000;
QUERY PLAN
---------------------------------------------------------------------
Limit (cost=2389.86..2397.82 rows=100 width=758)
(actual time=91.332..91.742 rows=100 loops=1)
-> Index Scan using products_pkey on products
(cost=0.00..3936.25 rows=49412 width=758)
(actual time=0.028..58.388 rows=30100 loops=1)
Filter: (export AND (in_stock > 0))
Total runtime: 91.889 ms
(4 rows)
So the average time necessary to evaluate a single part is about 75ms, which yields about 37,5 seconds for 500 parts. Let's try to use a simple modification - remove the LIMIT / OFFSET clause.
im=# EXPLAIN ANALYZE SELECT * FROM Products
WHERE export AND (in_stock > 0)
ORDER BY id;
QUERY PLAN
---------------------------------------------------------------------
Index Scan using products_pkey on products
(cost=0.00..3936.25 rows=49412 width=758)
(actual time=0.031..128.397 rows=48017 loops=1)
Filter: (export AND (in_stock > 0))
Total runtime: 182.205 ms
(3 rows)
As you can see, thanks to removing the "clever LIMIT/OFFSET construct" the processing was much faster.
Regarding the effort to save memory by LIMIT / OFFSET clause, it's much more effective not to load the whole resultset at once, but row by row.




