Results of the block-size benchmark for HDD - introduction
About two weeks ago I've mentioned my attempt to find out how does a block size - on filesystem and database level - influence the PostgreSQL performance. I already had some results at that time, but as several people pointed out, the configuration was not exactly perfect (especially when it comes to shared buffers and checkpoint configuration). In the end I've decided to rerun the benchmark with proper config and I've even added a few more options - so it run for two weeks instead of one. Today it finished (hooray!), so let's see some results.
The results are available here - in this post I'll try briefly to explain how to use them (what the images mean, how to read them etc.) and I'll mention several not so obvious problems that you should be aware of when comparing the results.
The OLTP and TPC-H results will be discussed in detail in the following posts.
As I've already mentioned, the benchmark results (for the HDD) are available here. The index page contains a list of the 24 file systems that were benchmarked

Click one of them and you'll see it's results, but you may compare several file systems too - just use the checkboxes and click the "compare" button below the table. And you may even download the results in CSV format.
Two image types
The results for each system are illustrated using two types of images. First type is an image like this:

The title describes the filesystem (in this case it's ext3 with data=journal and write barriers) and system configuration. The x-axis denotes PostgreSQL block size (1kB to 32kB), the y-axis denotes filesystem block size (actual values depend on the file system). Each cell contains value received for the filesystem and block sizes combination - for example the image above shows that this filesystem with 8kB db blocks and 2kB fs blocks reached 261.87 tps (in a read-only test).
The image is colored according to a scale (printed on the right side) - when you compare several file systems (using the "compare" button) the scale is common to all the images so that it's easier to compare them visually. In all other cases the scale is specific for each image.
Second type of images is used to illustrate the course of the 10-minute pgbench runs - using tps and latency. For example this image shows tps for a read-write test with xfs file system (with write barriers enabled) for 8kB database blocks and 4kB file system blocks. The red color is used for immediate value, blue color is used for 10-second averages.

The blue background area shows when a checkpoint was in progress (blue is for timed checkpoint, red for xlog checkpoint). In the image above, the first timed checkpoint happened at 4:00 and was running until 8:00 and when it finished, another timed checkpoint was started immediately. That's the expected behavior, because the checkpoint interval was set to 4 minutes and completion target to 0.9.
A complementing information is latency, illustrated on the following image (same file system, but a read-only test). The x-axis denotes time, the y-axis denotes latency.

As most transactions have low latency, those several transactions with high latency make it less useful as all the low latencies are stacked at the bottom of the y-axis. That's why there's a log-scale version of the image

This actually nicely shows which transactions were processed using a cache (latency is below 1ms) and which had to touch the drive. Even the latency charts show checkpoints, but in this case it's very short because it's a read-only test so there's nothing to write.
The tps and latency charts for various block sizes can't be displayed in a single image, so they're printed in a table next to each other. The database block size grows with rows (1kB, 2kB, ... up to 32kB), filesystem block size grows with columns.

Hopefully this description is enough - if you're not sure about something, ask me.
Problems
So let's talk about the problems you're going to face when you want to change the database block size, especially if you want to compare the results. In the PostgreSQL config, there are several items tightly related to the block size and yet it's not clear how to set them when you change the block size.
random_page_cost, seq_page_cost vs. block size
The first thing that comes to mind are cost variables - especially seq_page_cost and random_page_cost. Suppose we have a table with 100MB of data - that's about 13000 blocks with the default size 8kB. If you change the block size to 32kB, the number suddenly drops to about 3300. And due to this, the seq scan cost estimate drops too - but the amount of data remains the same.
Yes, the cost estimates are used to compare various plans - e.g. a seq scan vs. index scan, so the absolute value is not important. But the index scan cost estimate does not drop so fast, so it may happen that for large blocks the seq scan wins although an index scan would be more efficient (or vice versa - the index scan may win for small blocks).
This does not affect the pgbench much (they usually use indexes for all block sizes), but it plays quite important role in the TPC-H benchmark where the good plan choice is essential (some queries do not even finish for some block sizes).
I have not found a reasonable algorithm to modify the costs for each block size, so I've used the same values for all (implicit values for HDD, for SSD I've lowered the random_page_cost to 2).
If you have an idea how to change the costs, let me know.
background writer
Background writer, writing dirty blocks to the disk, is yet another problem. It's activity is influenced by bgwriter_lru_maxpages item, which means how many dirty blocks are written to the disk everytime it runs. The default value is 100, but that's 100kB for a 1kB blocks while 3.2MB for 32kB blocks.
The consequence is that it writes much more data for large blocks than for the small ones,, although the dataset and workloads are the same, as illustrated the following image:

i.e. about 80MBs were written for the 1kB blocks, but for 32kB blocks it was almost 1.5GB! Let's see how much of the data were written due to checkpoints and how much due to the background writer. (checkpoint first):

and now the background writer:

I.e. most of the data was really written by the background writer.
You could set the bgwriter_lru_maxpages item to represent a constant amount of data (e.g. 3.2MB), but it's not that simple. The large blocks are dirtied sooner and the db cache is smaller (contains smaller number of buffers) so it needs to be cleaned more often, so a more aggressive background writer probably makes sense.
Just as in case of the cost variables - if you have an idea how to set the bacground writer for various block sizes,let me know.
checkpointing
And finally just a notice about reliability of the checkpoint stats (number of blocks written etc.). The current checkpoint implementation does not report the progress, so the pg_stat_bgwriter is not updated until the checkpoint finishes (the same holds for writing the info to the log). So when the checkpoint is in progress when the pgbench run finishes, the checkpoint is not included in the stats.
Executing a forced checkpoint after the pgbench run does not help - it's not possible to find out how many buffers were written during the benchmark. I've prepared a small patch that makes this easier as it writes some basic progress information to the log.
That's all for now - in the next post I'll talk about the pgbench results.





The link to the benchmark scripts ( http://www.fuzzy.cz/download/bench-scripts.tgz ) doesn't seem to work . The results are interesting but it's hard to tell anything for sure without the benchmark code.
As for tpc-h queries execution plans - maybe turning off undesired plans types would work ?
Thanks for pointing out an invalid link - should be fixed now. I've actually updated it with the most recent version of the scripts, although the tweaks are rather small.
No, using enable_* variables to disable some some node types in not a solution. The execution plans tend to be quit complex, using multiple node types at the same time - e.g. reading one table with an index scan, the other one with seq. scan, using nested loop and merge join at the same time, etc.
The only reasonable way is to modify the cost variables, but even then I'd be surprised to get exactly the same plan for all queries / all block sizes.