Benchmark results / HDD + read-only pgbench
In the previous post, I briefly described the benchmark results, especially how to interpret the images. Now it's time to analyze the first part of the results, namely the read-only portion of the OLTP workload (i.e. the output of the read-only pgbench runs).
TPS over all tested file systems (a detailed comparison is available here), with XFS excluded (it's the only file system tested with 512B blocks, which makes the image a bit less readable), looks like this:

The image clearly shows that the best performance is achieved with small database blocks and large file system blocks. The dependency on file system block size is not equally strong for all the file systems - for some it's quite strong, for others there's almost no dependency.
For example, for the ext4 file system with "data=journal" the results (see the image below) show that the dependency on file system block size is quite obvious, especially for mid-sized database blocks (4kB - 16kB):

The file systems ext2, ext3 (all data modes) and nilfs2 behave in exactly the same way. In contrast, for xfs and ext4 (ordered and writeback) the dependency on the file system block size is almost imperceptible - for example ext4-ordered behaves like this:

The charts also show that although the optimal database block size is near 1kB or 2kB, most of the gain comes from moving from 16kB to 8kB blocks, and going below 8kB has only a minor impact. Let's see how this block size works for the other workloads (read-write pgbench and TPC-H).
Shared buffers and page cache
Generally speaking, the results for all the file systems with the read-only workload are quite even (with several exceptions, mentioned at the end). One of the reasons is that the shared buffers hit ratio depends on neither the file system block size nor the file system type.
The cache hit ratio (the percentage of blocks found in the shared buffers) on average looks like this:

and for XFS the hit ratio is

so it's almost exactly equal to the average. The deviations for other file systems are about +/- 1%.
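The post does not say how the shared buffers hit ratio was collected, so the following is only a minimal sketch of the usual calculation. It assumes two counters in the style of the `blks_hit` and `blks_read` columns of PostgreSQL's `pg_stat_database` view; the function name and the sample values are hypothetical.

```python
def buffer_hit_ratio(blks_hit: int, blks_read: int) -> float:
    """Percentage of block requests that were found in shared buffers.

    blks_hit:  blocks found in shared buffers
    blks_read: blocks PostgreSQL had to request from the kernel
               (these may still be served from the OS page cache)
    """
    total = blks_hit + blks_read
    if total == 0:
        return 0.0  # no block requests recorded yet
    return 100.0 * blks_hit / total


# Hypothetical counter values, for illustration only.
print(f"{buffer_hit_ratio(blks_hit=900, blks_read=100):.1f}%")
```

Because a "read" here may still be satisfied by the kernel page cache, this number describes only the database cache, not the whole cache.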
Similarly, for the whole cache (including the file system page cache, managed by the kernel) the differences are minimal. On average it looks like this:

and the results for individual file systems are almost exactly the same (+/- 1%). The only exception is nilfs2, where the hit ratio is significantly lower (especially for smaller file system blocks).

Note: This whole-cache hit ratio is determined from the pgbench log, i.e. it's the percentage of transactions (not blocks) resolved completely from the cache. Transactions that take less than 1ms are considered a "hit," longer transactions are a "miss" (a transaction that had to access the drive). Compared with the database cache hit ratio this is significantly less precise - take this into account when using it.
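The rule from the note can be sketched in a few lines. This is not the script used for the benchmark, just an illustration that assumes the per-transaction log written by `pgbench -l`, where the third whitespace-separated field is the transaction time in microseconds. The function name and the sample lines are made up.

```python
HIT_THRESHOLD_US = 1000  # 1 ms, the threshold described in the note


def transaction_hit_ratio(log_lines, threshold_us=HIT_THRESHOLD_US):
    """Percentage of transactions faster than the threshold ("hits")."""
    hits = 0
    total = 0
    for line in log_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        latency_us = int(fields[2])  # assumed: 3rd field = latency in us
        total += 1
        if latency_us < threshold_us:
            hits += 1
    return 100.0 * hits / total if total else 0.0


# Made-up log lines: client_id, transaction_no, time, file_no, epoch, us
sample = """\
0 1 850 0 1300000000 100000
0 2 12400 0 1300000000 113000
1 1 640 0 1300000000 101000
1 2 990 0 1300000001 2000
"""
print(transaction_hit_ratio(sample.splitlines()))
```

The 1ms threshold is a heuristic: a fast transaction is only assumed to have avoided the drive, which is why this figure is less precise than block-level statistics.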
TPS over time
What I find quite interesting is how the TPS develops during the benchmark run, or rather the differences caused by different database block sizes (the file system block size has almost no impact on it). Generally all the file systems behave almost equally - let's see for example ext4 with "data=journal" and blocks of 2kB, 8kB and 32kB.



Obviously the smaller the block, the longer the "warmup" (there are more shared buffers to fill) - you can see this as the initial growth at the beginning of the chart. But after the warmup, smaller block sizes yield higher TPS values (up to 310 for 2kB blocks, 250 for 8kB blocks and 220 for 32kB blocks).
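A TPS-over-time series like the ones discussed in this section can be approximated from the pgbench per-transaction log. The sketch below is not the tooling used for the charts. It assumes the `pgbench -l` log format, where the fifth field is the Unix epoch second in which the transaction completed, and it simply counts transactions per second. The function name and sample data are hypothetical.

```python
from collections import Counter


def tps_per_second(log_lines):
    """Return [(seconds_since_start, transactions_completed), ...]."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        counts[int(fields[4])] += 1  # assumed: 5th field = epoch second
    if not counts:
        return []
    start, end = min(counts), max(counts)
    # Include seconds with no completed transactions as explicit zeros.
    return [(sec - start, counts.get(sec, 0)) for sec in range(start, end + 1)]


# Made-up log lines: client_id, transaction_no, time, file_no, epoch, us
sample = [
    "0 1 850 0 100 1000",
    "1 1 900 0 100 2000",
    "0 2 700 0 101 3000",
    "1 2 800 0 103 4000",
]
print(tps_per_second(sample))
```

On a real log, the warmup would show up as low counts in the first seconds, rising towards a plateau.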
Specifics of file systems
In the course of the read-only benchmark, there were very few unexpected or otherwise exceptional results. The first example is btrfs, where the "nodatacow" mode (i.e. with "copy-on-write" disabled) is actually a bit slower than the variant that copies the data.


Clearly, the "nodatacow" variant is a bit slower for all block sizes, although the difference is very small. Sure, it's a read-only benchmark, but even in that case I'd expect the "nodatacow" to be a bit faster (smaller amount of data etc.).
The second, and somewhat unpleasant, surprise is the behaviour of reiserfs, which is significantly slower (about 10%) than the other file systems.

Reiserfs is my favourite file system (years ago it was the first journaling file system, based on B+ trees), but it obviously does not suit a read-only database.
Conclusion
The results mentioned above suggest several basic rules for read-only workloads:
- It does not make much sense to use small file system blocks - 4kB blocks may not improve the performance in some cases (e.g. with ext4-ordered), but they do not hurt the performance in any case.
- The optimal database block size is about 2kB, but the performance does not drop significantly until 16kB blocks (the drop is about 10% compared to 8kB blocks); the difference between 2kB and 8kB blocks is about 5%.




