Benchmark results / HDD + read-only pgbench

In the previous post, I briefly described the benchmark results, especially how to interpret the images. Now it's time to analyze the first part of the results, namely the read-only portion of the OLTP workload (i.e. the output of the read-only pgbench runs).

The TPS across all tested file systems (a detailed comparison is available here) is shown below. XFS is excluded because it's the only file system with 512B blocks, which makes the image a bit less readable.

average read-only tps for all the filesystems

The image clearly shows that the best performance is achieved with small database blocks and large file system blocks. The dependency on file system block size is not equally strong for all the file systems - for some it's quite strong, for others there's almost no dependency.

For example, for the ext4 file system with "data=journal", the results (see the image below) show that the dependency on file system block size is quite obvious, especially for mid-sized database blocks (4kB - 16kB):

read-only tps for ext4-journal, with write barriers enabled

The file systems ext2, ext3 (all data modes) and nilfs2 behave in exactly the same way. In contrast, for xfs and ext4 (ordered and writeback) the dependency on the file system block size is almost imperceptible - for example, ext4-ordered behaves like this:

read-only tps for ext4-ordered, with write barriers enabled

The charts also show that although the optimal database block size is near 1kB or 2kB, most of the gain comes from moving from 16kB to 8kB blocks, and going below 8kB has only a minor additional impact. Let's see how this block size works for the other workloads (read-write pgbench and TPC-H).

Shared buffers and page cache

Generally speaking, the results for all the file systems with the read-only workload are quite even (with several exceptions, mentioned at the end). One of the reasons is that the shared buffers hit ratio does not depend on the file system block size or type.

The cache hit ratio (the percentage of blocks found in the shared buffers) on average looks like this:

average db cache (shared buffers) hit ratio

and for XFS the hit ratio is

db cache (shared buffers) hit ratio for XFS

so it's almost exactly equal to the average. The deviations for other file systems are about +/- 1%.
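The post does not say how the shared buffers hit ratio was collected. As a rough sketch, assuming it is derived from the `blks_hit` and `blks_read` counters in PostgreSQL's `pg_stat_database` view, sampled before and after a run, the calculation could look like this (the function names and the counter values are made up for illustration):

```python
from typing import Tuple


def shared_buffers_hit_ratio(blks_hit: int, blks_read: int) -> float:
    """Fraction of block requests that were found in shared buffers."""
    total = blks_hit + blks_read
    if total == 0:
        return 0.0  # no block requests at all, so there is nothing to measure
    return blks_hit / total


def hit_ratio_between(before: Tuple[int, int], after: Tuple[int, int]) -> float:
    """Hit ratio for the interval between two (blks_hit, blks_read) snapshots."""
    return shared_buffers_hit_ratio(after[0] - before[0], after[1] - before[1])


# Hypothetical counter snapshots, not taken from the benchmark.
ratio = hit_ratio_between(before=(1_000, 500), after=(91_000, 10_500))
print(f"{ratio:.1%}")  # prints 90.0%
```

Working with the difference of two snapshots keeps activity from before the benchmark run out of the ratio.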

Similarly, for the whole cache (including the file system page cache, managed by the kernel) the differences are minimal. On average it looks like this

average cache (shared buffers + page cache) hit ratio

and the results for individual file systems are almost exactly the same (+/- 1%). The only exception is nilfs2, where the hit ratio is significantly lower (especially for smaller file system blocks).

nilfs2 cache (shared buffers + page cache) hit ratio

Note: This whole-cache hit ratio is determined from the pgbench log, i.e. it's a percentage of transactions (not blocks) resolved completely from the cache. Transactions that take less than 1ms are counted as a "hit", longer transactions as a "miss" (a transaction that had to access the drive). Compared with the db cache hit ratio this is significantly less precise, so keep that in mind when interpreting it.
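The 1ms rule from the note can be sketched in a few lines. This is only an illustration, not the script used for the benchmark: it assumes the per-transaction log written by `pgbench -l`, where each line has the layout `client_id transaction_no time file_no time_epoch time_us` and the third field is the transaction time in microseconds. The sample lines are invented.

```python
def transaction_latencies_us(log_lines):
    """Yield the latency (third field, microseconds) of each logged transaction."""
    for line in log_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        yield int(fields[2])


def latency_hit_ratio(log_lines, threshold_us=1000):
    """Share of transactions faster than the threshold (1ms by default)."""
    hits = 0
    total = 0
    for latency in transaction_latencies_us(log_lines):
        total += 1
        if latency < threshold_us:
            hits += 1  # fast enough to count as served from cache
    return hits / total if total else 0.0


# Invented log lines in the assumed format, for illustration only.
sample_log = [
    "0 1 420 0 1300000000 100420",
    "0 2 8350 0 1300000000 108770",
    "1 1 610 0 1300000000 100610",
    "1 2 990 0 1300000001 101600",
]
print(latency_hit_ratio(sample_log))  # prints 0.75
```

A single slow block read turns the whole transaction into a "miss", which is one reason this number is coarser than the block-level ratio.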

TPS over time

What I find quite interesting is how the tps develops during the benchmark run, or rather the differences caused by different database block sizes (the file system block size has almost no impact on it). Generally, all the file systems behave almost identically - let's see for example ext4 with "data=journal" and blocks of 2kB, 8kB and 32kB.

TPS during the benchmark (DB block size 2kB)

TPS during the benchmark (DB block size 8kB)

TPS during the benchmark (DB block size 32kB)

Obviously, the smaller the block, the longer the "warmup" (there are more shared buffers to fill) - you can see this as the initial growth at the beginning of the chart. But after the warmup, smaller block sizes yield higher tps values (up to 310 for 2kB blocks, 250 for 8kB blocks and 220 for 32kB blocks).
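A curve like the ones above can be approximated by counting completed transactions per second. The sketch below is an assumption about how that could be done, not the tooling behind the charts: it again assumes the `pgbench -l` per-transaction log layout `client_id transaction_no time file_no time_epoch time_us` and buckets transactions by the `time_epoch` field. The sample lines are invented.

```python
from collections import Counter


def tps_per_second(log_lines):
    """Return (seconds since first transaction, transactions completed) pairs."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        counts[int(fields[4])] += 1  # fifth field: completion time (epoch seconds)
    if not counts:
        return []
    start, end = min(counts), max(counts)
    # Include seconds with no completed transactions so gaps show up as zero.
    return [(sec - start, counts[sec]) for sec in range(start, end + 1)]


# Invented log lines in the assumed format, for illustration only.
sample_log = [
    "0 1 420 0 100 420",
    "1 1 610 0 100 610",
    "0 2 8350 0 101 8770",
    "1 2 990 0 103 1600",
]
print(tps_per_second(sample_log))  # prints [(0, 2), (1, 1), (2, 0), (3, 1)]
```

On such a series, the warmup appears as the initial stretch where the per-second counts are still climbing.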

Specifics of file systems

In the course of the read-only benchmark, there were very few unexpected or otherwise exceptional results. The first example is btrfs, where the "nodatacow" mode (i.e. with "copy-on-write" disabled) is actually a bit slower than the variant that copies the data.

TPS for btrfs with copy-on-write enabled

TPS for btrfs with copy-on-write disabled (nodatacow)

Clearly, the "nodatacow" variant is a bit slower for all block sizes, although the difference is very small. Sure, it's a read-only benchmark, but even in that case I'd expect "nodatacow" to be a bit faster (smaller amount of data etc.).

The second, and somewhat unpleasant, surprise is the behaviour of reiserfs, which is significantly slower (about 10%) than the other file systems.

TPS for reiserfs

Reiserfs is my favourite file system. A few years ago it was the first journaling file system in the Linux kernel, and it is based on B+ trees, but it obviously does not suit a read-only database.

Conclusion

The results mentioned above suggest several basic rules for read-only workloads:

  • It does not make much sense to use small file system blocks - 4kB blocks may not improve the performance in some cases (e.g. with ext4-ordered), but they do not hurt the performance in any case.
  • The optimal database block size is about 2kB, but the performance does not drop significantly until 16kB blocks (the drop is about 10% compared to 8kB blocks). The difference between 2kB and 8kB blocks is about 5%.
