Benchmark results / HDD + read-write pgbench
I've already briefly analyzed results of the read-only pgbench runs, now it's time to discuss the read-write workload. The average tps for all file systems with write barriers enabled looks like this

That clearly shows that, just like in case of a read-only workload, better performance is achieved with smaller database blocks. The difference between 1kB and 32kB blocks is about 30%, depending on the file system block size.
Write barriers make it possible to use a volatile cache (i.e. a cache without a battery backup) without the risk of serious file system corruption. If the file system supports it and you use a cache with a BBU, you may disable the write barriers - you'll get much better performance, as you can see in this image

Obviously, for small database blocks you may get a significant performance gain (up to 100%), and this was with just a very small cache (32MB drive cache). At the same time, the difference between small and large database blocks increased significantly.
Maybe you've noticed a slightly strange anomaly in the upper left corner of the images, where the file systems perform very differently compared to the other block sizes. Interestingly, the behaviour is exactly the opposite depending on whether the write barriers are enabled or disabled. For example the ext3-ordered file system with write barriers enabled behaves like this

i.e. the performance in the upper left corner is a bit lower, while with write barriers disabled the file system behaves like this

i.e. the performance in the upper left corner is significantly better. This behaviour is common to all ext3 and ext4 file systems, although in the case of "data=journal" with write barriers it's not very apparent

And it's not just about ext3/ext4 - XFS with write barriers disabled shows this anomaly too

Database cache hit ratio
The database cache hit ratio, i.e. the percentage of database blocks found in the shared buffers (i.e. without the need to read them from the disk) is quite high - actually significantly higher than with a read-only workload (where it was between 60% and 70%).

Just like in the read-only case, the highest hit ratio is achieved for 2kB blocks, but the dependency on the block size is quite weak - the values are in a 2% interval (with read-only workload the interval was about 10% wide).
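As a minimal sketch of how such a hit ratio can be computed: PostgreSQL exposes the counters `blks_hit` and `blks_read` in the `pg_stat_database` view, and the ratio is the share of hits among all block requests. The function name and the sample numbers below are made up for illustration; they are not values from this benchmark.

```python
def cache_hit_ratio(blks_hit: int, blks_read: int) -> float:
    """Return the shared-buffers hit ratio in percent.

    blks_hit  - blocks found in shared buffers
    blks_read - blocks that had to be requested from outside shared buffers
    """
    total = blks_hit + blks_read
    if total == 0:
        # No block requests yet, so there is nothing to measure.
        return 0.0
    return 100.0 * blks_hit / total


# Hypothetical counters, only to show the arithmetic.
print(f"{cache_hit_ratio(blks_hit=900, blks_read=100):.1f}%")
```

Note that a block counted in `blks_read` may still be served from the operating system page cache, so this number says nothing about the hit ratio of the file system cache.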
TPS
While with a read-only workload the file systems behaved almost exactly the same, with a read-write workload they behave very differently. So let's see the file systems one by one - I've chosen the common block sizes (4kB for the file system, 8kB for the database) for comparison.
There are several interesting aspects to look for
- What is the performance when there is a checkpoint in progress?
- What is the performance when there is not a checkpoint in progress?
- Is the performance stable or does it fluctuate a lot?
All those charts are with write barriers enabled - if you're interested in the behaviour without write barriers, check the results here.
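The three questions above can be turned into numbers with a small script. This is only a sketch, assuming you have per-second tps samples and know when each checkpoint started and ended (e.g. from the server log with `log_checkpoints` enabled). The function name, the data layout, and the sample values are hypothetical, not the tooling used for the charts.

```python
from statistics import mean, pstdev


def summarize_tps(samples, checkpoints):
    """Split tps samples into "during checkpoint" and "outside checkpoint".

    samples     - list of (second, tps) tuples
    checkpoints - list of (start_second, end_second) tuples, end exclusive
    Returns the mean tps and the coefficient of variation for both groups.
    """
    def in_checkpoint(t):
        return any(start <= t < end for start, end in checkpoints)

    def stats(values):
        if not values:
            return None
        avg = mean(values)
        # Coefficient of variation: higher means more fluctuation.
        cv = pstdev(values) / avg if avg else 0.0
        return {"mean": avg, "cv": cv}

    during = [tps for t, tps in samples if in_checkpoint(t)]
    outside = [tps for t, tps in samples if not in_checkpoint(t)]
    return {"during_checkpoint": stats(during), "outside_checkpoint": stats(outside)}


# Made-up samples: stable 150 tps, then a drop while a checkpoint runs.
samples = [(0, 150), (1, 150), (2, 20), (3, 10), (4, 30), (5, 150)]
print(summarize_tps(samples, checkpoints=[(2, 5)]))
```

The coefficient of variation is used instead of the plain standard deviation so that file systems with different average tps can be compared directly.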
Ext2
In the case of this "traditional" Linux file system, the checkpoint behaviour is quite intrusive, i.e. once a checkpoint is triggered the performance immediately drops (from 150 tps to less than 20 tps)

Interestingly the behaviour is much better for other block sizes - e.g. for 16kB database blocks it behaves like this

i.e. the value fluctuates around 100 tps, but there's no brutal drop as with the 8kB blocks. Interestingly the fluctuations start before the actual checkpoint and continue even after it ends ...
ext3
In case of ext3 (and the same holds for ext4) it's necessary to consider the data mode. With the lowest data mode (data=writeback), the chart looks like this

After switching to the ordered mode (data=ordered), the off-checkpoint behaviour does not change (the tps remains about 120), but the performance during a checkpoint is significantly lower (with a much higher variance)

And finally, with full journal enabled (data=journal) the tps drops from 120 to 80, and the fluctuation during a checkpoint changes a bit

Generally speaking, ext3 pleased me - I had expected the behaviour to be much worse.
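For reference, the data modes and the write barriers discussed above are selected with the ext3/ext4 mount options `data=` and `barrier=`. The helper below is just an illustrative sketch that builds the option string. The function name, the device `/dev/sdX`, and the mount point `/mnt/pgdata` are placeholders, and the exact mount commands used in the benchmark are not shown in this article.

```python
VALID_DATA_MODES = ("writeback", "ordered", "journal")


def ext_mount_options(data_mode: str, barriers: bool) -> str:
    """Build an ext3/ext4 mount option string for a data mode and barrier setting."""
    if data_mode not in VALID_DATA_MODES:
        raise ValueError(f"unknown data mode: {data_mode!r}")
    # barrier=1 enables write barriers, barrier=0 disables them.
    return f"data={data_mode},barrier={1 if barriers else 0}"


# Print one (placeholder) mount command per data mode, barriers enabled.
for mode in VALID_DATA_MODES:
    options = ext_mount_options(mode, barriers=True)
    print(f"mount -o {options} /dev/sdX /mnt/pgdata")
```

Disabling barriers (`barriers=False`) only makes sense with a battery-backed cache, as explained in the introduction.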
ext4
The ext4 file system is basically a successor of the ext3 file system, improving for example the checkpoint behaviour and write barriers. It uses the same data modes - "data=writeback" behaves almost exactly the same as with ext3 (but the performance is a bit better)

while the "ordered" is significantly smoother compared to ext3 (it's almost equal to "writeback")

The "journal" mode behaves exactly the same as with ext3

So yes, it seems like a step in the right direction, although not too big.
XFS
The XFS file system was designed as a journaling file system right from the beginning (unlike the ext3/ext4 file systems, which are built on top of ext2, which is not journaling). But the constraints are rather weak and comparable to the "writeback" mode of ext3/ext4. XFS behaves like this

Obviously the performance is not as good as ext4-writeback (about 100tps compared to 130tps) and the same holds for the ordered data mode. But when a checkpoint is running, the behaviour is actually much better (better performance, less fluctuation). Compared to ext4-journal, xfs is actually a bit faster even when a checkpoint is not running (about 100 tps compared to 80 tps).
Note: The original comparison to ext4-journal and conclusions were a bit misleading, as
JFS
Another "native journalling" file system is JFS, but it is not that widely supported or developed. The performance when the checkpoint is not in progress is very interesting (140 tps is significantly better than 100 tps achieved with XFS), but when the checkpoint is in progress the performance drops.

The behaviour during a checkpoint depends on the database block size - e.g. when using a smaller block size (4kB) the behaviour is this:

But even this behaviour is quite bad - the performance significantly fluctuates, especially when compared to XFS and ext3/ext4.
ReiserFS
ReiserFS behaviour is quite interesting and may be compared to ext4 or XFS - the performance is about equal to XFS but the fluctuation during a checkpoint resembles ext3/ext4.

That's all about "traditional" file systems, now it's time to discuss two file systems marked as "experimental" - btrfs and nilfs2.
btrfs
The btrfs file system has two basic modes - you can either enable or disable "copy-on-write". Let's see the behaviour with "copy-on-write" disabled, i.e. when btrfs behaves similarly to traditional file systems

This behaviour resembles ext4-journal, both during a checkpoint and when the checkpoint is not in progress. Let's see how btrfs behaves with "copy-on-write" enabled

That's a somewhat surprising result - the performance did not drop at all, and the behaviour during a checkpoint actually improved significantly. Don't ask me how they did it ...
nilfs2
The second experimental "log-based" file system is nilfs2, developed at NTT. And just like btrfs, the results are very interesting.

It's not as stable as with btrfs, but the checkpoint behaviour is quite good. Actually the performance at the beginning is significantly better than with btrfs, and if the NTT developers are able to improve the gradual decrease, it's very promising.
In any case, the improvement is possible, because for smaller database blocks the behaviour is much better. For example with 1kB blocks it looks like this

which looks great. But this may be influenced by the background writer config, because with 1kB blocks the background writer actually writes 8x less data than with 8kB blocks.
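The 8x factor follows from the background writer limits being expressed in pages, not bytes. Here is a rough sketch of the arithmetic, assuming the PostgreSQL defaults `bgwriter_lru_maxpages = 100` and `bgwriter_delay = 200ms` (the values actually used in this benchmark are not stated here, so treat them as an assumption):

```python
def bgwriter_max_bytes_per_sec(block_size_kb: int,
                               lru_maxpages: int = 100,
                               delay_ms: int = 200) -> float:
    """Upper bound on the data the background writer can write per second.

    The limit is lru_maxpages pages per round, with one round every
    delay_ms milliseconds, so the byte volume scales with the block size.
    """
    rounds_per_sec = 1000 / delay_ms
    return block_size_kb * 1024 * lru_maxpages * rounds_per_sec


small = bgwriter_max_bytes_per_sec(block_size_kb=1)
large = bgwriter_max_bytes_per_sec(block_size_kb=8)
print(f"1kB blocks: {small / 1024:.0f} kB/s, 8kB blocks: {large / 1024:.0f} kB/s")
print(f"ratio: {large / small:.0f}x")
```

So with the same page-based settings, a fair comparison between block sizes would need `bgwriter_lru_maxpages` scaled up for the smaller blocks.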
Surprises
The first surprise, just like with the read-only workload, is that btrfs with nodatacow (i.e. with copy-on-write disabled) is slower - not much, but it is.


The nodatacow option actually improves the performance only with write barriers disabled - in this case the nodatacow is much faster


Conclusion
Let's talk about some basic rules, just like in case of the read-only workload
- It does not make much sense to use small file system block sizes - a 4kB block may not improve the performance (e.g. with ext4-ordered), but it never hurts it.
- With write barriers enabled, the optimal database block size is 4kB (more precisely it's equal to the file system block size), but the differences between various block sizes are minimal.
- With write barriers disabled, the optimal block size is about 2kB and the performance significantly decreases as the block size increases. Unlike the read-only workload, most of the performance is not lost between 8kB and 16kB but between 2kB and 4kB.
- Ext4 and XFS are a good choice among the stable file systems. Both experimental file systems - btrfs and nilfs2 - look very promising too.





Thanks for this thorough benchmark and analysis. It's very rare to find someone going to this length to gather reliable data.
However, some of your comments about file systems are incorrect as far as I can tell.
"With a disabled journal (data=writeback)"
data=writeback does not "disable" the journal, it merely relaxes ordering constraints. File system metadata is still journalled, but the ordering between writing metadata (such as block maps) and file contents is not guaranteed. In the case of a crash, recently extended files (where writes didn't complete) may appear to have data from other, previously deleted files -- since the space was allocated, but never written to. But that's completely safe with PostgreSQL since it replays all non-fsync'ed writes from WAL anyway.
"File system XFS is designed to be a fully journaling file system right from the beginning (i.e. it's not a hybrid as the "ext" file systems)"
This is misleading. XFS's level of journalling is pretty weak -- similar to ext* data=writeback. But instead of unrelated data appearing in files, XFS will zero out the incompletely written blocks. It doesn't support ext's equivalent of 'ordered' and 'journal' modes. So if anything, ext* is more "fully" journalling than XFS. Of course for PostgreSQL workloads, this isn't useful.
Thanks for pointing out the misleading statements. I'm not an expert in this field and although I've learnt a lot when working on this benchmark, there's still a lot of things to learn. I'll update the article.