Filesystem vs. PostgreSQL block size
Recently I've been asked how is the database performance affected by the block size (for filesystem and database), especially with respect to SSD drives that have 512kB internal blocks while traditional spinning HDD use 512B or 4096B sectors.
When someone asks about this on pgsql-performance list, the usual answer is "Use 8kB, Luke!" because 8kB is a reasonable compromise and using a different block size probably won't give you much. But is that really true? Is that true for all file systems? And does that hold for all workloads? And what about the SSD drives? And what about the filesystem block size?
Interesting questions, but it's impossible to answer them without hard data ... time for a benchmark!
Initially I wanted to run just few quick tests on the SSD drive I've bought a few weeks ago (Intel 320, luckily it did not shrink to 8MB yet), but then I started to add more and more options to the benchmark. First I realized I should run the same tests on a regular HDD for comparison. And when I'm doing all this, I should not test just ext4 I use most of the time, but other filesystems too (ext2, ext3, reiserfs, jfs, xfs). And what about various journaling methods (writeback, ordered, journal) in case of ext3/ext4 filesystems? And what the new experimental filesystems, btrfs and nilfs2, that might be interesting too.
And when I'm going to test this, let's not test just using the OLTP-like workload generated using pgbench, let's use some DWH-like benchmark. That's basically the reason why I was working on a TPC-H benchmark recently.
I believe the collected data may be interesting to others, so I've decided to publish it here. This post is just an initial description of the benchmark - what was tested, what data were collected etc. I'm still analyzing the results, I'll make them available over the next few days.
If you want to dig in the data on your own, the HDD results are available here (200MB archive, about 750MB after extraction). Results for SSD will be published later.
But I don't want just to publish here the commented results - I'd like to get some feedback. Do you think the results are strange, because of some stupid config? Do you think it's possible to tune one of the filesystems to give much better results? Well, let me know - either by leave a comment below the article, by e-mail, or start a thread on a suitable mailing list.
What was tested
So what combinations of parameters were actually tested? First, the filesystem block size (and in case of ext3/ext4 journaling methods).
| souborový systém | velikosti bloku | další volby | počet kombinací |
| ext2 | 1024, 2048, 4096 | 3 | |
| ext3 | 1024, 2048, 4096 | writeback, ordered, journal, (no)barrier | 18 |
| ext4 | 1024, 2048, 4096 | writeback, ordered, journal, (no)barrier | 18 |
| jfs | 4096 | 1 | |
| xfs | 512, 1024, 2048, 4096 | (no)barrier | 8 |
| reiserfs | 4096 | flush/none | 2 |
| btrfs | 4096 | (no)datacow, (no)barrier | 4 |
| nilfs2 | 1024,2048,4096 | (no)barrier | 6 |
| celkem | 60 |
That makes 60 combinations. PostgreSQL allows to set the block size to 1kB, 2kB, 4kB, 8kB, 16kB and 32kB, which increases the number of combinations to 360 (6x60). And by running this on SSD and HDD, the number of combinations grows to 720.
Update 2011/08/09: As intgr suggested, I've added btrfs with "nodatacow" option, so the number of filesystems increased to 32 (and number of tested combinations to 192).
Update 2011/08/09: When analysing the results I've realized some of the results are not comparable, because the file systems are paranoind and thus use write barriers (ext4, xfs, nilfs2 and btrfs) while the others don't. That obviously makes the results uncomparable, so I have to add other 17 combinations, i.e. about 4 days of runtime, and delay the results a bit.
Update 2011/08/30: I've updated the table to describe the final benchmark, with 360 combinations tested (instead of the original 180) - mostly due to adding barrier/nobarrier variants.
I'm not an expert in the field of file systems, so I haven't tuned any other parameters except those mentioned above (block size and journaling method), with the exception of SSD-related options (TRIM etc.). If you believe some option for one of the filesystems might significantly improve the results, let me know and I'll try to verify that. This is particularly true for nilfs2 and btrfs, as I don't have much experience with them.
Note: PostgreSQL allows to set block size for data files and WAL segments independently, but I've decided to ignore. The number of combinations would grow 6x to 2232, and with just one hour per combination that would mean about 93 days of runtime. So both values were always set to the same value and WAL block size 64kB was not tested at all (datafile block size is at most 32kB).
Benchmark description
The benchmark consists of two main parts:
-
pgbench (TPC-B-like stress test, mostly OLTP workload)
- scale=300
- read-only test - 5 minutes, 10 clients (HDD) or 20 clients (SSD)
- read-write test - 5 minutes, 10 clients (HDD) or 20 clients (SSD)
- shared_buffers = 1GB, effective_cache_size = 512MB
- work_mem = 16MB, maintenance_work_mem = 256MB
- checkpoint_segments = 16
- postgresql-hdd-pgbench.conf
- postgresql-ssd-pgbench.conf
-
TPC-H-like benchmark (DSS/DWH workload)
- scale=2
- time for each of the 22 queries
- shared_buffers = 1GB, effective_cache_size = 512MB
- work_mem = 128MB, maintenance_work_mem = 256MB
- checkpoint_segments = 16
- default_statistics_target = 1000
- postgresql-hdd-tpch.conf
- postgresql-ssd-tpch.conf
Originally I wanted to run the benchmark with full 4GB of RAM and about 8GB of data, but that turned out to be too time consuming because each run (on HDD) took about 2 hours. With 186 days that would mean 15 days of runtime (not counting runs on SSD). So I've reduced the amount of memory to 2GB (using a "mem=2G" kernel parameter), and the amount of data to 4GB (i.e. about 2x the amount of RAM).
There's a wide amount of information collected during the benchmark, e.g.:
- pgbench: tpc/s, log with individual transactions (for latency analysis)
- tpc-h: runtime for each query
- iostat / vmstat - each 15 seconds
- cluster statistics (pg_stat_database, pg_stat_bgwriter) - before and after each benchmark
- information about load, creating indexes etc.
- explain plans for TPC-H benchmark (i.e. 22 queries)
- vmstat -s, vmstat -d before/after each benchmark phase (r/o, r/w, tpc-h) and before/after each query in tpc-h benchmark
- a lot more other information ...
Not all the data will be used for the following posts, that'd be way too much information, but I'll provide all the data and it's useful to know about it.
Environment
The benchmark was executed on the following hardware:
- Intel i5-2500 (4 cores, 3.3 GHz)
- 4GB DDR3 RAM, 1333 MHz (limited to 2GB using "mem=2G")
- MB Asus P8Z68-V PRO
- Seagate Constellation ES, 1TB, SATA 3Gb/s (model ST31000524NS)
- SSD Intel 320, 120GB, SATA 3Gb/s (model SSDSA2CW120G310)
So it's a quite good workstation. The used drives (SSD and HDD) are not high-end but good drives for reasonable price. However the goal of this benchmark was not to benchmark particular drive models but rather general behavior with various types of drives. It's irrelavant whether the drive is 5% faster or slower compared to other drive.
The drives were dedicated to the benchmark - there was nothing else, except the benchmark data. Operating system and PostgreSQL binaries were located on another drive.
Write cache on the drive was enabled - I was thinking about disabling it, but in principle you can see that as a controller with small write cache (64MB). Disabling the cache would not signigicantly change the results - I've tried to run some of the tests and the difference was negligible and mostly consistent across all filesystems.
The operating system was 64-bit Gentoo Linux with 2.6.39-gentoo-r3 kernel, and the basic tuning was done. E.g. swapping in sysctl.conf
vm.swappiness = 0 vm.overcommit_memory = 2
lower limits for dirty items in page cache and noop elevator for the SSD device
# echo 10 > /proc/sys/vm/dirty_ratio # echo 5 > /proc/sys/vm/dirty_background_ratio # echo noop > /proc/block/sdc/queue/scheduler
Note: There are definitely interesting setups worth of testing. For example I'd like to see results for bit RAID arrays, depending on stripe size (just as this benchmark varies filesystem block size). Results for FreeBSD or a PPC machine could be interesting too, etc. If you can provide such system, let me know and I'll help you to run this benchmark (the script is available for download below).
Scripts and data
The whole benchmark is implemented as a simple shell script, available for download here zde. It was not designed to be extremely portable, but with a bit of effort you cn make it run on your system. After extracting the package, you'll get these items
- bench.sh - the main script that runs the whole benchmark
- drop-cache.sh - script used to clean page cache before the benchmark
- collect.php - script that processes the results and generates CSV file and images
- dss-bench/ - place for the TPC-H data (need to be generated)
- dss-bench/queries - place for the TPC-H queries (need to be generated)
The procedure is quite simple:
- download the package and extract it to a suitable directory
- generate the data and queries for TPC-H benchmark (see article about a TPC-H benchmark)
- place the TPC-H data (in CSV format) to the dss-bench directory
- place the TPC-H queries to the dss-bench/queries directory
- modify the configuration at the beginning of bench.sh (see below)
- run the bench.sh script
The configuration is done using variables at the beginning of the bench.sh script:
# device for the benchmark, where to mount it etc. DEVICE=/dev/sdg1 MOUNTPOINT=/mnt/pgdata DATADIR=$MOUNTPOINT/data GROUP=users # where are the postgresql binaries (with various block sizes) PGBINS=./pgbin RESULTS=./results-hdd # create this file to gracefully stop the benchmark STOPFILE=bench.stop # delay between stats collections (iostat, vmstat, ...) DELAY=15 # pgbench PGB_SCALE=300 PGB_CONFIG=./postgresql-hdd-pgbench.conf # pgbench / warmup PGB_WCLIENTS=1 PGB_WTIME=60 # 1 minute for warmup # pgbench / benchmark PGB_CLIENTS=10 PGB_TIME=300 # 5-minute benchmark # TPC-H benchmark DSS_CONFIG=./postgresql-hdd-tpch.conf DSS_TIMEOUT=300 # 5 minutes in seconds
Meaning of those variables is probably quite obvious, so just a very brief description
-
general options
- DEVICE - the partition used for the benchmark (you definitely want to change this)
- MOUNTPOINT - mount point for the device
- DATADIR - PostgreSQL data directory (subdirectory of MOUNTPOINT)
- GROUP - group owning the directory (set using sudo)
- PGBINS - directory where are PostgreSQL binaries for various block sizes (in subdirectories bs-1, bs-2, ..., bs-32)
- RESULTS - directory where the results are stored (in filesystem/pg-block-size/fs-block-size subdirectories, and then pgbench/ro, pgbench/rw or tpch)
- STOPFILE - by creating this trigger file, you may "gracefully" stop the benchmark (i.e. the currently running part will finish and then stop)
- DELAY - interval between iostat and vmstat statistics collection
-
pgbench options
- PGB_SCALE - determines the amount of data (passed as the "-s" option)
- PGB_CONFIG - PostgreSQL config file (will be copied to the DATADIR)
- PGB_WCLIENTS - number of clients for warmup before the benchmark
- PGB_WTIME - length of warmup (in seconds)
- PGB_CLIENT - number of clients for the benchmark
- PGB_TIME - length of the benchmark (in seconds)
-
TPC-H options
- DSS_CONFIG - PostgreSQL config file (will be copied to the DATADIR)
- DSS_TIMEOUT - timeout for the queries (then the query will be killed)
The benchmark executes several commands that require root privileges (e.g. mkfs, mount, drop page cache etc.) which is done through sudo, so the user running the benchmark needs to have the right to run sudo. For example you can add this line to the /etc/sudoers file
%wheel ALL=(ALL) NOPASSWD: ALL
which gives the right to run sudo without the need to enter password, to all member of the wheel group (or you can give that to a single user). There are probably other, maybe better, ways to achieve the same goal.
The benchmark may not run in one pass - it's possible to terminate it "gracefully." All you have to do is to create the file in STOPFILE variable (bench.stop by default):
$ touch bench.stop
The current step will finish and then the script will terminate correctly (removes working data, unmounts the device etc.).
Obviously, you have to remove the file before running the script again, otherwise the benchmark would terminate again. The already tested combinations of parameters will be skipped.
If you need to terminate the script immediately, just use Ctrl+C, but then you'll have to remove the output directories (see $RESULTS) with incomplete data.
Compiling PostgreSQL
The benchmark works with various PostgreSQL block sizes, which means you need several separately compiled PostgreSQL binaries (because block size is set during compilation). The benchmark expects to find directories bs-1, bs-2, ..., bs-32 in $PGBINS directory.
If you're in the directory with PostgreSQL source codes, you can do this:
#!/bin/sh
PGBINS=/home/user/pg-benchmark/pg-9.0.4/
for bs in 1 2 4 8 16 32; do
make distclean;
./configure --prefix=$PGBINS/bs-$bs \
--with-wal-blocksize=$bs \
--with-blocksize=$bs;
make install;
done;
And now everything should be prepared for the benchmark ...
Processing the data
The script collects a lot of various information (the more the better) that are written to the directory structure in $RESULTS. The collect.php script processes the collected data and generates a CSV file and a set of images.
$ php collect.data results-data results.csv





With btrfs, you should also benchmark options "nodatasum" and "nodatacow", that makes its behavior more similar to other file systems.
OK, thanks for pointing that out. I'll look into that and rerun the benchmark with these options.
I have to admit, I am a little skeptical of your postgres settings, have you done any benchmarks to tune them? I think if I had a server with 2GB of RAM, I'd be more inclined to do 512MB shared_buffers, and 1GB (or maybe 1.5GB) of effective_cache_size. Also I think you ought to do checkpoint_segments = 30, and checkpoint_completion_target = .9. This should increase average write load, while decreasing checkpoint spikes. While I suppose as long as you keep the config the same while swapping drives you'll still get an even comparison, I think it might be worth using values more likely to match a production system.
I guess you're right. I've done some basic tuning but the settings probably is not perfect. That's particularly true for checkpoint_segments and checkpoint_completion_target. I've realized this about half-way through the benchmark and I've decided to keep it like that. It's consistent for all the benchmarks and it should influence just the parts that write data.
I expect to rerun the benchmark with modified settings and comments like this are an important source of info, so thanks a lot for the feedback.
You should try different readahead settings !
For example, on Linux, by default, read ahead is 256KB.
Maybe you can try setting it to 1024, or 4096, or even 8192.
There is a huge difference on some systems (raid arrays, with no adaptive readahead raid controller). (I experienced x5 seq. throughput on 12x 300G SAS 15k, with no penalty on latency)
You can modify readhead using blockdev for example.
blockdev --getra /dev/sda
blockdev --setra 4096 /dev/sda
Thanks for your work, sounds interresting ! -:)
I've tried with different read-ahead setting (actually the second run was with 8192), but I haven't seen any significant difference. Might be caused by that I'm using a single 7.2k drive.
Any idea when you would be testing with SSD?
The benchmark with SSD is running right now - there are 360 runs in total, 200 already finished and each run takes about 40 minutes. That means the SSD results will be available on Friday evening or maybe after the weekend.