Benchmark results / HDD + TPC-H
Last week I've posted a bief analysis of the read-only and read-write OLTP parts of the benchmark, now it's time to look into the part dealing with DWH/DSS workload, based originally on the TPC-H benchmark. In brief, the average results for various block sizes are these

A more detailed description and definition of m-score will be presented later, for now it should suffice that higher values mean better performance. It's clear that larger blocks (file system and database) yield better performance when querying the data.
But does this hold for other tasks common in DWH applications, i.e. loading the data, creating indexes, foreign keys, collecting stats etc. And how do the various file systems perform?
As all the file systems perform about the same, this article analyzes the average behaviour first and then points out file systems that are somehow exceptional. If you want to see how results for the individual file systems, see those comparisons:
- comparing all file systems
- comparing file systems with write barriers enabled
- comparing file systems with write barriers disabled
Or you can check the list ofo results and choose whatever systems you want to compared. This article uses results with write barriers enabled, but disabling them really does not change the results (the read-only parts are not influenced at all and the result in read-write parts are consistently faster, i.e. all file systems gain about the same performance boost).
The TPC-H part of the benchmark was performing about these basic steps:
- load data to tables
- creating foreign keys and indexes
- collect statistics (ANALYZE with default_statistics_target=1000)
- queries (22 queries defined in the TPC-H benchmark)
And those steps are analyzed in the following sections ...
Loading the data
Let's see the data load first, i.e. how long it took (in seconds) to load all the data into bare tables (no indexes, no foreign keys, etc.) using a COPY from a CSV file.

Clearly, the file system block size is very important (4kB blocks are about about 60% faster than 1kB blocks), although the database block size matters too (but the difference between 1kB and 32kB is just about 15%).
Those dependencies are not equally strong for all file systems - e.g. for XFS, the file system block size is completely irrelevant and the database block size influence is very weak.

Ext4 performs significantly better than ext3, no matter what data mode you use. For example for data=journal mode, the results are


I.e. for 4kB file system blocks, ext4 is about 10% faster, for 1kB blocks the difference is about 40%. The improvement is even more noticeable for the "data=ordered" mode


i.e. the difference is about 20% for 4kB blocks and 50% for 1kB blocks. The "data=writeback" mode performs about the same as "dataordered."
The last file system worth mentioning is JFS, as it provides very good performance - it's actually a bit faster than the plain (not journaling) ext2

The other file systems (btrfs, nilfs2) provide about the average performance.
Foreign keys and indexes
The same, i.e. providing almost exactly the average performance, is true for creating indexes and foreign keys


The differences between file systems are noticeable mostly for the small file sytems blocks - see this comparison. For large file system blocks the worst performing file system is ext3 with "data=journal".

A very good performance in both tasks (creating indexes and foreign keys) provides XFS, and not just that it's fast, but the performance is not influenced by thee file system block size:

Now the data are loaded and we have created indexes and foreign keys, so it's time to collect stats (and then perform the queries, finally).
Collecting stats
The stats are collected using ANALYZE command, and the default_statistics_target value was increased to 1000. Once again, all the file systems performed about the same - very near the average performance

The only file system that stood out was nilfs2, as it provided rather bad performance

but as you can see, that's true just for the small file system blocks, for 4kB blocks it's not that far from the average.
Score
Before talking about how the file systems performed when executing queries, let me explain what a score is and why it's defined the way it is.
The TPC-H benchmark defines 22 types of queries, and when running this benchmark, each query was executed exactly once. At first it's logical to measure time for all queries, and use the sum for comparison (and this is marked as time in the results). But there are several problems
- If the database chooses an inefficient plan, the query may be running for a very long time and that would significantly extend the benchmark. If you set a timeout, you really don't know the actual time for the query that exceed it (all you know is that it's longer than the timeout).
- The goal of the benchmark is to compare file systems - but changing the database block size may cause a query plan change (because of different cost estimates - see the details here). So the same query may be evaluated in a different way, but those results are quite useless when you want to compare file systems.
- The runs that evaluate the most expensive queries fast are somehow favored - consider for example two queries, the first run evaluates then in 1s and 50s while a second run evaluates them in 10s and 40s. That means the total time for the first run is 51s and 50s for the other one, so the second run is "faster." But the first run executed the first query 10x faster and was by just 20% slower in case of the other query ...
The first two points may be easily solved by using just those queries that finished fine for all the compared runs, and where the execution plan did not change. This also means that the results are valid for the comparison only - if you choose a different set of runs, the score will be different.
You can actually do this filtering for time - that's what m-time is (aka "modified time"), i.e. it's a sum of time for all queries that finished fine and where the execution plan did not change.
The last point may be solved by defining a "score" that compares time to run a query agains a time of the fastest run, i.e.
minimal_query_titme / query_time
Thus the fastest evaluation gets score 1, the other runs get score inversely proportional to the run time.
Example - two runs, three queries (bold is used to emphasize the fastest run of a query)
| query n. 1 | query n. 2 | query n. 3 | |
| run n. 1 | 10 s | 15 s | 100 s |
|---|---|---|---|
| run n. 2 | 20 s | 20 s | 75 s |
z čehož pro jednotlivé dotazy plynou tato ohodnocení a celkové skóre
| query n. 1 | query n. 2 | query n. 3 | total score | |
| run n. 1 | 10/10 = 1 | 15/15 = 1 | 75/100 = 0.75 | 2.75 |
|---|---|---|---|---|
| run n. 2 | 10/20 = 0.5 | 15/20 = 0.75 | 75/75 = 1 | 2.25 |
By applying all three rulee, you'll get m-score (aka "modified score"). Just keep in mind that the values are not absolute and depend on the compared runs - if you compare runs A and B, you'll get different results than when comparing runs A and C.
If you want to compare total run time, use m-time. If you want to favor relative speed of each query, use the m-score.
m-time a m-score
The average m-time and m-score values are these


In both cases the performance is much better with large database block sizes. The differences between file systems are quite small.
Just like when loading data, the ext4 gives significantly better results than ext3 - for example for "data=journal" mode (notice the small file system blocks):


XFS provides very good performance, consistent across all block sizes

And again, nilfs2 performance is a bit disappointing

Conclusion
It's quite obvious that
- Just like for OLTP workload, it does not make sense to use small file system block sizes - 4kB blocks may not improve the performance (see for example XFS), but they never hurt it
- The dependency on database block size is clear - the larger the better, and the best performance (no matter if you talk about time or score) is achieved for 32kB blocks. When using smaller block sizes, the performance degrades continuously.
- If I had to choose a winner, I'd vote for XFS - not that it's very fast in all cases (actually significantly faster than average in all cases), but the performance is almost constant for all block sizes between 8kB and 32kB, which makes life much easier when you need to run various workload types.




