Demonstrating hot_standby overhead

A few days ago I went to Pilsen, to talk about database replication (general theory and some details about PostgreSQL replication) at a local JUG meeting. One of the questions you usually get after such talk is what is the overhead of a hot standby. The usual answer is "minimal" but that's not exactly a reliable answer ... so let's see a bit more thorough answer based on a simple benchmark.

In the replication and hot standby implementation in PostgreSQL, there are several places that may cause an overhead - first, there will be more data written into to the transaction log (WAL). Second, the streaming replication naturally brings overhead associated with network transfer of the data to the standby database. And finally, the hot standby may indirectly influence the primary database because of feedback (sending info about queries running on the standby etc.).

This article deals exclusively with the amount of WAL data - comparing amount of data written to WAL log with different wal_level settings, database activity (using pgbench).

I won't discuss the overhead of transfering the data to standby - I believe (or hope) you're using a reasonable network controllers, and with asynchronous replication this is not that big a deal (let's see the analysis in case of synchronous replication that will be in 9.1). Well, and analysis of feedback (see hot_standby_feedback) is too dependent on how the standby is actually used so I don't dare to present any analysis here.

Test procedure

The test procedure is very simple - create a database, populate it with data, run pgbench. Before and after each of this step check the position in xlog using pg_current_xlog_location and pg_xlogfile_name_offset functions - the former returns the current position encoded as uncomprehensible hex string, the latter parses it into file ID (again hex encoded) and a position within the file. So to get a current position you can do about this:

pgbench=# select pg_xlogfile_name_offset(pg_current_xlog_location());
       pg_xlogfile_name_offset      
-------------------------------------
 (00000001000000000000003D,10934732)

which corresponds to position 1,034,344,908B, because "3D" is 61 in decimal and 61*16MB + 10,934,732B is 1,034,344,908B. Do this before and after the operation, subtract the values and you'll get amount of WAL data written during the operation.

Note: I recommend to turn off autovacuum daemon, as it generates WAL data too when running, and it'd obviously interfere with our benchmark.

PostgreSQL supporrts three WAL levels, depending on what level of protection you need:

  • minimal - writes just the amount of WAL needed to do a local recovery (some bulk operations - e.g. COPY do not generate any WAL logs at this level)
  • archive - a recommended level if you want to use PITR and/or archive WAL logs (writes all operations including COPY etc.), and allows "warm standby"
  • hot_standby - new in 9.0, logs the same amount of data as archive level plus info about running transactions

For each of these levels several pgbench runs vere executed with various parameters - so that in all cases the total number of transactions was 10,000. The actual parameters were:

  • 1 client (10000 transactions)
  • 10 clients (1000 transactions for each)
  • 50 clients (200 transactions for each)
  • 100 clients (100 transactions for each)

The environment was completely rebuilt before each pgbench run, i.e. the database was dropped and populated. The details of how this was done can be seen in a very simple script used for testing - bench.sh. For each of the runs the amount of data written to log was measured, and the results are published in the following sections.

Results

Let's see a bit surprising results - fist there is a table of the data (the values indicate the amount of data written into the WAL log, in kB):

  minimal archive hot standby
init db 79 166679 166682
1 client 79767 5044 5045
10 clients 79573 5234 5236
50 clients 80751 5591 5592
100 clients 80718 5722 5739

When visualized using a chart, the data look like this:

množství wal dat pro různé úrovně wal_level - bez checkpointu

The database initialization (populating with data) seems OK - for "wal_level=minimal" there are almost no data written (which corresponds to the definition of the "minimal" level). But why does the actual benchmark run writes so much data compared to "archive" and "hot standby" levels (about 80MB vs. 5MB)?

That's a bit weird, right - the "minimal" level somehow subconsciously implies that it writes less data than the two other levels ... and not 15x more as in this case.

The cause of this lies in full_page_writes feature, that says that on the first change of a page since the last checkpoint you have to write the full 8kB page and not just the difference. But the "minimal" level behaves a bit like a checkpoint - it does not write any WAL data during the load, but on the first change since the load you need to write the complete page.

OTOH, at the "archive" and "hot standby" levels the complete page is written to the WAL during the load (as you can see in the chart), and then just the differences during the actual benchmark (which are much smaller).

In the extreme case the bechmark at the "minimal" level might write about the same amount of data as the load at the two other levels (archive / hot standby), i.e. the amount of WAL would be about the same. This would happen if the bechmark modifies all pages.

Results with an explicit checkpoint

So let's see how an checkpoint executed after populating the database influences the results:

  minimal archive hot standby
init db 79 166679 166746
1 client 79711 79296 79583
10 clients 79998 80367 80121
50 clients 79875 79949 79764
100 clients 80320 80006 80618

When visualized using a chart, the data look like this:

množství wal dat pro různé úrovně wal_level - bez checkpointu

As you can see, after an explicit checkpoint the differences between various WAL levels disappear. The only difference that remains is the small amount of WAL data written during load (COPY) at the "minimal" level.

As you can see from the numbers and chart, there's almost no difference between "archive" and "hot standby" levels - it might be a statistical error (the difference is just a few dozens or hundred bytes), so if you're already using "archive" level then the change to "hot standby" won't influence you at all.

If you're using just the "minimal" WAL level, it really does not matter if you set "archive" or "hot standby" - the amount of data is exactly the same. If you're considering to use "hot standby" then you probably neeed high availability and this is simply the price. Plus the advantages of "minimal" are related to bulk operations solely, so if you're not loading huge amounts of data using COPY every other day, you don't get any benefits (and what you gain you may actually loose because of the full_page_writes after the first checkpoint - see above).

Conclusion

As you can see from this simple test, overhead associated with the "hot standby" is really negligible (at least considering the amount of data written to WAL).

Comments

There are no comments for this article (or are awaiting acceptance).

New comment

All the comments have to be accepted, so there may be some delay between submitting and accepting (or rejecting) the comment. If you enter the e-mail address, you will be informed about acceptance or rejection.

Subject or body may not contain HTML tags - they will be automatically removed. Paragraphs may be separated using a newline (ENTER).

(optional)