PostgreSQL has to die

When developing applications, every now and then it's necessary to test application's behaviour in case the database server suddenly "dies." In the PostgreSQL mailing list there was an interesting thread, describing several interesting ways how to simulate this (from very elegant ones to quite brutal ones). Let's see various ways of killing PostgreSQL ...

It's somehow naive to suppose the database will be available 100% of the time. Although you may believe the implementation of your database is flawless and will never "crash," there are many other possible problems that may cause non-availability of the database. Among others:

  • failure of the PostgreSQL process - errors in the implementation of the database server, errors or call to abort() in a C procedure, ...
  • administrator action - PostgreSQL server shutdown, reboot of the system, ...
  • problems related to the operation system - problems with resources (shortcoming of the  file descriptors, etc.)
  • hardware problems - limited responsiveness due to I/O susbystem overload, ...
  • network problems - broken connections, high latency, ...

It's important that different causes may result in different problems in the application - in some cases the application does receive information about closed connection, sometimes it just waits forever, sometimes the server replies with huge latency, etc.

If you're developing an important application that demands maximal availability, it's not good to rely on faultless database, network or hardware. On the contrary it's desirable that the application reacts correctly to the non-availability of the database (e.g. so it does not overburden the network and does not cause more problems than benefits), and may even recover on it's own.

Let's see the ways to simulate the possible causes of database non-availability ...

Failure of the PostgreSQL process

The first possibility is to simulate the failure of the PostgreSQL backend, servicing a given connection. As suggested by Craig Ringer, this may be simulated e.g. by calling a C procedure, that calls abort() function, or that performs an invalid operation (e.g. dereferencing a null pointer).

Besides that it's possible to kill the given process at the OS level - i.e. by calling a well known command 'kill'

kill -9 PID

where PID denotes ID of the backend process (you may use other signals, not only 9 i.e. KILL - for example SEGV seems to be a quite interesting option).

Another possibility is to use pg_ctl command

pg_ctl -m immediate stop

that performs "hard" shutdown of the PostgreSQL cluster.

Don't be fooled that there is no difference between PostgreSQL backend failure and crash of the whole server - it's not true. Details may be found in Craig Ringer's post, so let's check at least a brief summary.

PostgreSQL failure backend

In case of PostgreSQL backend failure, the client is quite well informed. If the client has an established connection to a failed backend, and sends a packet, it will receive a TCP RST response, signalling that the connection is broken. The server may even send a TCP FIN message, signalling that it's closing the connection.

If a client attempts to establish a new connection to a server, and the PostgreSQL is not running, it will receive TCP RST reply, i.e. it will be actively rejected.

On the other hand, if the backend is running but is unresponsive (e.g. there is a deadlock), the client will receive ACK resonse (a confirmation that the packets were delivered), but nothing will happen - at least until the buffers fill up). From the TCP point of view, everything works - the packets are delivered, and the unresponsiveness is not a TCP problem.

failure of the whole server

In case the whole server is unavailable, the situation is much more complicated, and maybe much worse, as the clients may not receive any response to the packets it sends. The server may even stop responding to ARP packages - in this case the nearest router may (but not necessarily) send an ICMP message 'not reachable.' Due to this, there may be enormous delays (timeouts) on the client side, until the TCP stacks decides that the connection "died," so the client may block on recv() and read(), etc.

I doubt you want to 'kill' the whole server brutally (although development only), but this may be quite well simulated at the network level - see below.

Problems at the operating system level

As to the problems at the operating system levels, let's focus on the problems related to the resources (memory, opened files, ...), respectively to the limits set by the operating system. In case the limit is reached, the operating system has to react somehow (kill a chosen process to release the memory, prevent opening of another file, etc.).

As noted by Vick Khera, one of such limits is for example the number of open files - either total or "per process." Reaching the "per process" limit (option max_files_per_process in postgresql.conf) is not very probable. Much more probable is reaching the limit on total number of open files, and simulating this is not very difficult - find out how many files are necessary by a single PostgreSQL proces, and decrease the limit so that starting a backend process causes reaching that limit.

Generally it's recommended to set the maximal number of PostreSQL backends (max_connections) so that after multiplying the maximal number of open files "per process" (max_files_per_process) does not exceed the system limit.

Hardware problems

Hardware problems - power failure, memory errors, drive errors - usually result in failure of the whole system. But there are even hardware problems that may result in overloading the I/O subsystem (problems with disk controllers, RAID arrays, etc.) and subsequently in an low responsibility of the whole system.

One of the options how to simulate this is using the SIGSTOP signal as posted by Craig Ringer.

Network problems

Probability of the network problems (high latency, broken connections, high packet loss, etc.) dramatically depends on you network architecture. If you have two servers, placed in the server room next to each other, the probability of such problems is quite low.

On the contrary if you have a PostgreSQL server and many users (using e.g. desktop applications connection the the database) from the whole building or even remotely (from the home using a dial-up etc.), the network problems may be expected and the application should be able to react.

Two simple and straightforward ways to simulate the inaccessibility, based on shutting down or blocking the network interface, as suggested by Greg Sabino Mullane and Ray Stell

# shutdown of the interface (Ray Stell)
$ ifconfig ethx down

# blocking the interface (Greg Sabino Mullane)
$ iptables -I INPUT -p tcp --dport 5432 -j DROP

If you need to simulate more complex network problems, you may use LARC (Linux Advanced Routing & Traffic Control) project, recommended by Craig Ringer, or Net:Netem recommended again by Craig Ringer.

As noted above, it's possible to simulate the failure of the whole server at the network level - just configure the network so that all the packets are lost.

Actions of the administrator

No one in unmistakable, and administrators are not an exception - every now and then an administrator does a mistake. For example it executes "sudo reboot" in a console pointing to a production PostgreSQL server, stops the PortgreSQL cluster by mistake, etc.

The inaccessibility may be caused by valid actions of the administrators, not by a mistake - for example in case of upgrade, etc.

Nevertheless the simulation of such problems is not difficult - either directly or by using instructions introduced in the previous sections.

Comments

There are no comments for this article (or are awaiting acceptance).

New comment

All the comments have to be accepted, so there may be some delay between submitting and accepting (or rejecting) the comment. If you enter the e-mail address, you will be informed about acceptance or rejection.

Subject or body may not contain HTML tags - they will be automatically removed. Paragraphs may be separated using a newline (ENTER).

(optional)