PostgreSQL has to die
When developing applications, every now and then it's necessary to test application's behaviour in case the database server suddenly "dies." In the PostgreSQL mailing list there was an interesting thread, describing several interesting ways how to simulate this (from very elegant ones to quite brutal ones). Let's see various ways of killing PostgreSQL ...
It's somehow naive to suppose the database will be available 100% of the time. Although you may believe the implementation of your database is flawless and will never "crash," there are many other possible problems that may cause non-availability of the database. Among others:
- failure of the PostgreSQL process - errors in the implementation of the database server, errors or call to abort() in a C procedure, ...
- administrator action - PostgreSQL server shutdown, reboot of the system, ...
- problems related to the operation system - problems with resources (shortcoming of the file descriptors, etc.)
- hardware problems - limited responsiveness due to I/O susbystem overload, ...
- network problems - broken connections, high latency, ...
It's important that different causes may result in different problems in the application - in some cases the application does receive information about closed connection, sometimes it just waits forever, sometimes the server replies with huge latency, etc.
If you're developing an important application that demands maximal availability, it's not good to rely on faultless database, network or hardware. On the contrary it's desirable that the application reacts correctly to the non-availability of the database (e.g. so it does not overburden the network and does not cause more problems than benefits), and may even recover on it's own.
Let's see the ways to simulate the possible causes of database non-availability ...
Failure of the PostgreSQL process
The first possibility is to simulate the failure of the PostgreSQL backend, servicing a given connection. As suggested by Craig Ringer, this may be simulated e.g. by calling a C procedure, that calls abort() function, or that performs an invalid operation (e.g. dereferencing a null pointer).
Besides that it's possible to kill the given process at the OS level - i.e. by calling a well known command 'kill'
kill -9 PID
where PID denotes ID of the backend process (you may use other signals, not only 9 i.e. KILL - for example SEGV seems to be a quite interesting option).
Another possibility is to use pg_ctl command
pg_ctl -m immediate stop
that performs "hard" shutdown of the PostgreSQL cluster.
Don't be fooled that there is no difference between PostgreSQL backend failure and crash of the whole server - it's not true. Details may be found in Craig Ringer's post, so let's check at least a brief summary.
PostgreSQL failure backend
In case of PostgreSQL backend failure, the client is quite well informed. If the client has an established connection to a failed backend, and sends a packet, it will receive a TCP RST response, signalling that the connection is broken. The server may even send a TCP FIN message, signalling that it's closing the connection.
If a client attempts to establish a new connection to a server, and the PostgreSQL is not running, it will receive TCP RST reply, i.e. it will be actively rejected.
On the other hand, if the backend is running but is unresponsive (e.g. there is a deadlock), the client will receive ACK resonse (a confirmation that the packets were delivered), but nothing will happen - at least until the buffers fill up). From the TCP point of view, everything works - the packets are delivered, and the unresponsiveness is not a TCP problem.
failure of the whole server
In case the whole server is unavailable, the situation is much more complicated, and maybe much worse, as the clients may not receive any response to the packets it sends. The server may even stop responding to ARP packages - in this case the nearest router may (but not necessarily) send an ICMP message 'not reachable.' Due to this, there may be enormous delays (timeouts) on the client side, until the TCP stacks decides that the connection "died," so the client may block on recv() and read(), etc.
I doubt you want to 'kill' the whole server brutally (although development only), but this may be quite well simulated at the network level - see below.
Problems at the operating system level
As to the problems at the operating system levels, let's focus on the problems related to the resources (memory, opened files, ...), respectively to the limits set by the operating system. In case the limit is reached, the operating system has to react somehow (kill a chosen process to release the memory, prevent opening of another file, etc.).
As noted by Vick Khera, one of such limits is for example the number of open files - either total or "per process." Reaching the "per process" limit (option max_files_per_process in postgresql.conf) is not very probable. Much more probable is reaching the limit on total number of open files, and simulating this is not very difficult - find out how many files are necessary by a single PostgreSQL proces, and decrease the limit so that starting a backend process causes reaching that limit.
Generally it's recommended to set the maximal number of PostreSQL backends (max_connections) so that after multiplying the maximal number of open files "per process" (max_files_per_process) does not exceed the system limit.
Hardware problems
Hardware problems - power failure, memory errors, drive errors - usually result in failure of the whole system. But there are even hardware problems that may result in overloading the I/O subsystem (problems with disk controllers, RAID arrays, etc.) and subsequently in an low responsibility of the whole system.
One of the options how to simulate this is using the SIGSTOP signal as posted by Craig Ringer.
Network problems
Probability of the network problems (high latency, broken connections, high packet loss, etc.) dramatically depends on you network architecture. If you have two servers, placed in the server room next to each other, the probability of such problems is quite low.
On the contrary if you have a PostgreSQL server and many users (using e.g. desktop applications connection the the database) from the whole building or even remotely (from the home using a dial-up etc.), the network problems may be expected and the application should be able to react.
Two simple and straightforward ways to simulate the inaccessibility, based on shutting down or blocking the network interface, as suggested by Greg Sabino Mullane and Ray Stell
# shutdown of the interface (Ray Stell) $ ifconfig ethx down # blocking the interface (Greg Sabino Mullane) $ iptables -I INPUT -p tcp --dport 5432 -j DROP
If you need to simulate more complex network problems, you may use LARC (Linux Advanced Routing & Traffic Control) project, recommended by Craig Ringer, or Net:Netem recommended again by Craig Ringer.
As noted above, it's possible to simulate the failure of the whole server at the network level - just configure the network so that all the packets are lost.
Actions of the administrator
No one in unmistakable, and administrators are not an exception - every now and then an administrator does a mistake. For example it executes "sudo reboot" in a console pointing to a production PostgreSQL server, stops the PortgreSQL cluster by mistake, etc.
The inaccessibility may be caused by valid actions of the administrators, not by a mistake - for example in case of upgrade, etc.
Nevertheless the simulation of such problems is not difficult - either directly or by using instructions introduced in the previous sections.



