PGMon - partitioning

The amount of data collected by PGMon may be quite significant, and it may cause serious problems regarding their processing (the more data the slower queries) and maintenance (collected data become stale quite quickly, and deleting large amounts of data may cause problems in PostgreSQL). Partitioning represents relatively effective solution of these problems.

Let's talk about a middle-sized database - 100 tables, 200 indexes (two for each table - primary key and another index), and lets estimate the amount of data collected if the samples are read each 5 minutes. For each sample about a table resp. index about 360B resp. 320B is needed. Each month there will be about 8928 rows for each object, giving 892800 rows for tables and 1785600 rows for indexes. Regarding the size, this means about 300MB of data about tables, and about 540MB of data about indexes.

This amount of data may mean serious problems - a simple rule "the more data, the slower queries" holds. Collected data do go out quite fast (you'll use data older that a month only rarely) so you don't need to keep too old data on-line, i.e. you may backup and delete old data. But removal of large amounts of data of large may cause problems with PostgreSQL (bloating).

An effective solution of these problems is partitioning according to date of collection, i.e. usage of inheritance and triggers or rules. Partitioning is not used in the standard distribution of PGMon project, but it's quite simple to define it - you just need to define partitions and rules redirecting the data to the proper partition.

The following SQL script creates two partitions of the "Tables" table for January and February 2009, and two rules redirecting the data:

-- partition for January 2009
CREATE TABLE Tables_2009_01 (
    PRIMARY KEY (stat_time, dbmane, schemaname, tablename),
    CHECK ( stat_time >= '2009-01-01' AND stat_time < '2009-02-01' )
) INHERITS (Tables);

-- partition for February 2009
CREATE TABLE Tables_2009_02 (
    PRIMARY KEY (stat_time, dbmane, schemaname, tablename),
    CHECK ( stat_time >= '2009-02-01' AND stat_time < '2009-03-01' )
) INHERITS (Tables);

-- rule for January 2009
CREATE RULE Tables_insert_2009_01 AS
ON INSERT TO Tables WHERE
    ( stat_time >= '2009-01-01' AND stat_time < '2009-02-01' )
DO INSTEAD
    INSERT INTO Tables_2009_01 VALUES (NEW.*);

-- rule for February 2009
CREATE RULE Tables_insert_2009_02 AS
ON INSERT TO Tables WHERE
    ( stat_time >= '2009-02-01' AND stat_time < '2009-03-01' )
DO INSTEAD
    INSERT INTO Tables_2009_02 VALUES (NEW.*);

When defining the partitions, you should choose the interval carefully - not too large, not too short. According to my experience a month is a wise choice, and I use it in most cases. In several cases I've used a week or month, but shorter or longer intervals are not very useful.

You have to create partitions regularly - it does not matter if you do that manually or automatically (by a cron script for example), even if the automatic way seems better to me. But don't create too much partitions at once, as all the rules are evaluated in all cases, and that may cause performance problems for large number of rules. I do recommend dropping rules for old partitions (e.g. previous month) for the same reason.

And yet another note about rules - you have to explicit values of all the columns used in conditions of RULES (it's not possible to use NULL and rely on the default value defined for the table), as the values are needed when evaluating the rules.

Comments

There are no comments for this article (or are awaiting acceptance).

New comment

All the comments have to be accepted, so there may be some delay between submitting and accepting (or rejecting) the comment. If you enter the e-mail address, you will be informed about acceptance or rejection.

Subject or body may not contain HTML tags - they will be automatically removed. Paragraphs may be separated using a newline (ENTER).

(optional)