Automatic management of partitions in PostgreSQL
If you plan to partition a continually growing table into parts of limited size, you have to find a suitable way to create new partitions. It's not advisable to do that manually (at least not on a production system) - it's much better to do it automatically. Let's look at several possible solutions ...
The purpose of this article is to demonstrate three possible ways to perform automatic management of partitions for a given table, compare their advantages and disadvantages, and analyze their usability for various scenarios.
Say we want to partition a table with this (very simple) structure:
CREATE TABLE table_base (
    id      SERIAL PRIMARY KEY,
    created TIMESTAMP DEFAULT now() NOT NULL,
    data    TEXT
);
that is used to store more and more data over time. Typical examples of such tables are tables storing information about visits to a web application, tables used for logging, etc. The goal is to partition the table so that data are always written to a single partition, which is a big advantage when performing backups, recovery and removal of stale data (as dropping a whole partition is much more efficient than deleting the data row by row).
Such tables may be partitioned either by an artificial primary key (continuously growing value in a column "id" read from a sequence), or by a timestamp stored in a "created" column. In the former case the partitions may be of equal size (may contain the same number of rows), in the latter case the partitions may cover equally long intervals of time (e.g. day, week or a month). Obviously the choice of column used for partitioning is one of the crucial decisions - see below.
Another important decision is how the partitioning will be performed - whether the data are "redirected" from the base table to the partitions by a trigger (with dynamic or static inserts), or by rules. A more detailed description of these approaches may be found in my article comparing the performance of partitioning using triggers and rules, but let's give at least brief definitions:
- rules - rows are redirected using rules created on a base table (one rule for each partition)
- triggers without dynamic SQL queries - rows are redirected to partitions using a trigger containing a separate INSERT statement (encapsulated in an IF condition)
- triggers with dynamic SQL queries - rows are redirected to partitions using a trigger containing a dynamic INSERT, executed using an EXECUTE command (i.e. the partition name is deduced from the row)
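To make the third option concrete, a dynamic-SQL partitioning trigger for the example table might look like the following sketch. The monthly naming scheme "table_base_YYYY_MM" and the function name are illustrative assumptions (and EXECUTE ... USING requires PostgreSQL 8.4 or later):

```sql
-- A sketch of the dynamic-SQL approach: the partition name is derived from
-- the row itself, so the trigger does not change when partitions come and go.
CREATE OR REPLACE FUNCTION partition_rows_fn() RETURNS trigger AS $$
BEGIN
    EXECUTE 'INSERT INTO table_base_' || to_char(NEW.created, 'YYYY_MM')
         || ' SELECT ($1).*' USING NEW;
    RETURN NULL;   -- NULL means the row is not stored in the base table itself
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER partition_rows BEFORE INSERT ON table_base
    FOR EACH ROW EXECUTE PROCEDURE partition_rows_fn();
```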
This article is not about partitioning itself, but about automating the process of creating / dropping partitions. Let's see what it means to create a new partition (and drop an old one) with each of the three partitioning methods, and the related tasks:
- creation of a new partition (table)
- creation of a rule for the new partition / modification of a trigger so that it inserts data into the new partition
- removal of a rule for the old partition / modification of a trigger so that it does not insert into the old partition
- removal of the old partition (table)
Creating a new partition
Another important choice is the proper timing of the partition creation - in general there are two options:
- when the partition is actually needed (i.e. when the first row belonging to it is inserted)
- in advance (e.g. when there are fewer than 1000 free IDs left in the last partition, or when the time interval of the last partition ends in 10 minutes, etc.)
This is obviously related to the timing of the removal of old partitions - see below.
Creating a rule / modifying a trigger
When a new partition is created, it's necessary to modify the trigger or rules performing the partitioning, so that data are inserted into the new partition too. What exactly has to be done depends on how the partitioning is performed, namely:
- rules - you have to create a new rule (for the new partition)
- trigger without dynamic SQL queries - you have to add an INSERT for the new partition
- trigger with dynamic SQL queries - no modification needed, it works automatically
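With rule-based partitioning, the first case is plain DDL. A hedged example using a monthly naming scheme (the rule and table names are illustrative assumptions):

```sql
-- create a rule redirecting rows for May 2011 into the new partition
CREATE RULE table_base_insert_2011_05 AS ON INSERT TO table_base
    WHERE (created >= DATE '2011-05-01' AND created < DATE '2011-06-01')
    DO INSTEAD INSERT INTO table_base_2011_05 VALUES (NEW.*);

-- and later, remove the rule for the oldest partition before dropping its table
DROP RULE table_base_insert_2011_03 ON table_base;
```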
Removing an old partition
The question is - just as in the case of partition creation - when should an old partition be removed? The simplest answer is to perform the removal at the same time a new partition is created.
The goal is to keep only the data that are still needed - e.g. in case of data about visits of a web application, this usually is not more than a few months (after processing and creating reports, the raw data are not needed anymore). That means you have to decide how many partitions should be kept in the database, and when creating a new partition a too old partition should be dropped. In the examples (see below) there are always only two partitions kept, but you may choose a different number.
As noted in the article comparing the performance of partitioning using triggers and rules, the partitioning performance is inversely proportional to the number of "live" partitions, i.e. partitions with corresponding rules on the base table, or partitions listed in the trigger. It's very important to remove unnecessary rules (and keep the partitioning trigger as simple as possible):
- rules - you should remove the old (inactive) rule (keep the minimum number of active rules)
- trigger without dynamic SQL queries - you should remove the INSERT corresponding to the old partition (keep the number of IF branches as low as possible)
- trigger with dynamic SQL queries - not necessary to modify
Ways of automation
We still have to decide how the partition management will be automated. Generally there are two basic ways to do that:
- using a trigger, executed before each row is inserted, that checks whether it is the "right time" to create a new partition or remove an old one, and if so creates / drops them
- using a cron, i.e. a script periodically executed by a cron daemon (or any other way to execute a script periodically)
It might appear that both methods of automation are equivalent, but they are not - unlike the cron script, the trigger receives information about each new row. Let's check the advantages and disadvantages of the two methods.
The cron method is based on periodic execution of a (shell) script that checks whether it is necessary to create a new partition, and if so creates it (just like the trigger described in the next section), possibly removes an old partition, modifies the rules and triggers, etc.
This solution has following advantages and disadvantages:
- (+) ideal for partitioning by time
- (+) no per-row overhead as with a trigger (a procedure executed for each new row)
- (-) not very suitable for partitioning by ID - if the table grows faster than expected, the last partition may be filled up before a new one is created
The trigger method uses a "before insert" row-level trigger that does generally the same thing as the cron script, i.e. it checks (for each new row) whether it's necessary to create a new partition, performs all the modifications of triggers and rules, etc.
The solution based on a trigger has these advantages and disadvantages:
- (+) a purely database-side solution
- (+) ideal for partitioning by ID, i.e. if we want to create partitions containing a defined number of rows (probably equal for all partitions)
- (-) not very suitable for partitioning by time, i.e. if the partitions should contain data for given time intervals (day, week, month, ...) - the solution based on cron is more suitable for this (executing a trigger for each row is an unnecessary overhead)
Cron based automation
Let's see the cron solution first, as it is a little easier to understand and set up than the trigger based automation. Let's start with partitioning by time, which fits the cron solution better than partitioning by ID, and the problems related to it.
Even though the solution is based on cron, the main part of the logic may be implemented in a PL/pgSQL procedure executed by cron. This procedure is named "manage_partitions" in all cases (partitioning by time or by ID, performed by a trigger or rules), so all you have to do in cron is something like this:
0 0 10 * * echo "SELECT manage_partitions();" | psql my_database
This example executes the "manage_partitions" procedure at midnight on the 10th day of every month.
Partitioning by time
Say we want to partition the table by time, e.g. one partition per month. In that case you have to create the partition for the next month before it begins, i.e. during the previous month, and maybe drop some very old partitions (e.g. older than two months).
So the setup script and the "manage_partitions" procedure perform these steps:
- creates a base table "table_base" used as a base for the partitions (no data will be inserted into it)
- creates a "manage_partitions" function performing the following steps:
- drops the partition for the previous month (if it exists)
- creates a partition for the current month (if it does not exist) - this actually happens only during the first execution, as a part of the initialization; from then on a partition for the current month should always exist
- creates a partition for the following month (if it does not exist)
- modifies the trigger / rules so that all existing partitions are considered (current and next month)
- executes the "manage_partitions" function so that the partitioning is initialized (a default partition is created, including the trigger or rules)
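The steps above can be sketched in PL/pgSQL. This sketch assumes partitioning by a dynamic-SQL trigger (so only the tables themselves have to be managed) and a "table_base_YYYY_MM" naming scheme - both are illustrative assumptions, not the exact code from the downloadable examples:

```sql
CREATE OR REPLACE FUNCTION manage_partitions() RETURNS void AS $$
DECLARE
    v_month DATE;
    v_name  TEXT;
BEGIN
    -- drop the partition for the previous month (if it exists)
    v_name := 'table_base_' || to_char(now() - interval '1 month', 'YYYY_MM');
    IF EXISTS (SELECT 1 FROM pg_class WHERE relname = v_name) THEN
        EXECUTE 'DROP TABLE ' || v_name;
    END IF;

    -- create partitions for the current and the following month (if missing)
    FOR i IN 0..1 LOOP
        v_month := (date_trunc('month', now()) + i * interval '1 month')::date;
        v_name  := 'table_base_' || to_char(v_month, 'YYYY_MM');
        IF NOT EXISTS (SELECT 1 FROM pg_class WHERE relname = v_name) THEN
            EXECUTE 'CREATE TABLE ' || v_name
                 || ' (CHECK (created >= DATE ''' || v_month || ''''
                 || ' AND created < DATE '''
                 || (v_month + interval '1 month')::date || '''))'
                 || ' INHERITS (table_base)';
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
```

With the dynamic-SQL trigger no rules or trigger bodies have to be rewritten here; with rules or a static trigger this function would additionally have to issue the corresponding CREATE RULE / DROP RULE or CREATE OR REPLACE FUNCTION statements.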
Partitioning by ID
In case you want to partition the table by ID, the situation is a little more complicated - in contrast to partitioning by time, it's not clear when the last partition will "expire." While for partitioning by time it's obvious the last partition expires at the end of the month, for partitioning by ID it's clear that a partition is valid e.g. for IDs between 1 and 100000, but we do not know when exactly the last ID (100000) will be used.
Sure, you may estimate how long the partition will suffice and when a new partition will be necessary, but such an estimate may be unreliable. Because of this it is important to create the partitions in advance - that's why the procedure checks how many "free" IDs are left, and if there are fewer than a defined number of them, a new partition is created.
So the script and the "manage_partitions" procedure perform these steps:
- creates a base table "table_base" used when creating the partitions (no data will be inserted into it)
- creates a "manage_partitions" function performing the following steps:
- checks if this is the first execution (no data inserted yet), and if so initializes the partitioning (creates the first partition, etc.)
- checks how many "free" IDs are left in the last partition, and if there are more than a defined number of them the function terminates, otherwise it continues with the following steps
- creates a new partition (if it does not exist)
- drops an old partition (if it exists)
- modifies the trigger / rules so that all existing partitions are considered
- executes the "manage_partitions" function to initialize the partitioning (creates a default partition, along with the partitioning trigger or rules)
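A sketch of the ID-based variant, again assuming dynamic-SQL trigger partitioning, partitions of 100000 rows named "table_base_1", "table_base_2", ..., and a safety limit of 1000 free IDs (all of these names and numbers are illustrative assumptions):

```sql
CREATE OR REPLACE FUNCTION manage_partitions() RETURNS void AS $$
DECLARE
    c_size  CONSTANT BIGINT := 100000;  -- rows per partition
    c_limit CONSTANT BIGINT := 1000;    -- create the next partition this early
    v_id    BIGINT;
    v_part  BIGINT;
BEGIN
    -- current value of the sequence feeding the "id" column
    SELECT last_value INTO v_id FROM table_base_id_seq;
    v_part := (v_id - 1) / c_size + 1;  -- partition the current ID belongs to

    -- terminate unless we are within c_limit IDs of the end of the partition
    IF (v_part * c_size) - v_id > c_limit THEN
        RETURN;
    END IF;

    -- create the next partition (if it does not exist)
    IF NOT EXISTS (SELECT 1 FROM pg_class
                   WHERE relname = 'table_base_' || (v_part + 1)) THEN
        EXECUTE 'CREATE TABLE table_base_' || (v_part + 1)
             || ' (CHECK (id > ' || (v_part * c_size)
             || ' AND id <= ' || ((v_part + 1) * c_size) || '))'
             || ' INHERITS (table_base)';
    END IF;

    -- drop the old partition (if it exists), keeping only two partitions
    EXECUTE 'DROP TABLE IF EXISTS table_base_' || (v_part - 1);
END;
$$ LANGUAGE plpgsql;
```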
Trigger based automation
Compared to the cron solution, the solution based on triggers has several very interesting features, especially thanks to the fact that a trigger receives information about each new row inserted into the table (while cron obviously does not). But of course nothing is perfect, and there are problems associated with this approach too (see below).
This method of automation is based on a trigger that checks whether it is necessary to create a new partition (and modify a trigger, rules, etc.). A few examples of possible conditions:
- the current date is less than a week from the end of the last partition
- the current date is past the end of the last partition (i.e. the row does not fit into any existing partition)
- the ID of the row is too close to the end of the last partition (e.g. less than 1000 IDs remain)
- the ID of the row is past the end of the last partition (i.e. the row does not fit into any existing partition)
As each inserted row goes through the trigger, there is no danger of the delays possible with the cron based solution described above (exhausting the space in the last partition), and the difference between partitioning by ID and by time more or less disappears.
On the other hand, using a trigger for the management introduces some differences between partitioning using rules and partitioning using a trigger.
Let's see what the trigger performing the management of partitions should do:
- check if a new partition should be created (see the examples of conditions above)
- if the condition is satisfied, perform the following steps (more or less identical to the steps described in the cron solution):
- create a new partition (if it does not exist)
- drop a partition that is too old (if it exists)
- modify the trigger / rules so that the data are inserted into the existing partitions
- during the first execution, initialize the partitioning (create the first partition and the trigger / rules)
In general the management trigger is not very different from the management procedure executed from cron - the only difference is the check whether a new partition should be created (although a similar condition is checked in the cron solution too, in the case of partitioning by ID).
But partition management based on a trigger brings several (not so obvious) problems that do not occur with the cron solution. Let's look at the three main ones.
Race conditions
The first problem is caused by the fact that while there is only one running instance of the cron script, there may be many simultaneous sessions inserting data into the partitioned table (and thus executing the trigger). This obviously leads to race conditions - just imagine this scenario:
| step | session A | session B |
|------|-----------|-----------|
| 1. | verification that a new partition should be created (true) | |
| 2. | | verification that a new partition should be created (true) |
| 3. | verification that the partition does not exist yet (does not exist) | |
| 4. | | verification that the partition does not exist yet (does not exist) |
| 5. | creation of the new partition (success) | |
| 6. | | creation of the new partition (failure, the partition was already created in step 5) |
So you have to use a mutex (lock) at a proper place, but in a way that does not hurt performance. That's why the locking is performed only after the main condition has been checked - this way it affects only the inserts executed between the start of the first insert satisfying the condition and the end of its execution (i.e. less than a second).
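In PostgreSQL such a mutex can be implemented with an advisory lock. A fragment illustrating where the lock belongs inside the management trigger (the lock key 1234, the variable names and the elided column list are illustrative assumptions):

```sql
IF v_new_partition_needed THEN        -- the cheap check, done without locking
    PERFORM pg_advisory_lock(1234);   -- only sessions that passed the check wait here
    -- re-check: another session may have created the partition in the meantime
    IF NOT EXISTS (SELECT 1 FROM pg_class WHERE relname = v_partition_name) THEN
        EXECUTE 'CREATE TABLE ' || v_partition_name
             || ' ( ... ) INHERITS (table_base)';
    END IF;
    PERFORM pg_advisory_unlock(1234);
END IF;
```

In real code the unlock should also be issued from an EXCEPTION handler, so that a failed CREATE TABLE does not leave the lock held for the rest of the session.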
Partitioning using rules
A different problem occurs when rule-based partitioning is used. Let's see in what order rules and triggers are evaluated when a query is processed:
1. the query is parsed and a "query tree" is built (not a query plan, but a syntax tree representing the parsed query - a description may be found in the PostgreSQL documentation on the rule system)
2. rules are applied to the query tree obtained in the previous step - rules are actually rewrite rules, and each matching rule produces one new query tree (so this step may return a set of trees, one for each rule)
3. each of the trees is executed (a query plan is prepared for it and executed - be aware that the WHERE condition associated with a rule is part of the query tree, and only if it holds is the rest of the query tree - e.g. the insert into another table - executed)
4. when executing each of the query trees (see the previous step), the triggers associated with the tables (partitions) are fired
But that means it's impossible to influence an already running query by creating new rules, as triggers are executed in step (4), while rules are applied in step (2). So if the partitioning is performed using rules, the management trigger must not be associated with the base table (table_base), as it would be fired only for rows not matching any of the existing partitions - and the newly created rule would not apply to the current row, so the row would be inserted into the base table (which is not desirable).
The only possible solution is to define the trigger on the last partition (not on the base table). But this might be a problem if the table is partitioned by time.
Partitioning using triggers
A third problem may occur if the partitioning is performed by a trigger. In this case there is no problem with the different phases of query processing (applying rules before executing triggers) described in the previous section, but you have to be very careful about the order of execution of the triggers (the management trigger and the partitioning trigger). The management trigger has to be executed first, before the partitioning trigger (which redirects rows into the partitions).
If the triggers were executed in the reverse order, there would be roughly the same problem as with rules (the trigger would have to be created on the last partition, and the new partition would be considered only for subsequent inserts - not for the insert already in progress).
Say a new row is inserted that does not fit into any of the existing partitions, and the triggers are executed in the reverse order, i.e.
- partitioning trigger - redirects the row into an appropriate partition
- management trigger - creates a new partition, updates the partitioning trigger because of the new partition, etc.
The problem is that the partitioning trigger is not aware of the new partition (because it did not exist yet when the partitioning decision was made). So the insert will either end with an error (if the partitioning trigger decides no data should be inserted into the base table), or it will permit the insert, but the changes made by the management trigger (definition of the new partition, modification of the partitioning trigger) won't affect it and the row will be inserted into the base table.
The solution to this problem is quite simple - execute the triggers in the right order:
- management trigger - creates a new partition, updates the partitioning trigger, ...
- partitioning trigger - redirects the row into the new partition (it's aware of it, as the management trigger was executed first)
And it's actually quite simple to do, as PostgreSQL executes triggers on a table in alphabetical order. So all you have to do is choose appropriate names (in the examples I've used "manage_partitions" and "partition_rows").
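For example, with the two triggers defined like this (the trigger function names are illustrative assumptions), "manage_partitions" sorts before "partition_rows" and therefore fires first for every inserted row:

```sql
-- triggers on one table fire in alphabetical order of their names, so
-- the management trigger runs before the partitioning trigger
CREATE TRIGGER manage_partitions BEFORE INSERT ON table_base
    FOR EACH ROW EXECUTE PROCEDURE manage_partitions_fn();

CREATE TRIGGER partition_rows BEFORE INSERT ON table_base
    FOR EACH ROW EXECUTE PROCEDURE partition_rows_fn();
```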
Notice that it's this possibility to modify one trigger (the partitioning trigger) from a trigger executed before it (the management trigger) that makes it possible to omit the "safety limit" before the partition end, which is necessary in the cron based solution. The new partition may be created exactly when it's actually needed (when the first row belonging to it is inserted).
And now several final notes ...
Automatic dropping of partitions
In the examples available for download, "old" partitions are dropped automatically, which may not be desirable in a production environment as it may cause data loss - especially in the case of partitioning by ID. It may happen that a new partition is created and an old one dropped so fast that the data are never caught by a regular backup.
So think over how many old partitions you want to keep, and rather keep more of them (you may still remove the corresponding IF conditions from the trigger, or the rules, so that the extra partitions do not hurt performance).
Frequent inserting of rows in chronological order
All the examples suppose that rows are inserted in chronological order and not too far from each other, i.e. the values in the "created" and "id" columns grow continuously and there are no huge gaps between them.
For example, when partitioning by month, if more than a month passes between inserting two rows, it's quite possible that the new row does not fit into any existing partition, resulting in an error.