OpenStats - partitioning

Just as in case of PGMon project, when collecting this kind of data there may be huge amounts of data, but these data go out quite quickly - in most cases it's not necessary to work with data older that 1 month. There are two consequences of the quick growth - decrease of performance and problems with maintenance (backup and removal of old data).

It's not surprising that, just as in case of PGMon project, partitioning of the main tables (Sessions, Actions and Parameters) is an elegant solution of these problems. But while in case of PGMon the partitioning setup was quite simple, in this case the the setup is seriously complicated due to foreign keys between the tables.

Partitioning of dependent tables

Let's talk about two tables connected by a foreign key:

CREATE TABLE A ( id INTEGER PRIMARY KEY );
CREATE TABLE B ( id INTEGER PRIMARY KEY, a_id INTEGER );

ALTER TABLE B ADD FOREIGN KEY (a_id) REFERENCES A (id);

By splitting both tables by the "id" column (i.e. A.id and B.id) in two parts in this way:

CREATE TABLE A_1 ( CHECK id <  1000000 ) INHERITRS (A);
CREATE TABLE A_2 ( CHECK id >= 1000000 ) INHERITRS (A);

CREATE TABLE B_1 ( CHECK id <  1000000 ) INHERITRS (B);
CREATE TABLE B_2 ( CHECK id >= 1000000 ) INHERITRS (B);

CREATE RULE A_insert_1 AS ON INSERT TO A WHERE (id <  1000000)
                          DO INSTEAD INSERT INTO A_1 (NEW.*);
CREATE RULE A_insert_2 AS ON INSERT TO A WHERE (id >= 1000000)
                          DO INSTEAD INSERT INTO A_2 (NEW.*);

CREATE RULE B_insert_1 AS ON INSERT TO B WHERE (id <  1000000)
                          DO INSTEAD INSERT INTO B_1 (NEW.*);
CREATE RULE B_insert_2 AS ON INSERT TO B WHERE (id >= 1000000)
                          DO INSTEAD INSERT INTO B_2 (NEW.*);

there will be two partitions for each table, but it's not possible to define referential integrity between them. For a row in the B_2 partition it's not possible to determine whether the row referenced by "a_id" column was places into partition A_1 or A_2. A trick with linking partitions B_1 and B_2 directly to table A does not work, as the child tables are not searched when checking the foreign key.

This problem may be solved in three ways:

  • accept nonexistence of referential integrity (not acceptable for me - I do care about data integrity)
  • write your own triggers to check referential integrity (quite difficult to implement correctly and optimize)
  • separate tables "in the same way" (according to same values) i.e. so that it's possible to define referential integrity between partitions

I guess I don't have to emphasize that I recommend the third option - partition the tables so that it's possible to define referential integrity between those parts. E.g. in the example listed above with tables A and B this means separating the table B according to the column B.a_id instead of B.id. So the definition of table B partitions should be

CREATE TABLE B_1 ( CHECK a_id <  1000000 ) INHERITRS (B);
CREATE TABLE B_2 ( CHECK a_id >= 1000000 ) INHERITRS (B);

CREATE RULE B_insert_1 AS ON INSERT TO B WHERE (a_id <  1000000)
                          DO INSTEAD INSERT INTO B_1 (NEW.*);
CREATE RULE B_insert_2 AS ON INSERT TO B WHERE (a_id >= 1000000)
                          DO INSTEAD INSERT INTO B_2 (NEW.*);

Due to this small modification both tables will be partitioned "in the same way" i. e. we can be sure that tables from B_1 partiton will reference only rows in partition A_1, and rows in partition B_2 will reference only rows in partition A_2, so we can define foreign keys:

ALTER TABLE B_1 ADD FOREIGN KEY (a_id) REFERENCES A_1 (id);
ALTER TABLE B_2 ADD FOREIGN KEY (a_id) REFERENCES A_2 (id);

This may be compiled into the following rule:

If the table B depends on table A, i. e. column B.a_id is a foreign key referencing the column A.id, then you should partition both tables according to this column, i. e. table A should be partitioned according to column A.id and tabe B according to column B.a_id.

Application to tables Sessions, Actions and Parameters

Let's apply the rule mentioned above to the core tables of the OpenStats project - Sessions, Actions and Parameters. This is the reason why the Sessions.id column is propagated (maybe artificially) to the Parameters table.

Table Sessions should be partitioned according to column "id":

CREATE TABLE Sessions_1 ( CHECK id =< 100000 ) INHERITS (Sessions);
CREATE TABLE Sessions_2 ( CHECK id > 100000 AND id =< 200000 ) INHERITS (Sessions);
CREATE TABLE Sessions_3 ( CHECK id > 200000 AND id =< 300000 ) INHERITS (Sessions); 
...

CREATE RULE Sessions_insert_1 AS ON INSERT TO Sessions
                              WHERE (id =<  100000)
                                 DO INSTEAD INSERT INTO Sessions_1 (NEW.*);

CREATE RULE Sessions_insert_2 AS ON INSERT TO Sessions
                              WHERE (id =<  200000 AND id > 100000)
                                 DO INSTEAD INSERT INTO Sessions_1 (NEW.*);

CREATE RULE Sessions_insert_3 AS ON INSERT TO Sessions
                              WHERE (id =<  300000 AND id > 200000)
                                 DO INSTEAD INSERT INTO Sessions_1 (NEW.*);
...

Table Actions should be partitioned according to the column "session_id":

CREATE TABLE Actions_1 ( CHECK session_id =< 100000 ) INHERITS (Actions);
CREATE TABLE Actions_2 ( CHECK session_id > 100000
                           AND session_id =< 200000 ) INHERITS (Actions);
CREATE TABLE Actions_3 ( CHECK session_id > 200000
                           AND session_id =< 300000 ) INHERITS (Actions); 
...

CREATE RULE Actions_insert_1 AS ON INSERT TO Actions
                    WHERE (session_id =<  100000)
                                 DO INSTEAD INSERT INTO Actions_1 (NEW.*);

CREATE RULE Actions_insert_2 AS ON INSERT TO Actions
                     WHERE (session_id =<  200000 AND session_id > 100000)
                                 DO INSTEAD INSERT INTO Actions_1 (NEW.*);

CREATE RULE Actions_insert_3 AS ON INSERT TO Actions
                     WHERE (session_id =<  300000 AND session_id > 200000)
                                 DO INSTEAD INSERT INTO Actions_1 (NEW.*);
...

And finally, the table Parameters table should be partitioned just as Actions table:

CREATE TABLE Parameters_1 ( CHECK session_id =< 100000 ) INHERITS (Parameters);
CREATE TABLE Parameters_2 ( CHECK session_id > 100000
                           AND session_id =< 200000 ) INHERITS (Parameters);
CREATE TABLE Parameters_3 ( CHECK session_id > 200000
                           AND session_id =< 300000 ) INHERITS (Parameters);
...

CREATE RULE Parameters_insert_1 AS ON INSERT TO Parameters
                    WHERE (session_id =<  100000)
                                 DO INSTEAD INSERT INTO Parameters_1 (NEW.*);

CREATE RULE Parameters_insert_2 AS ON INSERT TO Parameters
                     WHERE (session_id =<  200000 AND session_id > 100000)
                                 DO INSTEAD INSERT INTO Parameters_1 (NEW.*);

CREATE RULE Parameters_insert_3 AS ON INSERT TO Parameter
                     WHERE (session_id =<  300000 AND session_id > 200000)
                                 DO INSTEAD INSERT INTO Parameters_1 (NEW.*);
...

Now the partitioning is almost done - you should create new partitions continuously, either automatically or by hand.

Comments

PG Partitioning

I have the same issue but I have 12 other tables referencing the partitioned (40G : 76 million rows) one. It would be a royal pain to partition those 12 referencing tables too!

What do u think? =D

RE: PG Partitioning

That's hard to tell - the partitioning support in PostgreSQL is not perfect and in some cases it's very difficult to aplly (e.g. lack of foreign keys and problems with efficiency of some queries).

If you don't need partitioning (e.g. to delete data by dropping a partition) don't use it - that will save you a lot of problems.

But if you really need it, you can assure consistency even without foreign keys in some cases (e.g. by loading data in a given order, etc.). Or you can implement foreign keys on your own using triggers (but it's difficult to implement that correctly and efficiently).

New comment

All the comments have to be accepted, so there may be some delay between submitting and accepting (or rejecting) the comment. If you enter the e-mail address, you will be informed about acceptance or rejection.

Subject or body may not contain HTML tags - they will be automatically removed. Paragraphs may be separated using a newline (ENTER).

(optional)