OpenStats - partitioning
Just as in case of PGMon project, when collecting this kind of data there may be huge amounts of data, but these data go out quite quickly - in most cases it's not necessary to work with data older that 1 month. There are two consequences of the quick growth - decrease of performance and problems with maintenance (backup and removal of old data).
It's not surprising that, just as in case of PGMon project, partitioning of the main tables (Sessions, Actions and Parameters) is an elegant solution of these problems. But while in case of PGMon the partitioning setup was quite simple, in this case the the setup is seriously complicated due to foreign keys between the tables.
Partitioning of dependent tables
Let's talk about two tables connected by a foreign key:
CREATE TABLE A ( id INTEGER PRIMARY KEY ); CREATE TABLE B ( id INTEGER PRIMARY KEY, a_id INTEGER ); ALTER TABLE B ADD FOREIGN KEY (a_id) REFERENCES A (id);
By splitting both tables by the "id" column (i.e. A.id and B.id) in two parts in this way:
CREATE TABLE A_1 ( CHECK id < 1000000 ) INHERITRS (A);
CREATE TABLE A_2 ( CHECK id >= 1000000 ) INHERITRS (A);
CREATE TABLE B_1 ( CHECK id < 1000000 ) INHERITRS (B);
CREATE TABLE B_2 ( CHECK id >= 1000000 ) INHERITRS (B);
CREATE RULE A_insert_1 AS ON INSERT TO A WHERE (id < 1000000)
DO INSTEAD INSERT INTO A_1 (NEW.*);
CREATE RULE A_insert_2 AS ON INSERT TO A WHERE (id >= 1000000)
DO INSTEAD INSERT INTO A_2 (NEW.*);
CREATE RULE B_insert_1 AS ON INSERT TO B WHERE (id < 1000000)
DO INSTEAD INSERT INTO B_1 (NEW.*);
CREATE RULE B_insert_2 AS ON INSERT TO B WHERE (id >= 1000000)
DO INSTEAD INSERT INTO B_2 (NEW.*);
there will be two partitions for each table, but it's not possible to define referential integrity between them. For a row in the B_2 partition it's not possible to determine whether the row referenced by "a_id" column was places into partition A_1 or A_2. A trick with linking partitions B_1 and B_2 directly to table A does not work, as the child tables are not searched when checking the foreign key.
This problem may be solved in three ways:
- accept nonexistence of referential integrity (not acceptable for me - I do care about data integrity)
- write your own triggers to check referential integrity (quite difficult to implement correctly and optimize)
- separate tables "in the same way" (according to same values) i.e. so that it's possible to define referential integrity between partitions
I guess I don't have to emphasize that I recommend the third option - partition the tables so that it's possible to define referential integrity between those parts. E.g. in the example listed above with tables A and B this means separating the table B according to the column B.a_id instead of B.id. So the definition of table B partitions should be
CREATE TABLE B_1 ( CHECK a_id < 1000000 ) INHERITRS (B);
CREATE TABLE B_2 ( CHECK a_id >= 1000000 ) INHERITRS (B);
CREATE RULE B_insert_1 AS ON INSERT TO B WHERE (a_id < 1000000)
DO INSTEAD INSERT INTO B_1 (NEW.*);
CREATE RULE B_insert_2 AS ON INSERT TO B WHERE (a_id >= 1000000)
DO INSTEAD INSERT INTO B_2 (NEW.*);
Due to this small modification both tables will be partitioned "in the same way" i. e. we can be sure that tables from B_1 partiton will reference only rows in partition A_1, and rows in partition B_2 will reference only rows in partition A_2, so we can define foreign keys:
ALTER TABLE B_1 ADD FOREIGN KEY (a_id) REFERENCES A_1 (id); ALTER TABLE B_2 ADD FOREIGN KEY (a_id) REFERENCES A_2 (id);
This may be compiled into the following rule:
If the table B depends on table A, i. e. column B.a_id is a foreign key referencing the column A.id, then you should partition both tables according to this column, i. e. table A should be partitioned according to column A.id and tabe B according to column B.a_id.
Application to tables Sessions, Actions and Parameters
Let's apply the rule mentioned above to the core tables of the OpenStats project - Sessions, Actions and Parameters. This is the reason why the Sessions.id column is propagated (maybe artificially) to the Parameters table.
Table Sessions should be partitioned according to column "id":
CREATE TABLE Sessions_1 ( CHECK id =< 100000 ) INHERITS (Sessions);
CREATE TABLE Sessions_2 ( CHECK id > 100000 AND id =< 200000 ) INHERITS (Sessions);
CREATE TABLE Sessions_3 ( CHECK id > 200000 AND id =< 300000 ) INHERITS (Sessions);
...
CREATE RULE Sessions_insert_1 AS ON INSERT TO Sessions
WHERE (id =< 100000)
DO INSTEAD INSERT INTO Sessions_1 (NEW.*);
CREATE RULE Sessions_insert_2 AS ON INSERT TO Sessions
WHERE (id =< 200000 AND id > 100000)
DO INSTEAD INSERT INTO Sessions_1 (NEW.*);
CREATE RULE Sessions_insert_3 AS ON INSERT TO Sessions
WHERE (id =< 300000 AND id > 200000)
DO INSTEAD INSERT INTO Sessions_1 (NEW.*);
...
Table Actions should be partitioned according to the column "session_id":
CREATE TABLE Actions_1 ( CHECK session_id =< 100000 ) INHERITS (Actions);
CREATE TABLE Actions_2 ( CHECK session_id > 100000
AND session_id =< 200000 ) INHERITS (Actions);
CREATE TABLE Actions_3 ( CHECK session_id > 200000
AND session_id =< 300000 ) INHERITS (Actions);
...
CREATE RULE Actions_insert_1 AS ON INSERT TO Actions
WHERE (session_id =< 100000)
DO INSTEAD INSERT INTO Actions_1 (NEW.*);
CREATE RULE Actions_insert_2 AS ON INSERT TO Actions
WHERE (session_id =< 200000 AND session_id > 100000)
DO INSTEAD INSERT INTO Actions_1 (NEW.*);
CREATE RULE Actions_insert_3 AS ON INSERT TO Actions
WHERE (session_id =< 300000 AND session_id > 200000)
DO INSTEAD INSERT INTO Actions_1 (NEW.*);
...
And finally, the table Parameters table should be partitioned just as Actions table:
CREATE TABLE Parameters_1 ( CHECK session_id =< 100000 ) INHERITS (Parameters);
CREATE TABLE Parameters_2 ( CHECK session_id > 100000
AND session_id =< 200000 ) INHERITS (Parameters);
CREATE TABLE Parameters_3 ( CHECK session_id > 200000
AND session_id =< 300000 ) INHERITS (Parameters);
...
CREATE RULE Parameters_insert_1 AS ON INSERT TO Parameters
WHERE (session_id =< 100000)
DO INSTEAD INSERT INTO Parameters_1 (NEW.*);
CREATE RULE Parameters_insert_2 AS ON INSERT TO Parameters
WHERE (session_id =< 200000 AND session_id > 100000)
DO INSTEAD INSERT INTO Parameters_1 (NEW.*);
CREATE RULE Parameters_insert_3 AS ON INSERT TO Parameter
WHERE (session_id =< 300000 AND session_id > 200000)
DO INSTEAD INSERT INTO Parameters_1 (NEW.*);
...
Now the partitioning is almost done - you should create new partitions continuously, either automatically or by hand.




