OpenStats - database structure
Goal of this article is to describe the database structure used for OpenStats project in a nutshell (but with enough details). It contains description of individual tables and their columns, but an explanation of some seemingly illogical solutions.
Database of OpenStats project (resp. it's part used for collection of information), consists of the following tables:
- sessions - list of sessions and information about the visitor
- actions - list of actions for sessions (each session consists of one or more actions)
- parameters - list of parameters for actions (key / value pairs)
- pages - list of pages (each action is tied to a particular page)
- countries - list of countries (the visitors came from)
- languages - list of languages (the visitors are using)
- architectures - list of architectures (e.g. x86, mips, ...)
- browsers - list of browsers (MS IE, Firefox, Mozilla, ...)
- operating_systems - list of operating systems (Windows XP, Linux, ...)
- action_types - list of action types (pageview, submit form, ...)
First three tables (Sessions - Actions - Parameters) are crucial for the whole system, it's a core of the whole system and it's extended by the other tables. Let's take a look at individual tables, their structure, meaning, usage and connection to the other tables.
Sessions
This table contains information about the whole session, i.e. basic information about the visitor - IP address, information about the browser and operating system, country, and some HTTP headers. Structure of the table is this:
| name | type | NULL | default value | comment |
|---|---|---|---|---|
| id | serial | no | primary key | |
| browser_id | integer | yes | deferred | |
| os_id | integer | yes | deferred | |
| arch_id | integer | yes | deferred | |
| country_id | char(2) | yes | deferred | |
| visitor_id | integer | no | ||
| ip | inet | no | ||
| http_ip_forward | varchar(255) | yes | ||
| http_referer | text | yes | ||
| http_user_agent | varchar(255) | yes | ||
| http_language | varchar(255) | yes | ||
| session_date | timestamp | no | now() | |
| browser_minor | varchar(16) | yes | ||
| browser_major | varchar(16) | yes | ||
| processed | boolean | no | false |
Some of the parameters (marked as "deferred" in the comment) are filled in during subsequent processing - the reason is that it may be quite time-consuming as their evaluation involves regular expressions, reverse DNS lookups, etc. Value in the "processed" column identifies if the row was already processed or not.
Columns os_id, arch_id and country_id are references to tables Architectures, Countries and Operating_systems. This table is referenced by table Actions.
Note: On several websites with high number of visitors I've applied partitioning on this table (according to the primary key), and thanks to propagation of this column it's possible to split the two other core tables (Actions a Parameters) in the same way. This seriously simplifies maintenance (backup and removal of stale data), but it may seriously improve performance of some queries during processing.
Actions
| name | type | NULL | default value | comment |
|---|---|---|---|---|
| id | serial | no | primary key | |
| session_id | integer | no | primary key | |
| type_id | integer | no | ||
| page_id | integer | no | ||
| language | char(2) | no | ||
| action_date | timestamp | no | now() |
This table contains information about individual actions (pageview, form submit, etc.) within a session. For each action a reference to the "parent" session is stored, page where the action was performed (reference to the code-book Pages), type of the action (reference to the code-book Action_types), language of the action (2-character code of the language according to ISO 639-2 Code) and date of the action.
This table has a composite primary key (session_id, id).
Columns type_id and page_id are references to tables Action_types and Pages. This table is referenced from table Parameters.
Note: If the table Sessions is partitioned according to primary key, the table Actions may be partitioned according to "session_id" column (and table Parameters may be partitioned in the same way). Thanks to this a reference integrity may be enforced between partitions, and it may significantly simplify maintenance (backup and removal of stale data), but it may also improve performance of some queries executed when processing of the data.
Parameters
| name | type | NULL | default value | comment |
|---|---|---|---|---|
| id | serial | no | primary key | |
| session_id | integer | no | primary key | |
| action_id | integer | no | primary key | |
| name | varchar(32) | no | ||
| value | varchar(255) | no |
Often it's useful to store some additional information about an action, but there are no "spare" columns in the table Actions - for example some internal parameters determined when processing the request (e.g. ID of the article displayed, information whether the processing of the submitted data was successful, etc.). These parameters may be stored in this table in the form of key-value pairs.
This table has composite primary key (session_id, action_id, id).
Note: If the tables Sessions and Actions are partitioned according to columns Sessions.id and Actions.session_id, the table Parameters may be partitioned in the same fashion, i.e. according to the "session_id" column. Thanks to this a reference integrity may be enforced between partitions, and it may significantly simplify maintenance (backup and removal of stale data), but it may also improve performance of some queries executed when processing of the data.
Pages
| name | type | NULL | default value | comment |
|---|---|---|---|---|
| id | serial | no | primary key | |
| name | varchar(32) | no | ||
| description | text | yes |
This table is a code-book, with definition of pages of your website. In most cases a "page" means something like "a script," not individual articles (or other items) displayed by these scripts.
If you're using some CMS system, it's possible that there already is a table of pages defined - in this case it's necessary to synchronize these table (manually or automatically - e.g. using triggers) or merge them in a suitable way.
Countries
| name | type | NULL | default value | comment |
|---|---|---|---|---|
| code | char(3) | no | primary key | |
| short_code | char(2) | yes | ||
| name | varchar(64) | no |
This table is a code-book of countries according to ISO 3166 (3-character codes are used as a primary key, but the table contains 2-characted codes too).
Languages
| name | type | NULL | default value | comment |
|---|---|---|---|---|
| code | char(3) | no | primary key | |
| short_code | char(2) | yes | ||
| name | varchar(32) | no |
This table is a code-book of languages according to ISO 639-2 Code (3-character codes are used as a primary key, but the table contains 2-characted codes too).
Architectures
| name | type | NULL | default value | comment |
|---|---|---|---|---|
| id | serial | no | primary key | |
| name | varchar(64) | no | ||
| priority | integer | no | 0 | |
| regexp | varchar(255) | no |
This table is a code-book of architectures (e.g. x86, x86_64, mips, apod.) with a definition of detection rules - regular expressions (and priorities). Architecture detection is performed by comparing the HTTP User-Agent header value against these regular expressions - from a set of matching rules the one with highest priority is used.
This table is continuously elaborated and new architectures and detection rules are added.
Browsers
| name | type | NULL | default value | comment |
|---|---|---|---|---|
| id | serial | no | primary key | |
| name | varchar(128) | no | ||
| url | varchar(255) | yes | ||
| regexp | varchar(255) | no | ||
| version_regexp | varchar(255) | yes | ||
| priority | integer | no | 0 | |
| is_robot | boolean | no | false |
This table is a code-book of browsers (e.g. xMS IE, Firefox, Google Chrome, etc.) with a definition of detection rules - regular expressions (and priorities), and information whether the browser is a robot (e.g. googlebot etc.). Browser detection is performed just as a detection of an architecture - by comparing the HTTP User-Agent header value against these regular expressions - from a set of matching rules the one with highest priority is used.
This table is continuously elaborated and new browsers and detection rules are added.
Operating_systems
| name | type | NULL | default value | comment |
|---|---|---|---|---|
| id | serial | no | primary key | |
| name | varchar(128) | no | ||
| url | varchar(255) | yes | ||
| priority | integer | no | 0 | |
| regexp | varchar(255) | no | ||
| version | varchar(255) | yes |
.This table is a code-book of operating systems (napÅ™. Windows XP, Linux, VAX/VMS, apod.) with a definition of detection rules - regular expressions (and priorities). Some operating systems may have multiple variants (e.g. there are multiple variants of Windows - XP, 2000, Vista, ...) and for these cases a "version" column exists.
Operating system detection is performed by comparing the HTTP User-Agent header value against these regular expressions - from a set of matching rules the one with highest priority is used.
This table is continuously elaborated and new operating systems and detection rules are added.
action_types
| name | type | NULL | default value | comment |
|---|---|---|---|---|
| id | serial | no | primary key | |
| name | varchar(128) | no |
This table is a code-book of action types (a pageview, form submit, etc.). You may create your own actions types by inserting rows into this table.




