OpenStats - database structure

Goal of this article is to describe the database structure used for OpenStats project in a nutshell (but with enough details). It contains description of individual tables and their columns, but an explanation of some seemingly illogical solutions.

Database of OpenStats project (resp. it's part used for collection of information), consists of the following tables:

  • sessions - list of sessions and information about the visitor
  • actions - list of actions for sessions (each session consists of one or more actions)
  • parameters - list of parameters for actions (key / value pairs)
  • pages - list of pages (each action is tied to a particular page)
  • countries - list of countries (the visitors came from)
  • languages - list of languages (the visitors are using)
  • architectures - list of architectures (e.g. x86, mips, ...)
  • browsers - list of browsers (MS IE, Firefox, Mozilla, ...)
  • operating_systems - list of operating systems (Windows XP, Linux, ...)
  • action_types - list of action types (pageview, submit form, ...)

First three tables (Sessions - Actions - Parameters) are crucial for the whole system, it's a core of the whole system and it's extended by the other tables. Let's take a look at individual tables, their structure, meaning,  usage and connection to the other tables.

Sessions

This table contains information about the whole session, i.e. basic information about the visitor - IP address, information about the browser and operating system, country, and some HTTP headers. Structure of the table is this:

name type NULL default value comment
id serial no   primary key
browser_id integer yes   deferred
os_id integer yes   deferred
arch_id integer yes   deferred
country_id char(2) yes   deferred
visitor_id integer no    
ip inet no    
http_ip_forward varchar(255) yes    
http_referer text yes    
http_user_agent varchar(255) yes    
http_language varchar(255) yes    
session_date timestamp no now()  
browser_minor varchar(16) yes    
browser_major varchar(16) yes    
processed boolean no false  

Some of the parameters (marked as "deferred" in the comment) are filled in during subsequent processing - the reason is that it may be quite time-consuming as their evaluation involves regular expressions, reverse DNS lookups, etc. Value in the "processed" column identifies if the row was already processed or not.

Columns os_id, arch_id and country_id are references to tables Architectures, Countries and Operating_systems. This table is referenced by table Actions.

Note: On several websites with high number of visitors I've applied partitioning on this table (according to the primary key), and thanks to propagation of this column it's possible to split the two other core tables (Actions a Parameters) in the same way. This seriously simplifies maintenance (backup and removal of stale data), but it may seriously improve performance of some queries during processing.

Actions

name type NULL default value comment
id serial no   primary key
session_id integer no   primary key
type_id integer no    
page_id integer no    
language char(2) no    
action_date timestamp no now()  

This table contains information about individual actions (pageview, form submit, etc.) within a session. For each action a reference to the "parent" session is stored, page where the action was performed (reference to the code-book Pages), type of the action (reference to the code-book Action_types), language of the action (2-character code of the language according to ISO 639-2 Code) and date of the action.

This table has a composite primary key (session_id, id).

Columns type_id and page_id are references to tables Action_types and Pages. This table is referenced from table Parameters.

Note: If the table Sessions is partitioned according to primary key, the table Actions may be partitioned according to "session_id" column (and table Parameters may be partitioned in the same way). Thanks to this a reference integrity may be enforced between partitions, and it may significantly simplify maintenance (backup and removal of stale data), but it may also improve performance of some queries executed when processing of the data.

Parameters

name type NULL default value comment
id serial no   primary key
session_id integer no   primary key
action_id integer no   primary key
name varchar(32) no    
value varchar(255) no    

Often it's useful to store some additional information about an action, but there are no "spare" columns in the table Actions - for example some internal parameters determined when processing the request (e.g. ID of the article displayed, information whether the processing of the submitted data was successful, etc.). These parameters may be stored in this table in the form of key-value pairs.

This table has composite primary key (session_id, action_id, id).

Note: If the tables Sessions and Actions are partitioned according to columns Sessions.id and Actions.session_id, the table Parameters may be partitioned in the same fashion, i.e. according to the "session_id" column. Thanks to this a reference integrity may be enforced between partitions, and it may significantly simplify maintenance (backup and removal of stale data), but it may also improve performance of some queries executed when processing of the data.

Pages

name type NULL default value comment
id serial no   primary key
name varchar(32) no    
description text yes    

This table is a code-book, with definition of pages of your website. In most cases a "page" means something like "a script," not individual articles (or other items) displayed by these scripts.

If you're using some CMS system, it's possible that there already is a table of pages defined - in this case it's necessary to synchronize these table (manually or automatically - e.g. using triggers) or merge them in a suitable way.

Countries

name type NULL default value comment
code char(3) no   primary key
short_code char(2) yes    
name varchar(64) no    

This table is a code-book of countries according to ISO 3166 (3-character codes are used as a primary key, but the table contains 2-characted codes too).

Languages

name type NULL default value comment
code char(3) no   primary key
short_code char(2) yes    
name varchar(32) no    

This table is a code-book of languages according to ISO 639-2 Code (3-character codes are used as a primary key, but the table contains 2-characted codes too).

Architectures

name type NULL default value comment
id serial no   primary key
name varchar(64) no    
priority integer no 0  
regexp varchar(255) no    

This table is a code-book of architectures (e.g. x86, x86_64, mips, apod.) with a definition of detection rules - regular expressions (and priorities). Architecture detection is performed by comparing the HTTP User-Agent header value against these regular expressions - from a set of matching rules the one with highest priority is used.

This table is continuously elaborated and new architectures and detection rules are added.

Browsers

name type NULL default value comment
id serial no   primary key
name varchar(128) no    
url varchar(255) yes    
regexp varchar(255) no    
version_regexp varchar(255) yes    
priority integer no 0  
is_robot boolean no false  

This table is a code-book of browsers (e.g. xMS IE, Firefox, Google Chrome, etc.) with a definition of detection rules - regular expressions (and priorities), and information whether the browser is a robot (e.g. googlebot etc.). Browser detection is performed just as a detection of an architecture - by comparing the HTTP User-Agent header value against these regular expressions - from a set of matching rules the one with highest priority is used.

This table is continuously elaborated and new browsers and detection rules are added.

Operating_systems

name type NULL default value comment
id serial no   primary key
name varchar(128) no    
url varchar(255) yes    
priority integer no 0  
regexp varchar(255) no    
version varchar(255) yes    

.This table is a code-book of operating systems (napÅ™. Windows XP, Linux, VAX/VMS, apod.) with a definition of detection rules - regular expressions (and priorities). Some operating systems may have multiple variants (e.g. there are multiple variants of Windows - XP, 2000, Vista, ...) and for these cases a "version" column exists.

Operating system detection is performed by comparing the HTTP User-Agent header value against these regular expressions - from a set of matching rules the one with highest priority is used.

This table is continuously elaborated and new operating systems and detection rules are added.

action_types

name type NULL default value comment
id serial no   primary key
name varchar(128) no    

This table is a code-book of action types (a pageview, form submit, etc.). You may create your own actions types by inserting rows into this table.

Comments

There are no comments for this article (or are awaiting acceptance).

New comment

All the comments have to be accepted, so there may be some delay between submitting and accepting (or rejecting) the comment. If you enter the e-mail address, you will be informed about acceptance or rejection.

Subject or body may not contain HTML tags - they will be automatically removed. Paragraphs may be separated using a newline (ENTER).

(optional)