Calculating quantiles and trimmed aggregates

If you've ever needed to compute median or other quantiles, you probably already know there are no built-in aggregates for that in PostgreSQL. There are several solutions available, with various level of flexibility and performance - let me name at least Joe Conway's solution based on PL/R and depesz's PL/pgSQL solution. I personally wasn't happy with any of those but as I'm really lazy, I haven't written anything on my own ... until last week. The extension (already available on pgxn) is based on a disgusting trick (more on that later), but seems to work fine, especially when it comes to performance.

When I wrote that, I realized I could use the same basic idea to implement trimmed averages - not sure if that's the right term in english, but it means something like "remove 3% of the lowest and 2% of the highest values and compute AVG/VARIANCE of the remaining data." Basically it's a neat way to get rid of the outliers. This is available on pgxn too.

So lets see how that works, what's the performance, etc.

So lets describe the available solutions first - how that works, etc. The PL/R solution is very simple

CREATE OR REPLACE FUNCTION r_quantile(float8[]) RETURNS float8[] AS '
    quantile(arg1, probs = seq(0, 1, 0.25), names = FALSE)
' LANGUAGE 'plr';

CREATE AGGREGATE quantile (
    sfunc = plr_array_accum,
    basetype = float8,
    stype = float8[],
   finalfunc = r_quantile
);

So it just collects the data into an array of float8 values (using the plr_array_accum function) and then calls the r_quantile to compute the actual quantile (or actually an array of quantiles at 0, 0.25, 0.5, 0.75 and 1).

The depesz's solution is based on PL/pgSQL and uses about the same approach (collects data into a varlena-based array) except that the last step is implemented in PL/pgSQL. The original solution uses numeric data type, but we'll use float8 just to keep the implementations comparable.

CREATE OR REPLACE FUNCTION median_aggregate_f( in_array FLOAT8[] )
  RETURNS FLOAT8 as $$
DECLARE
  element_count INT4;
  get_rows INT4 := 1;
  reply FLOAT8;
BEGIN

  element_count := array_upper(in_array,1) - array_lower(in_array,1);
 
  IF element_count IS NULL THEN
    RETURN NULL;
  END IF;
 
  get_rows := get_rows + ( element_count % 2 );
 
  SELECT avg(e) INTO reply FROM (
    SELECT unnest(in_array) as e
      ORDER BY e
      LIMIT get_rows OFFSET floor(element_count / 2)
  ) x;
 
  RETURN reply;

END;
$$ language plpgsql;
 
CREATE AGGREGATE median ( FLOAT8) (
  SFUNC = array_append,
  STYPE = FLOAT8[],
  FINALFUNC = median_aggregate_f,
  INITCOND = '{}'
);

So it appends the data to an array and then uses a PL/pgSQL function median_aggregate_f to sort the data and get the middle one (or an average of the two).

If you ever used any of those with non-trivial amount of data, you've probably noticed they're not exactly fast. The main culprit seems to be the varlena-based array, more precisely how it behaves when accumulating new values.

My solution

What I've decided to do was to implement those in C, and to use a regular array created with palloc and extended on the fly with repalloc. This extension does not happen every time an element is added, but in blocks (currently each 1024 elements, but this can be easily changed).

You're probably asking how is this array passed to the next execution of the function and then to the final function. This is where the disgusting trick happens (sickbags on standby). The array is allocated on  the first execution in the parent memory context, and the pointer is returned as an integer. Actually it's not a pointer to an array but to a structure, but that does not matter here - the principle remains the same.

I'm not going to post any code here (check this), the description sounds terryfying enough.

Update: It seems that this (passing a pointer) is not as despicable as I thought originally. Actually there's a special data type for that (called internal). And obviously it's important to prevent direct execution (not in the context of aggregate function).

Performance

So, it's time to see what is the performance compared to the previous two solutions. I'll use a simple data set - a table with a single double precision column, with 1.000, 10.000, 100.000 and 1.000.000 rows.

CREATE TABLE test_table (val double precision);
INSERT INTO test_table SELECT i FROM generate_series(1,1000) s(i)
                        ORDER BY random();
ANALYZE test_table;

And then run and see the performance for each of the approaches (executed repeatedly, to get rid of caching effects):

  rows time
PL/R 1.000 9ms
PL/pgSQL 1.000 6ms
C 1.000 2ms
PL/R 10.000 130ms
PL/pgSQL 10.000 120ms
C 10.000 8ms
PL/R 100.000 346.157ms
PL/pgSQL 100.000 350.327ms
C 100.000 70ms
PL/R 1.000.000 - (did not complete in 10 minutes)
PL/pgSQL 1.000.000 - (did not complete in 10 minutes)
C 1.000.000 600ms

Lets see the results as a chart (logarithmic):

Which is actually a great result.

Other options

The quantile aggregates are defined for double precision, int and bigint data types - I'm struggling a bit with the Numeric data type. Hopefully that should be available soon. The aggregates come in two forms - you can either specify a scalar parameter or an array of scalar parameters:

-- median
SELECT quantile(val, 0.5) FROM test_table;

-- quartiles
SELECT quantilee(val, ARRAY[0.25, 0.5, 0.75]) FROM test_table;

If you need more values (e.g. all the quartiles), you should definitely use the other form because it may save considerable amount of memory and CPU time as it needs to collect and sort just a single instance of the array.

Trimmed aggregates

The very same trick (passing pointer) was used to implement trimmed aggregates, mirroring the usual aggregates, i.e. AVG, VARIANCE and STDDEV - this table illustrates the relation

original aggregate trimmed aggregate
AVG AVG_TRIMMED
  VAR_TRIMMED
VARIANCE, VAR_SAMP VAR_SAMP_TRIMMED
  VAR_POP_TRIMMED
  STDDEV_TRIMMED
STDDEV, STDDEV_SAMP STDDEV_SAMP_TRIMMED
STDDEV_POP STDDEV_POP_TRIMMED

Not that there's not a matching aggregate for VAR_TRIMMED and STDDEV_TRIMMED, because the VARIANCE and STDDEV aggregates are just estimates.

Using those aggregates is very simple - you can specify how many lowest/highest values to trim. So for example to trim 5% of the lowest and 10% of the highest values, and compute average of the remaining values, you can do this

SELECT avg_trimmed(val, 0.05, 0.1) FROM test_table;

and by specifying 0 for both parameters you'll get aggregate of the whole sample

SELECT avg_trimmed(val, 0, 0) FROM test_table;

That's all. There's a bit more details in README.

Comments

License?

Any reason you've picked the GPL instead of a more liberal license like MIT or BSD?

RE: License?

No particular reason, I generally prefer copyleft licenses. Is that a problem? I'm the only author of this so I can change the license at will (to a more permissive one) if needed.

New comment

All the comments have to be accepted, so there may be some delay between submitting and accepting (or rejecting) the comment. If you enter the e-mail address, you will be informed about acceptance or rejection.

Subject or body may not contain HTML tags - they will be automatically removed. Paragraphs may be separated using a newline (ENTER).

(optional)