Calculating quantiles and trimmed aggregates
If you've ever needed to compute median or other quantiles, you probably already know there are no built-in aggregates for that in PostgreSQL. There are several solutions available, with various level of flexibility and performance - let me name at least Joe Conway's solution based on PL/R and depesz's PL/pgSQL solution. I personally wasn't happy with any of those but as I'm really lazy, I haven't written anything on my own ... until last week. The extension (already available on pgxn) is based on a disgusting trick (more on that later), but seems to work fine, especially when it comes to performance.
When I wrote that, I realized I could use the same basic idea to implement trimmed averages - not sure if that's the right term in english, but it means something like "remove 3% of the lowest and 2% of the highest values and compute AVG/VARIANCE of the remaining data." Basically it's a neat way to get rid of the outliers. This is available on pgxn too.
So lets see how that works, what's the performance, etc.
So lets describe the available solutions first - how that works, etc. The PL/R solution is very simple
CREATE OR REPLACE FUNCTION r_quantile(float8[]) RETURNS float8[] AS ' quantile(arg1, probs = seq(0, 1, 0.25), names = FALSE) ' LANGUAGE 'plr'; CREATE AGGREGATE quantile ( sfunc = plr_array_accum, basetype = float8, stype = float8[], finalfunc = r_quantile );
So it just collects the data into an array of float8 values (using the plr_array_accum function) and then calls the r_quantile to compute the actual quantile (or actually an array of quantiles at 0, 0.25, 0.5, 0.75 and 1).
The depesz's solution is based on PL/pgSQL and uses about the same approach (collects data into a varlena-based array) except that the last step is implemented in PL/pgSQL. The original solution uses numeric data type, but we'll use float8 just to keep the implementations comparable.
CREATE OR REPLACE FUNCTION median_aggregate_f( in_array FLOAT8[] )
RETURNS FLOAT8 as $$
DECLARE
element_count INT4;
get_rows INT4 := 1;
reply FLOAT8;
BEGIN
element_count := array_upper(in_array,1) - array_lower(in_array,1);
IF element_count IS NULL THEN
RETURN NULL;
END IF;
get_rows := get_rows + ( element_count % 2 );
SELECT avg(e) INTO reply FROM (
SELECT unnest(in_array) as e
ORDER BY e
LIMIT get_rows OFFSET floor(element_count / 2)
) x;
RETURN reply;
END;
$$ language plpgsql;
CREATE AGGREGATE median ( FLOAT8) (
SFUNC = array_append,
STYPE = FLOAT8[],
FINALFUNC = median_aggregate_f,
INITCOND = '{}'
);
So it appends the data to an array and then uses a PL/pgSQL function median_aggregate_f to sort the data and get the middle one (or an average of the two).
If you ever used any of those with non-trivial amount of data, you've probably noticed they're not exactly fast. The main culprit seems to be the varlena-based array, more precisely how it behaves when accumulating new values.
My solution
What I've decided to do was to implement those in C, and to use a regular array created with palloc and extended on the fly with repalloc. This extension does not happen every time an element is added, but in blocks (currently each 1024 elements, but this can be easily changed).
You're probably asking how is this array passed to the next execution of the function and then to the final function. This is where the disgusting trick happens (sickbags on standby). The array is allocated on the first execution in the parent memory context, and the pointer is returned as an integer. Actually it's not a pointer to an array but to a structure, but that does not matter here - the principle remains the same.
I'm not going to post any code here (check this), the description sounds terryfying enough.
Update: It seems that this (passing a pointer) is not as despicable as I thought originally. Actually there's a special data type for that (called internal). And obviously it's important to prevent direct execution (not in the context of aggregate function).
Performance
So, it's time to see what is the performance compared to the previous two solutions. I'll use a simple data set - a table with a single double precision column, with 1.000, 10.000, 100.000 and 1.000.000 rows.
CREATE TABLE test_table (val double precision);
INSERT INTO test_table SELECT i FROM generate_series(1,1000) s(i)
ORDER BY random();
ANALYZE test_table;
And then run and see the performance for each of the approaches (executed repeatedly, to get rid of caching effects):
| rows | time | |
| PL/R | 1.000 | 9ms |
| PL/pgSQL | 1.000 | 6ms |
| C | 1.000 | 2ms |
| PL/R | 10.000 | 130ms |
| PL/pgSQL | 10.000 | 120ms |
| C | 10.000 | 8ms |
| PL/R | 100.000 | 346.157ms |
| PL/pgSQL | 100.000 | 350.327ms |
| C | 100.000 | 70ms |
| PL/R | 1.000.000 | - (did not complete in 10 minutes) |
| PL/pgSQL | 1.000.000 | - (did not complete in 10 minutes) |
| C | 1.000.000 | 600ms |
Lets see the results as a chart (logarithmic):
Which is actually a great result.
Other options
The quantile aggregates are defined for double precision, int and bigint data types - I'm struggling a bit with the Numeric data type. Hopefully that should be available soon. The aggregates come in two forms - you can either specify a scalar parameter or an array of scalar parameters:
-- median SELECT quantile(val, 0.5) FROM test_table; -- quartiles SELECT quantilee(val, ARRAY[0.25, 0.5, 0.75]) FROM test_table;
If you need more values (e.g. all the quartiles), you should definitely use the other form because it may save considerable amount of memory and CPU time as it needs to collect and sort just a single instance of the array.
Trimmed aggregates
The very same trick (passing pointer) was used to implement trimmed aggregates, mirroring the usual aggregates, i.e. AVG, VARIANCE and STDDEV - this table illustrates the relation
| original aggregate | trimmed aggregate |
| AVG | AVG_TRIMMED |
| VAR_TRIMMED | |
| VARIANCE, VAR_SAMP | VAR_SAMP_TRIMMED |
| VAR_POP_TRIMMED | |
| STDDEV_TRIMMED | |
| STDDEV, STDDEV_SAMP | STDDEV_SAMP_TRIMMED |
| STDDEV_POP | STDDEV_POP_TRIMMED |
Not that there's not a matching aggregate for VAR_TRIMMED and STDDEV_TRIMMED, because the VARIANCE and STDDEV aggregates are just estimates.
Using those aggregates is very simple - you can specify how many lowest/highest values to trim. So for example to trim 5% of the lowest and 10% of the highest values, and compute average of the remaining values, you can do this
SELECT avg_trimmed(val, 0.05, 0.1) FROM test_table;
and by specifying 0 for both parameters you'll get aggregate of the whole sample
SELECT avg_trimmed(val, 0, 0) FROM test_table;
That's all. There's a bit more details in README.






Any reason you've picked the GPL instead of a more liberal license like MIT or BSD?
No particular reason, I generally prefer copyleft licenses. Is that a problem? I'm the only author of this so I can change the license at will (to a more permissive one) if needed.