Aggregate functions for DISTINCT estimation

I've been working on implementing cross-column statistics in PostgreSQL for some time (see the pgsql-hackers thread and wiki), and it seems that one of the necessary steps is improving the estimates of the number of distinct values (thread, wiki). The first really interesting finding is that to get a reliable estimate of the number of distinct values, you really need to read most of the table (generally all of it). The second interesting thing is that there are a lot of very interesting algorithms that can continuously update the estimate in a very limited space. The experimental implementation of these algorithms is very promising, but how to demonstrate it? Using aggregate functions!

The aggregate functions presented below are a "side effect" of the effort to implement better estimates of the number of distinct values - but they're a nice way to present the advantages and disadvantages of the algorithms. Be aware that the results are estimates, not exact numbers, so if you need the exact number of distinct values, this is not a suitable solution.

Most of the algorithms are probabilistic in nature - while the usual "precise" algorithms use data structures like a hash table, a trie or a sorted set, these algorithms are based on bitmaps where the bits are set with some probability. This is not exactly true for the Bloom Counter (based on a Bloom filter), Adaptive Sampling (based on Wegman's adaptive sampling) and the very different Rough Estimator.
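To get a feel for the bitmap idea, here is a toy, single-bitmap version of Probabilistic Counting written in plain SQL. The table t and column val are just placeholders, and the real aggregates are C code that uses many such bitmaps plus stochastic averaging, so take this as an illustration of the principle only:

-- toy illustration only; the actual aggregates are implemented in C
WITH hashed AS (
    -- non-negative 32-bit hash of every value
    SELECT hashtext(val)::bigint & 4294967295 AS h
      FROM t
     WHERE val IS NOT NULL
),
sketch AS (
    -- OR together the lowest set bit of every hash; bit k ends up set
    -- with probability 1 - (1 - 2^-(k+1))^n for n distinct values
    SELECT bit_or(h & (-h)) AS bitmap
      FROM hashed
     WHERE h <> 0
)
-- the lowest unset bit of the bitmap (a power of two), divided by the
-- magic constant 0.77351, is a rough estimate of the distinct count
SELECT ((~bitmap) & (-(~bitmap))) / 0.77351 AS estimated_distinct
  FROM sketch;

A single bitmap like this is quite noisy; the real algorithms keep the error down by averaging over many of them.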

What advantages do these algorithms have? First of all, it's not necessary to store the complete set of distinct values, which results in very limited memory and CPU requirements. Another advantage is that it's possible to continuously update the estimator with newly inserted rows (which is handy especially for internal statistics, not so much for aggregates). And finally, the results are quite precise, and it's possible to lower the error rate in exchange for more memory.

So if you can live with estimates of COUNT(DISTINCT), read on - let's look at two simple benchmarks first (demonstrating the precision and requirements of the algorithms), then we'll see how to install the aggregates, and finally a short conclusion.

Benchmarks

There are some very simple benchmarks in the repository (the build.sh script builds executables ending with "-bench"), but those are something different. The benchmarks described here are executed directly from SQL, so you can compare them with COUNT(DISTINCT).

The benchmarks are based on a very simple table with one column ("text" data type in most cases, but in the case of the Rough Estimator it's "int"), which is filled with a chosen number of tuples and distinct values. If you're interested, download benchmark.sql - then just set the variables "c" (number of tuples) and "n" (number of distinct values).

db=# \set c 1000000
db=# \set n 100
db=# \i benchmark.sql

The output can be easily redirected to a file ("\o").
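If you just want to see the shape of such a benchmark without downloading the script, the setup looks roughly like this - a sketch only, with table and column names of my own choosing, not necessarily the ones benchmark.sql uses:

-- illustrative only; benchmark.sql may use different names and steps
DROP TABLE IF EXISTS bench_table;
CREATE TABLE bench_table (val text);

-- :c tuples with :n distinct values (psql variables set via \set above)
INSERT INTO bench_table
SELECT 'value-' || (i % :n) FROM generate_series(1, :c) AS s(i);

-- exact count vs. one of the estimators
SELECT COUNT(DISTINCT val) FROM bench_table;
SELECT ADAPTIVE_DISTINCT(val) FROM bench_table;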

Benchmark no. 1

The first benchmark works with one million tuples, each row containing a distinct value.

algorithm               result    error   time (ms)   memory (B)
COUNT(DISTINCT)         1000000   0%      14026       ?
Adaptive Sampling       1008128   0.8%    8653        57627
Self-learning Bitmap    991651    0.9%    1151        65571
Bloom filter            788052    22%     2400        1198164
Probabilistic Counting  1139925   14%     3613        95
PCSA                    841735    16%     842         495

And now a separate table for the Rough Estimator (it works with integer values, not text):

algorithm        result    error   time (ms)   memory (B)
COUNT(DISTINCT)  1000000   0%      1775        ?
Rough Estimator  524288    48%     3216        96


Benchmark no. 2

The second benchmark works with one million tuples too, but this time there are only 100 distinct values (i.e. about 10,000 rows for each value). First a table for the aggregate functions working with text values:

algorithm               result   error   time (ms)   memory (B)
COUNT(DISTINCT)         100      0%      75306       ?
Adaptive Sampling       100      0%      1491        57627
Self-learning Bitmap    101      1%      1031        65571
Bloom filter            100      0%      1675        1198164
Probabilistic Counting  95       5%      3643        95
PCSA                    98       2%      852         495

And now a separate table for the Rough Estimator (it works with integer values, not text):

algorithm        result   error   time (ms)   memory (B)
COUNT(DISTINCT)  100      0%      1778        ?
Rough Estimator  128      28%     3113        96


Install

The repository is available on GitHub - just clone it:

$ git clone git://github.com/tvondra/pg_estimator

or you can download a tar archive. Extract it, go to the "distinct" directory and run the "build-agg.sh" script, which compiles the shared libraries. It's quite primitive - no autoconf/automake/make, just a simple shell script.

Be careful to configure the environment properly; in particular, check that pg_config returns the correct paths - the script uses "pg_config --includedir-server" to locate the necessary header files.

$ pg_config --includedir-server
/var/postgresql/9.1a3/include/postgresql/server
$ sh build-agg.sh
$ ls *.so
adaptive.so  bitmap.so  bloom.so  pcsa.so  probabilistic.so  rough.so

And then just copy the shared libraries to the proper directory (pkglibdir):

$ cp *.so `pg_config --pkglibdir`

And now just execute the SQL script that creates the aggregate functions:

$ psql testdb < install-agg.sql

If everything worked fine, you can start playing with the aggregate functions:

  • ADAPTIVE_DISTINCT(text)
  • BITMAP_DISTINCT(text)
  • BLOOM_DISTINCT(text)
  • PROBABILISTIC_DISTINCT(text)
  • PCSA_DISTINCT(text)
  • ROUGH_DISTINCT(int)
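For example, on a table like the illustrative one from the benchmark section (again, the table and column names are just placeholders):

SELECT COUNT(DISTINCT val)     AS exact,
       ADAPTIVE_DISTINCT(val)  AS adaptive,
       BITMAP_DISTINCT(val)    AS bitmap,
       PCSA_DISTINCT(val)      AS pcsa
  FROM bench_table;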

If you notice something suspicious, please let me know. It's all very fresh, so I guess there are some bugs I haven't noticed yet.

Conclusion

It's obvious that most of the algorithms work surprisingly well - they're much faster than a plain COUNT(DISTINCT), they require a very limited amount of memory, and the results are very good. In the case of Adaptive Sampling and the Self-learning Bitmap I'd even use the word "excellent."

As for the PCSA algorithm (and Probabilistic Counting, which is a simplified version of PCSA), keep in mind that this was the first algorithm of its kind, and most of the other algorithms build on it and improve it in various ways. Still, it gives very good results, especially when you consider the minimal amount of memory required (95 or 495 B).

In the case of the estimator based on a Bloom filter, the situation is a bit more complicated. A Bloom filter is designed to decide whether an element is a member of a set, so that (a) it produces no false negatives and (b) the false positive rate is limited (a lower error rate requires more memory). Counting only the elements the filter reports as "not seen yet" therefore results in underestimates, but that should be fixable using a correction computed from the number of elements and the false positive rate (not done in the available implementation).
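For example, one well-known way is to estimate the cardinality directly from how full the filter is: with m bits, k hash functions and X bits set after processing the data, the number of distinct elements is roughly -(m/k) * ln(1 - X/m). In SQL, with purely made-up numbers:

-- m, k and X below are hypothetical values, just to show the formula
SELECT -(m / k) * ln(1 - x / m) AS estimated_distinct
  FROM (VALUES (8000000.0, 4.0, 2000000.0)) AS f(m, k, x);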

The "Rough Estimator" results are a bit disappointing, as this algorithm is described in the paper "An Optimal Algorithm for the Distinct Elements Problem" - but it's also the most complex of the algorithms and the implementation is not complete yet, so there's a chance it will improve.

Comments

missing link destination

the link to http://www.fuzzy.cz/files/benchmark.sql does not work :(

RE: missing link destination

Ooops :-(

The correct URL should be http://www.fuzzy.cz/download/benchmark.sql (corrected in the text).

What about LogLog, HyperLogLog?

It's interesting that you haven't mentioned/implemented the LogLog and HyperLogLog algorithms, since they seem to be the most recent research results.

RE: What about LogLog, HyperLogLog?

Frankly, this was a "fun project" and I somehow missed those two algorithms - thanks for the message. I do have a prototype implementation of LogLog, and HyperLogLog is not much different - the difference is mostly in how the final estimate is computed. I'll polish the algorithms in the next few days and then push them to GitHub.

The results are interesting, although for low cardinalities the errors are quite significant (but maybe there's a bug in the current implementation).
