Parallelizing DISTINCT estimation

When studying algorithms for DISTINCT estimation, I've noticed a quite interesting feature of one of the algorithms - adaptive sampling. It's actually quite simple to parallelize, i.e. the input dataset can be split into partitions, the estimate may be evaluated for each of the partitions (in a separate process) and these partial results are used to get the final estimate. Let's see the principle that allows such parallelization and how it might be used.

Aggregate functions for DISTINCT estimation

I've been working on implementing cross-column stats into PostgreSQL for some time (see the psql-hackers thread and wiki), and it seems that one of the necessary steps should be improving estimates of number of distinct values (thread, wiki). The first really interesting finding is that to get a reliable estimate of distinct values, you really need to read most of the table (generally all of it). The second interesting thing is that there is a lot of very interesting algorithms to continuously update the  estimate in a very limited space. The experimental implementation of the algorithms is very promising, but how to demonstrate it? Using aggregate functions!

1