Parallelizing DISTINCT estimation

When studying algorithms for DISTINCT estimation, I've noticed a quite interesting feature of one of the algorithms - adaptive sampling. It's actually quite simple to parallelize, i.e. the input dataset can be split into partitions, the estimate may be evaluated for each of the partitions (in a separate process) and these partial results are used to get the final estimate. Let's see the principle that allows such parallelization and how it might be used.

This article is not meant as a thorough description or analysis of the adaptive sampling algorithm, that is available in the "On Adaptive Sampling" paper. For explanation of the parallelization a very brief explanation, available in the next chapter, is fully sufficient.

Basic description of the algorithm

The whole algorithm works with a simple data structure called AdaptiveCounter

struct AdaptiveCounter {
   int level;
   int items;
   int maxItems;
   unsigned char bitmap[1];
};

that keeps current info about the state of the estimator. Individual fiels have this meaning:

  • level - the level of the estimator (the meaning will be obvious later), default value is 0
  • items - current number of items in the list
  • maxItems - max allowed items in the list (determines size of the bitmap)
  • bitmap - list of items (their hashes) in the form of a bitamp, empty at the beginning

The maximal number of items in the list (i.e. length of the bitmap used to keep the list) is determined by a requested precision of the estimates. The data structure used in the implementation (see adaptive.c @ pg_estimator) is a bit more complex, but for purposes of this article the simplified version is fully sufficient.

Adding a new item into the estimator works like this:

  1. compute hash of the item and check if the "level" bits at the beginning are all 1
  2. if not, terminate
  3. if yes, add the hash into the "bitmap" list (copy the hash into the bitmap)
  4. if the list is full (items == maxItems), increment the (level = level + 1) and remove the items that do not match the filter (check the new bit at the level-th position)

The "filtering" of items added to the list (bitmap), that says that the list contains just values with first "level" bits equal to 1, is the basic idea behind the adaptive sampling estimator. It means that (if the hashing produces evenly distributed values), the probability of adding an item into the list is 1/(2^level).

This decreasing probability results into the following estimate of the number of unique values, based on the current "level" and current number of values in the list:

estimate = current number of items * (2^level)

Obviously, the algorithm is very simple (although very precise and robust).

Merge

The algorithm has one very nice feature, quite uncommon when dealing with the other algorithms, that may be efficiently used to parallelize the algorithm. It's possible to "merge" two estimators with the same features without loosing any precision.

Given two estimators A and B with the same features (i.e. requested precision and thus the bitmap length "A.maxItems == B.maxItems"), the estimators may be merged using this algorithm:

  1. let A.level <= B.level (switch the estimators if this does not hold)
  2. insert all the items from list A.bitmap to B.bitmap (using the algorithm described above)
  3. the result is the modified estimator B

The resulting estimator is exactly the same as if we have used one estimator right from the beginning.

Parallelization

How to use this to parallelize the algorithm? We can split the dataset into several parts - in case of PostgreSQL we can use partitioned tables - and build the AdaptiveCounter estimator for each partition independently (in a separate process).

The main process then collects these 'partial estimators' (instances of AdaptiveCounter), merges them in one final estimator and generates the estimato for the whole table.

Comments

There are no comments for this article (or are awaiting acceptance).

New comment

All the comments have to be accepted, so there may be some delay between submitting and accepting (or rejecting) the comment. If you enter the e-mail address, you will be informed about acceptance or rejection.

Subject or body may not contain HTML tags - they will be automatically removed. Paragraphs may be separated using a newline (ENTER).

(optional)