Parallelizing DISTINCT estimation
While studying algorithms for DISTINCT estimation, I noticed an interesting feature of one of them, adaptive sampling: it is quite simple to parallelize. The input dataset can be split into partitions, an estimate can be computed for each partition (in a separate process), and these partial results can then be combined into the final estimate. Let's look at the principle that allows such parallelization and how it might be used.
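
To make the split/estimate/combine idea concrete, here is a minimal Python sketch. It assumes the textbook form of adaptive sampling (keep only hashes whose low `depth` bits are zero, and raise `depth` whenever the sample outgrows a fixed capacity); the class `AdaptiveSampler`, the helper `build`, the capacity of 1024 and the use of SHA-1 as the hash are all my own illustrative choices, not taken from any particular implementation. The partitions are processed in a plain loop here; in practice each one would be handled by a separate process.

```python
import hashlib


class AdaptiveSampler:
    """Hypothetical minimal adaptive-sampling estimator (illustration only)."""

    def __init__(self, capacity=1024):
        self.capacity = capacity  # maximum number of hashes kept in the sample
        self.depth = 0            # keep hashes whose low `depth` bits are zero
        self.sample = set()

    @staticmethod
    def _hash(value):
        # Assumption: any well-mixed 64-bit hash works; SHA-1 is used for determinism.
        digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def _keeps(self, h):
        # True if the low `depth` bits of the hash are all zero.
        return (h & ((1 << self.depth) - 1)) == 0

    def _shrink(self):
        # Raise the depth until the sample fits into the capacity again.
        while len(self.sample) > self.capacity:
            self.depth += 1
            self.sample = {h for h in self.sample if self._keeps(h)}

    def add(self, value):
        h = self._hash(value)
        if self._keeps(h):
            self.sample.add(h)
            self._shrink()

    def merge(self, other):
        # Combine two partial results: move both to the larger depth,
        # union the surviving hashes, then shrink if the union is too big.
        merged = AdaptiveSampler(min(self.capacity, other.capacity))
        merged.depth = max(self.depth, other.depth)
        merged.sample = {h for h in self.sample | other.sample if merged._keeps(h)}
        merged._shrink()
        return merged

    def estimate(self):
        # Each kept hash stands for 2^depth distinct values.
        return len(self.sample) * (1 << self.depth)


def build(values, capacity=1024):
    """Hypothetical helper: run the estimator over one partition."""
    sampler = AdaptiveSampler(capacity)
    for v in values:
        sampler.add(v)
    return sampler


data = list(range(20000))                    # 20000 distinct values
partitions = [data[i::4] for i in range(4)]  # split into 4 partitions

partials = [build(p) for p in partitions]    # one partial result per partition
merged = partials[0]
for partial in partials[1:]:
    merged = merged.merge(partial)

serial = build(data)                         # single pass over the whole input
print(merged.estimate(), serial.estimate())
```

With equal capacities, the merged state in this sketch ends up identical to the state of a single sampler fed the whole input: the final depth is simply the smallest depth at which the surviving hashes fit into the capacity, and that does not depend on how the input was split or ordered.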




