
Saturday, August 23, 2025

HyperLogLog


 

In this post we will review the HyperLogLog algorithm (HLL) and its extension, the HyperLogLog++ algorithm (HLL++).


The Goal

The goal of HLL is to estimate the cardinality of a set. For example, let's assume we get requests from multiple source IPs, and we want to count the number of distinct source IPs.


A Naive Approach

The naive solution would be to keep a map of source IPs, for example:


sourceIps := make(map[string]bool)
for _, sourceIp := range incomingRequests {
	sourceIps[sourceIp] = true
}

// the number of distinct source IPs is the size of the map
fmt.Printf("distinct source IPs: %v\n", len(sourceIps))


The problem with such a solution is its scale. If we have 1G (one billion) distinct source IPs, this map would require about:

1G entries * size of entry =~ 10 GB RAM


Hence, for a very large number of items, this is not feasible.


Leading Zeros

The core idea of HLL is based on the probability of leading zeros. If we assume our items are uniformly distributed, we can estimate the number of distinct items by tracking the maximum number of leading zeros in the bit representation of the items. To get a uniformly distributed input we hash each item.

For example, let's assume our input is a stream of random 16-bit numbers.


The probability to get an item with one leading zero: 0xxx xxxx xxxx xxxx is 0.5.

The probability to get an item with two leading zeros: 00xx xxxx xxxx xxxx is 0.25.

The probability to get an item with three leading zeros: 000x xxxx xxxx xxxx is 0.125.

...

The probability to get an item with k leading zeros is 0.5^k.


Hence we can estimate the cardinality by tracking only a single number: the maximum count of leading zeros encountered in the input.


If the max is 1 leading zero, we assume a cardinality of 2 items.

If the max is 2 leading zeros, we assume a cardinality of 4 items.

If the max is 3 leading zeros, we assume a cardinality of 8 items.

...

If the max is k leading zeros, we assume a cardinality of 2^k items.


This is of course an estimation and is prone to errors.
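
A minimal Go sketch of this idea (illustrative only; the hash64 helper, the FNV-1a hash choice, and the sample IPs are assumptions, not part of HLL itself):


package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
)

// hash64 maps an arbitrary string to a roughly uniformly distributed 64-bit value.
func hash64(value string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(value))
	return h.Sum64()
}

func main() {
	maxLeadingZeros := 0
	for _, sourceIp := range []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"} {
		zeros := bits.LeadingZeros64(hash64(sourceIp))
		if zeros > maxLeadingZeros {
			maxLeadingZeros = zeros
		}
	}
	// a max of k leading zeros suggests roughly 2^k distinct items
	fmt.Printf("max leading zeros: %v, estimated cardinality: %v\n",
		maxLeadingZeros, uint64(1)<<maxLeadingZeros)
}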


Buckets

To improve the estimation, HLL uses buckets. 

Buckets group the input items; for each bucket we keep track of the max leading zeros seen. The number of buckets is 2^p, and each item is assigned to a bucket according to the first p bits of its (hashed) value.


For example:

Let's assume p=3, hence we have 2^3 = 8 buckets, and review a sequence of inputs.


For input 0010 0001 0001 0001 we have 2 leading zeros.
The first 3 bits of the input are 001, hence we update bucket1=2 max leading zeros.

For input 0110 0001 0001 0001 we have 1 leading zero.
The first 3 bits of the input are 011, hence we update bucket3=1 max leading zeros.

For input 0000 0001 0001 0001 we have 7 leading zeros.
The first 3 bits of the input are 000, hence we update bucket0=7 max leading zeros.

For input 0000 1111 1111 1111 we have 4 leading zeros.
The first 3 bits of the input are 000, but bucket0 already has max leading zeros = 7 so we do not update it.


So the results are:


bucket0 has max 7 leading zeros --> probability 2^-7 = 0.0078125

bucket1 has max 2 leading zeros --> probability 2^-2 = 0.25

bucket3 has max 1 leading zero --> probability 2^-1 = 0.5

The remaining 5 buckets saw no input and stay at their initial value of 0 leading zeros --> probability 2^0 = 1 each.


Now we use the harmonic mean of the per-bucket probabilities to estimate the number of items. Note that the empty buckets are included in the sum.

H = number of buckets / sum(probability of bucket i)

H = 8 buckets / (2^-7 + 0.25 + 0.5 + 5 * 1) = 8 / 5.7578125 =~ 1.39
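
A small Go sketch of the bucket update and the harmonic mean, reproducing the example above (an illustrative toy, not a library API; for simplicity it follows the post's convention of counting leading zeros over the full 16-bit value):


package main

import (
	"fmt"
	"math"
	"math/bits"
)

func main() {
	const p = 3
	const m = 1 << p // 8 buckets
	buckets := make([]int, m)

	// the four example inputs from the post
	inputs := []uint16{0x2111, 0x6111, 0x0111, 0x0FFF}
	for _, x := range inputs {
		bucket := x >> (16 - p)         // first p bits select the bucket
		zeros := bits.LeadingZeros16(x) // leading zeros of the full value
		if zeros > buckets[bucket] {
			buckets[bucket] = zeros
		}
	}

	// harmonic mean: empty buckets contribute 2^0 = 1 to the sum
	sum := 0.0
	for _, zeros := range buckets {
		sum += math.Pow(2, -float64(zeros))
	}
	h := float64(m) / sum
	fmt.Printf("buckets: %v\n", buckets) // [7 2 0 1 0 0 0 0]
	fmt.Printf("H =~ %.4f\n", h)         // ~1.3894
}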


Estimated Cardinality 

Once we have the harmonic mean, we use the following to estimate the cardinality:


alpha = 0.7213 / (1 + 1.079 / buckets) = 0.7213 / (1 + 1.079 / 8) = 0.635576605


The 0.7213 and 1.079 constants are used for alpha regardless of the bucket count. They were derived as part of the research on calibrating HLL.


E = alpha * buckets * H = 0.635576605 * 8 * 1.39 =~ 7


Hence the estimated cardinality is about 7, which is still far from the actual cardinality of 4 items in our example.
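
The final step can be checked with a few lines of Go (again illustrative only; m and H are the values computed in the example above):


package main

import "fmt"

func main() {
	const m = 8.0    // number of buckets from the example
	const h = 1.3894 // harmonic mean computed above
	alpha := 0.7213 / (1 + 1.079/m) // =~ 0.6356
	estimate := alpha * m * h       // =~ 7.06
	fmt.Printf("estimated cardinality =~ %.2f\n", estimate)
}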


Notice that HLL performs poorly on small sets; this is one of the issues HLL++ addresses below. On large sets the error is roughly 2%.


Memory Requirements

Unlike the naive solution, HLL requires memory only for the bucket registers. A common configuration is p = 14, i.e. 2^14 =~ 16K registers; with 8 bits per register this is about 16 KB of memory, regardless of how many distinct items are counted.
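
A quick back-of-the-envelope check in Go (the 8-bit register size is an assumption; real implementations often pack registers into fewer bits):


package main

import "fmt"

func main() {
	const p = 14
	registers := 1 << p        // 16384 buckets
	const bytesPerRegister = 1 // assuming 8-bit registers
	fmt.Printf("sketch size: %d KB\n", registers*bytesPerRegister/1024) // 16 KB
}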


HLL++

HLL++ is an improvement over HLL by Google.

It uses a special algorithm for small datasets, and automatically switches to HLL once the input data is large enough.

It uses a sparse representation for the buckets to reduce the memory footprint for small datasets.

In addition, HLL++ uses automatic bias correction according to the dataset size.


HLL++ Implementation Example


The following example uses the lytics HLL library.


import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"

	"github.com/lytics/hll"
)

func CheckHllPlusPlus() {
	const p = 14      // Max memory usage is 0.75 * 2^p bytes
	const pPrime = 25 // Setting this is a bit more complicated, Google recommends 25.
	myHll := hll.NewHll(p, pPrime)

	myHash := func(value string) uint64 {
		sha1Hash := sha1.Sum([]byte(value))
		return binary.LittleEndian.Uint64(sha1Hash[0:8])
	}

	for i := range 1000 {
		value := fmt.Sprintf("%v", i)
		hashed := myHash(value)
		myHll.Add(hashed)
	}

	fmt.Printf("estimated cardinality %v\n", myHll.Cardinality())
}


