In this post we will review the HyperLogLog algorithm (HLL) and its extension, the HyperLogLog++ algorithm (HLL++).
The Goal
The goal of HLL is to estimate the cardinality of a set. For example, let's assume we get requests from multiple source IPs, and we want to count the number of distinct source IPs.
A Naive Approach
The naive solution would be to keep a map of source IPs, for example:
sourceIps := make(map[string]bool)
for _, sourceIp := range incomingRequests {
    sourceIps[sourceIp] = true
}
fmt.Printf("distinct source IPs amount: %v\n", len(sourceIps))
The problem with such a solution is its scale. If we have 1G distinct source IPs, this map would require about:
1G entries * size of entry =~ 10 GB RAM
Hence for a very large number of items this is not feasible.
Leading Zeros
The base observation behind HLL is about the probability of leading zeros. If our items are uniformly distributed, we can estimate the number of items by tracking the maximum number of leading zeros in the bit representation of the items. To get uniformly distributed input we run each item through a hash function.
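For example, here is a minimal sketch of such a hash step, using the standard library's FNV-1a hash (any well-distributed 64-bit hash would do; hashToUint64 is just an illustrative helper, not part of HLL itself):
import "hash/fnv"

// hashToUint64 maps an arbitrary input string to a 64-bit value.
// A good hash spreads its outputs roughly uniformly over all 64 bits,
// which is the property the leading-zeros argument below relies on.
func hashToUint64(value string) uint64 {
    h := fnv.New64a()
    h.Write([]byte(value))
    return h.Sum64()
}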
For example, let's assume our input is random 16-bit numbers.
The probability to get an item with one leading zero: 0xxx xxxx xxxx xxxx is 0.5.
The probability to get an item with two leading zeros: 00xx xxxx xxxx xxxx is 0.25.
The probability to get an item with three leading zeros: 000x xxxx xxxx xxxx is 0.125.
...
The probability to get an item with k leading zeros is 0.5^k.
Hence we only need to keep track of the maximum number of leading zeros encountered in our input:
If the max is 1 leading zero, we assume a cardinality of 2 items.
If the max is 2 leading zeros, we assume a cardinality of 4 items.
If the max is 3 leading zeros, we assume a cardinality of 8 items.
...
If the max is k leading zeros, we assume a cardinality of 2^k items.
This is of course an estimation and is prone to errors.
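As a small sketch (assuming 64-bit hashed inputs, as produced above), this is all the bookkeeping the basic idea needs:
import (
    "math"
    "math/bits"
)

// EstimateByLeadingZeros tracks only the maximum number of leading zeros
// seen over all hashed items and estimates the cardinality as 2^max.
func EstimateByLeadingZeros(hashes []uint64) float64 {
    maxZeros := 0
    for _, h := range hashes {
        zeros := bits.LeadingZeros64(h) // leading zero bits of this item
        if zeros > maxZeros {
            maxZeros = zeros
        }
    }
    return math.Pow(2, float64(maxZeros)) // 2^maxZeros
}
Note that a single counter like this can only produce power-of-two estimates, which is part of why it is so noisy; the buckets described next address this.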
Buckets
To improve the estimation, HLL uses buckets.
Buckets are groups of input items; for each bucket we keep track of the maximum number of leading zeros seen. The number of buckets is 2^p, and the first p bits of the input select the bucket.
For example:
Let's assume p=3, hence we have 2^3=8 buckets, and review a sequence of inputs.
For input 0010 0001 0001 0001 we have 2 leading zeros.
The first 3 bits of the input are 001, hence we update bucket1=2 max leading zeros.
For input 0110 0001 0001 0001 we have 1 leading zero.
The first 3 bits of the input are 011, hence we update bucket3=1 max leading zeros.
For input 0000 0001 0001 0001 we have 7 leading zeros.
The first 3 bits of the input are 000, hence we update bucket0=7 max leading zeros.
For input 0000 1111 1111 1111 we have 4 leading zeros.
The first 3 bits of the input are 000, but bucket0 already has max leading zeros = 7 so we do not update it.
So the results are:
bucket0 has max 7 leading zeros --> probability 2^-7
bucket1 has max 2 leading zeros --> probability 2^-2 = 0.25
bucket3 has max 1 leading zero --> probability 2^-1 = 0.5
the other 5 buckets are empty (0 leading zeros) --> probability 2^0 = 1 each
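A minimal sketch of this bucket bookkeeping, following the simplified convention of the walkthrough above (leading zeros counted over the whole 16-bit value, the first p bits selecting the bucket):
import (
    "fmt"
    "math/bits"
)

// BucketExample replays the four example inputs with p=3 and prints the
// resulting per-bucket maximum leading zeros.
func BucketExample() {
    const p = 3
    registers := make([]int, 1<<p) // 2^3 = 8 buckets, all start at 0

    inputs := []uint16{
        0b0010_0001_0001_0001,
        0b0110_0001_0001_0001,
        0b0000_0001_0001_0001,
        0b0000_1111_1111_1111,
    }
    for _, x := range inputs {
        bucket := x >> (16 - p)         // first p bits select the bucket
        zeros := bits.LeadingZeros16(x) // leading zeros of the 16-bit input
        if zeros > registers[bucket] {
            registers[bucket] = zeros
        }
    }
    fmt.Println(registers) // [7 2 0 1 0 0 0 0]
}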
Now we use the harmonic mean of the per-bucket values to estimate the number of items. The 5 empty buckets take part in the sum as well, each contributing 2^0 = 1.
H = number of buckets / sum( 2^-(max leading zeros of bucket i) )
H = 8 buckets / (2^-7 + 2^-2 + 2^-1 + 5 * 2^0) = 8 / 5.7578125 =~ 1.389
Estimated Cardinality
Once we have the harmonic mean, we use the following to estimate the cardinality:
alpha = 0.7213 / (1 + 1.079 / buckets) = 0.7213 / (1 + 1.079 / 8) = 0.635576605
The 0.7213 and 1.079 constants are used for alpha regardless of the bucket count. They were derived as part of the research calibrating HLL.
E = alpha * buckets * H = 0.635576605 * 8 * 1.389 =~ 7
Hence the estimated cardinality is about 7, which is far from the actual cardinality of 4 items in our example.
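A small sketch that reproduces the numbers above from the example registers (empty buckets hold 0 and contribute 2^0 = 1 each):
import (
    "fmt"
    "math"
)

// EstimateExample computes the harmonic mean, alpha and the estimate E
// for the register values of the p=3 example above.
func EstimateExample() {
    registers := []int{7, 2, 0, 1, 0, 0, 0, 0}
    m := float64(len(registers)) // 8 buckets

    sum := 0.0
    for _, r := range registers {
        sum += math.Pow(2, -float64(r)) // each bucket contributes 2^-r
    }
    h := m / sum                    // harmonic mean: 8 / 5.7578125 =~ 1.389
    alpha := 0.7213 / (1 + 1.079/m) // =~ 0.6356
    e := alpha * m * h              // =~ 7
    fmt.Printf("H=%.3f alpha=%.4f E=%.1f\n", h, alpha, e)
}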
Notice that HLL performs poorly on small sets; on large sets the typical error is around 2%.
Memory Requirements
Unlike the naive solution, HLL requires memory only for the bucket registers. A common configuration is p = 14, which gives 2^14 =~ 16K registers of one byte each, i.e. about 16 KB of memory regardless of the number of distinct items.
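For a rough sense of scale, assuming one byte per register, the entire HLL state for p = 14 is just:
const p = 14
var registers = make([]uint8, 1<<p) // 2^14 = 16,384 bytes =~ 16 KB, no matter how many items are added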
HLL++
HLL++ is an improvement by Google over HLL.
It uses a dedicated algorithm for small datasets, and automatically switches to HLL once the input data is large enough.
It uses a sparse representation of the buckets to reduce the memory footprint for small datasets.
In addition, HLL++ applies automatic bias correction according to the dataset size.
HLL++ Implementation Example
The following example uses the lytics HLL library.
import (
    "crypto/sha1"
    "encoding/binary"
    "fmt"

    "github.com/lytics/hll"
)

func CheckHllPlusPlus() {
    const p = 14      // Max memory usage is 0.75 * 2^p bytes.
    const pPrime = 25 // Setting this is a bit more complicated, Google recommends 25.
    myHll := hll.NewHll(p, pPrime)

    // The library expects a 64-bit hash of each value; here we take the
    // first 8 bytes of a SHA-1 digest.
    myHash := func(value string) uint64 {
        sha1Hash := sha1.Sum([]byte(value))
        return binary.LittleEndian.Uint64(sha1Hash[0:8])
    }

    // Add 1000 distinct values and check the estimate.
    for i := range 1000 {
        value := fmt.Sprintf("%v", i)
        hashed := myHash(value)
        myHll.Add(hashed)
    }
    fmt.Printf("estimated cardinality %v\n", myHll.Cardinality())
}