The main trick behind this algorithm is that if you observe a stream of random integers and see an integer whose binary representation starts with some known prefix, there is a higher chance that the cardinality of the stream is 2^(size of the prefix).
That is, in a random stream of integers, ~50% of the numbers (in binary) start with “1”, 25% start with “01”, and 12.5% start with “001”. This means that if you observe a random stream and see a “001”, there is a higher chance that this stream has a cardinality of 8.
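This prefix distribution is easy to check empirically. The sketch below (plain Python; the variable names and sample size are illustrative) draws random 32-bit integers and tallies how often their binary form starts with each prefix:

```python
import random

random.seed(42)

# Draw random 32-bit integers and count how often their binary
# representation starts with "1", "01", and "001".
n = 100_000
counts = {"1": 0, "01": 0, "001": 0}
for _ in range(n):
    bits = format(random.getrandbits(32), "032b")
    for prefix in counts:
        if bits.startswith(prefix):
            counts[prefix] += 1

for prefix, c in counts.items():
    print(prefix, c / n)   # roughly 0.5, 0.25, 0.125
```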
With a single HyperLogLog structure using 2.56 KB you can count the number of unique items up to approximately 7,900,000,000 items with 1.625% error. This can be very efficient for analytics applications, when, for example, you want to calculate how many unique users have visited a URL.
A 1.625% error means that if we were trying to count the number of unique license plate numbers for cars, and our HyperLogLog counter said there were 654,192,028, we would be confident that the actual number is between 643,561,407 and 664,822,648. Furthermore, if this accuracy is not sufficient, you can simply add more memory to the structure and it will perform better. Giving it 40.96 KB of resources will decrease the error from 1.625% to 0.4%. However, storing this data in a set would take 3.925 GB, even assuming no overhead!
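These figures follow from the standard error of the HyperLogLog estimate, roughly 1.04/√m, where m is the number of registers (buckets). Assuming 5-bit registers, 2.56 KB holds 4,096 registers and 40.96 KB holds 65,536, which reproduces the two error rates above:

```python
import math

def hll_std_error(num_registers):
    # Standard error of the HyperLogLog cardinality estimate
    return 1.04 / math.sqrt(num_registers)

# 2.56 KB at 5 bits per register -> 4096 registers
print(hll_std_error(4096))    # 0.01625  (1.625%)
# 40.96 KB at 5 bits per register -> 65536 registers
print(hll_std_error(65536))   # 0.0040625  (~0.4%)
```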
For example, the traditional way to calculate this type of aggregation requires a hash table, which could explode your data storage needs. HyperLogLog has a much smaller memory footprint, but the trade-off is accuracy, which you can tune. For example, in 1,280 bytes HyperLogLog can estimate the count of tens of billions of distinct values with only a small percentage of error.
An interesting implementation of HyperLogLog is java-hll.
The java-hll implementation also provides a toBytes method that we can use to serialize the data structure into a Hadoop serialization framework like Avro, and a fromBytes method for deserialization. Some databases, for example Redis, support HyperLogLog as a native data structure. HLL shines in real-time or streaming environments, where we can get aggregated counts very fast, albeit with a small percentage of error. If you are implementing a Lambda architecture, this error can be corrected in the batch layer later on.
- The 12 KB bounded size for a practically unbounded set (read: billions of items) is extremely memory efficient (in Redis).
- The operation PFCOUNT is fast enough for real-time queries. Reading directly from it for front-end dashboards is entirely possible.
Unions in HLL are said to be ‘lossless’. In other words, they work extremely well. So, to get the count of the union of two sets, we can ‘merge’ the HLL data structures representing the two sets into a new HLL data structure and get the count of the result. The merge operation for two HLLs with an equal number of buckets involves taking the maximum value of each pair of buckets and assigning it as the value of the corresponding bucket in the resulting HLL.
To see the intuition behind this, remember that the HLL algorithm only maintains the maximum number of consecutive zeros seen in the hashes of the items mapped to a given bucket. So, if two items hash to the same bucket, the one with the larger number of zeros contributes the value stored in the bucket. Hence, merging HLLs as described above is equivalent to replaying the combined stream of the original items.
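A minimal sketch of this property (the MiniHLL class, its hash function, and its parameters are illustrative, not a production implementation): merging two register arrays bucket-wise yields exactly the registers you would get by hashing both streams into a single structure.

```python
import hashlib

def hash64(item):
    # 64-bit hash derived from md5 (an illustrative choice)
    return int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], "big")

class MiniHLL:
    def __init__(self, p=10):
        self.p = p                      # 2**p buckets
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = hash64(item)
        bucket = h >> (64 - self.p)     # first p bits select the bucket
        rest = h & ((1 << (64 - self.p)) - 1)
        # rank = 1 + number of leading zeros in the remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def merge(self, other):
        assert self.m == other.m        # bucket counts must match
        out = MiniHLL(self.p)
        out.registers = [max(x, y) for x, y in
                         zip(self.registers, other.registers)]
        return out

# Two overlapping streams, plus one structure fed the combined stream:
a, b, combined = MiniHLL(), MiniHLL(), MiniHLL()
for x in range(0, 6000):
    a.add(x); combined.add(x)
for x in range(4000, 10000):
    b.add(x); combined.add(x)

# Merged registers are identical to those of the replayed union:
print(a.merge(b).registers == combined.registers)   # True
```

This works because taking the per-bucket maximum is associative and commutative, so the order in which items (or partial sketches) arrive does not matter.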
In Redis, there is explicit support for this operation via a command called PFMERGE:
- PFMERGE <result> <key1> <key2> … – Merges the HLLs stored at key1, key2, etc. into result.
One can issue PFCOUNT <result> to get the cardinality of the union.
Some interesting points about unions:
- The merge operation is associative – hence we can merge multiple keys together into one single result, as indeed the PFMERGE command allows in Redis.
- The merge operation is parallelizable with respect to the buckets and hence can be implemented very fast.
- The merge operation assumes that the number of buckets is the same across the sets being merged. When the sizes differ, there is a way to ‘fold’ the HLL data structures with a larger number of buckets into an HLL with the smallest number of buckets. This is described well here. I suppose the scenario of different bucket sizes arises if we are using a library that lets us create an HLL with a specified bucket count. Of course, this is not applicable to Redis.
Intersections in HLL are not lossless. Appealing again to our intuitive explanation for unions: imagine replaying a stream containing only the items common to two sets. If we then add the elements that appear in only one of the sets, it is easy to see that a bucket’s value could be masked by a larger value contributed by one of those non-shared items, so the intersection cannot be recovered from the merged buckets.
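One common way to estimate an intersection anyway is inclusion-exclusion on union counts. The sketch below uses hypothetical PFCOUNT results; note that the errors of the three HLL estimates compound, which is why this approach degrades for small intersections of large sets:

```python
# Hypothetical PFCOUNT results for two keys and their PFMERGE result
count_a = 1_000_000        # e.g. PFCOUNT of the first key
count_b = 800_000          # e.g. PFCOUNT of the second key
count_union = 1_500_000    # e.g. PFCOUNT of the merged key

# |A intersect B| = |A| + |B| - |A union B|
count_intersection = count_a + count_b - count_union
print(count_intersection)  # 300000
```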
One possible way to work around this limitation is to consider whether it is acceptable to maintain another key just for the intersection. For example, we could maintain one HLL for users who visited a web page, say users:<page>, and another for users of that page from each locality, like users:<page>:<locality>. A stream processing framework would update both keys for an incoming event.
The good parts about this approach are:
- The number of updates is bounded by the number of combinations of dimensions we want to count. This can be done in a streaming manner for a small number of combinations.
- Reads of intersection counts will be fast and accurate too.
The issues with this approach are:
- It is easy to see that this can become a combinatorial nightmare with many different combinations of dimensions to maintain.
- Each intersection key takes more storage space and hence causes more load, particularly for in-memory systems like Redis.
- The method only works if all the dimensional information arrives in the same event. For example, if we got information about users visiting pages from one source and user-locality information from another, there would be no way to update the intersection key in a straightforward manner without doing a join.
*** Best Method to Intersect Huge HyperLogLogs in Redis. There are basically two techniques:
1. Inclusion-Exclusion Principle – the error margin is high for large sets.
2. Jaccard Index Intersection/MinHash – this seems to be the way to go.
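A minimal MinHash sketch of the second technique (the hashing scheme, seeds, and set sizes below are illustrative assumptions): the fraction of matching per-seed minima estimates the Jaccard index, and multiplying it by the union count (in practice, the PFCOUNT of a merged key) gives an intersection estimate.

```python
import hashlib

def minhash_signature(items, k=128):
    # One minimum per seeded hash function; md5 used for illustration only
    sig = []
    for seed in range(k):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{x}".encode()).digest()[:8],
                           "big")
            for x in items))
    return sig

def jaccard_estimate(sig_a, sig_b):
    # Fraction of matching minima approximates |A ∩ B| / |A ∪ B|
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = set(range(0, 200))
b = set(range(100, 300))           # true Jaccard index = 100/300 ≈ 0.33

j = jaccard_estimate(minhash_signature(a), minhash_signature(b))
est_intersection = j * len(a | b)  # would use the HLL union count in practice
print(round(j, 2), round(est_intersection))
```

With k hash functions the standard error of the Jaccard estimate is roughly √(J(1−J)/k), so accuracy can be tuned by the signature length, much like HLL register counts.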