spark-hyperloglog now pre-installed on ATMO Inbox

Jeff Klukas jklukas at mozilla.com
Wed Jul 25 14:36:55 UTC 2018


Looks like I partially misspoke here. Spark has a built-in
approx_count_distinct function that uses the HyperLogLog algorithm under
the hood. If you only need distinct counts in an analysis, you should
definitely prefer that function over the spark-hyperloglog package.

The advantage of spark-hyperloglog is that it allows us to create HLL data
structures ahead of time and put them in long-term storage. So you will
likely only need this package if you want to save HLL structures to an
intermediate dataset.
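
To make the "store now, merge later" idea concrete, here is a toy
pure-Python HyperLogLog sketch. It is NOT the spark-hyperloglog API and
is not compatible with its storage format; the ToyHLL class, the
sketch_of helper, and the client ids are all hypothetical. It only shows
why saved HLL structures are useful: per-day sketches can be stored as
bytes and merged later without going back to the raw data.

```python
import hashlib
import math


class ToyHLL:
    """Toy HyperLogLog; illustration only, not the spark-hyperloglog API."""

    def __init__(self, p=10, registers=None):
        self.p = p              # number of hash bits used to pick a register
        self.m = 1 << p         # number of registers
        if registers is None:
            self.registers = bytearray(self.m)
        else:
            self.registers = bytearray(registers)

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")            # 64-bit hash value
        idx = x >> (64 - self.p)                         # top p bits: register
        rest = x & ((1 << (64 - self.p)) - 1)            # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1     # leftmost 1-bit position
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different precision")
        # The union of two sketches is the register-wise maximum.
        merged = bytes(max(a, b) for a, b in zip(self.registers, other.registers))
        return ToyHLL(self.p, merged)

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)            # constant for m >= 128
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)

    def to_bytes(self):
        return bytes(self.registers)


def sketch_of(client_ids):
    hll = ToyHLL()
    for cid in client_ids:
        hll.add(cid)
    return hll


# Hypothetical per-day client ids with some overlap between days.
day1 = [f"client-{i}" for i in range(0, 6000)]
day2 = [f"client-{i}" for i in range(4000, 10000)]

# "Long-term storage": keep only the serialized sketches, not the raw ids.
stored = {
    "day1": sketch_of(day1).to_bytes(),
    "day2": sketch_of(day2).to_bytes(),
}

# Later: restore the sketches and merge them to estimate distinct clients
# across both days.
combined = ToyHLL(registers=stored["day1"]).merge(ToyHLL(registers=stored["day2"]))
estimate = combined.cardinality()
print(estimate)
```

Each stored sketch here is a fixed 1024 bytes regardless of how many
clients it saw, which is what makes keeping them in an intermediate
dataset cheap.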