spark-hyperloglog now pre-installed on ATMO Inbox
Jeff Klukas
jklukas at mozilla.com
Wed Jul 25 14:36:55 UTC 2018
Looks like I partially misspoke here. Spark has a built-in
approx_count_distinct function that uses the HyperLogLog algorithm under
the hood, and you should definitely prefer that over the spark-hyperloglog
package if you need distinct counts in an analysis.
The advantage of spark-hyperloglog is that it allows us to create HLL data
structures ahead of time and put them in long-term storage. So you will
likely only need this package if you want to save HLL structures to an
intermediate dataset.
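
To illustrate why a stored HLL structure is useful, here is a minimal
pure-Python sketch of the idea. It is not the spark-hyperloglog
implementation or its storage format; the HyperLogLog class, its methods,
and the "client-N" IDs are hypothetical names made up for this example.
It assumes a 64-bit hash taken from SHA-1 and 2**12 registers. The point
is that two sketches built separately can be serialized, read back later,
and merged into an estimate for the combined set without touching the
original rows.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog sketch (hypothetical, for illustration only)."""

    def __init__(self, p=12):
        self.p = p                      # number of hash bits used as register index
        self.m = 1 << p                 # number of registers
        self.registers = bytearray(self.m)

    def add(self, item):
        # Take the first 8 bytes of SHA-1 as a 64-bit hash (an assumption of
        # this sketch; real implementations use faster hash functions).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # The union of two sketches is the element-wise max of their registers.
        assert self.p == other.p, "sketches must use the same precision"
        merged = HyperLogLog(self.p)
        merged.registers = bytearray(
            max(a, b) for a, b in zip(self.registers, other.registers)
        )
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw

    def to_bytes(self):
        # One byte for the precision, then one byte per register.
        return bytes([self.p]) + bytes(self.registers)

    @classmethod
    def from_bytes(cls, data):
        hll = cls(data[0])
        hll.registers = bytearray(data[1:])
        return hll


# Build one sketch per "day" with overlapping client IDs.
day1 = HyperLogLog()
day2 = HyperLogLog()
for i in range(0, 60000):
    day1.add(f"client-{i}")
for i in range(40000, 100000):
    day2.add(f"client-{i}")

# Pretend these bytes were written to an intermediate dataset and read back.
stored = [day1.to_bytes(), day2.to_bytes()]
restored = [HyperLogLog.from_bytes(b) for b in stored]

# Merge the stored sketches to estimate distinct clients across both days.
combined = restored[0].merge(restored[1])
estimate = combined.count()
print(round(estimate))
```

The true distinct count across both days is 100000, and the printed
estimate should land within a few percent of that. Exact distinct counts
cannot be combined this way, because summing per-day counts would
double-count the clients seen on both days.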