spark-hyperloglog now pre-installed on ATMO

Jeff Klukas jklukas at mozilla.com
Wed Jul 25 13:38:46 UTC 2018


Hi data users,

We deployed a change yesterday that pre-installs the spark-hyperloglog [0]
Scala package and its Python bindings on all new ATMO [1] clusters.
HyperLogLog is an efficient algorithm for approximating the number of
distinct entries in a large dataset, and spark-hyperloglog is the
implementation of that algorithm that we use in our Spark-based data
processing jobs.
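
For anyone curious how HyperLogLog gets away with so little memory, here is
a minimal pure-Python sketch of the general idea. It is a toy illustration
only, not the spark-hyperloglog implementation: the class name
HyperLogLogSketch, the choice of SHA-1 as the hash, and the default of 12
index bits are all assumptions made for this example.

```python
import hashlib
import math


class HyperLogLogSketch:
    """Toy HyperLogLog counter (hypothetical class, for illustration only)."""

    def __init__(self, bits=12):
        if bits < 7:
            # The bias constant used in cardinality() assumes >= 128 registers.
            raise ValueError("bits must be at least 7")
        self.bits = bits
        self.m = 1 << bits            # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash from the item (SHA-1 is an arbitrary choice).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.bits
        index = h >> width                 # top bits select a register
        rest = h & ((1 << width) - 1)      # remaining bits
        # Rank = position of the leftmost 1-bit within the remaining bits.
        rank = width - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        # Two sketches of the same size combine by taking per-register maxima.
        if self.bits != other.bits:
            raise ValueError("sketches must use the same number of bits")
        merged = HyperLogLogSketch(self.bits)
        merged.registers = [max(a, b) for a, b in
                            zip(self.registers, other.registers)]
        return merged

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

The sketch only ever stores 2^bits small integers, no matter how many items
are added, and sketches built on separate partitions can be merged before
the final estimate is taken. That is what makes the approach a good fit for
distributed jobs. With 12 index bits the expected relative error is on the
order of a couple of percent.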

You should now be able to run `from pyspark_hyperloglog import hll` in your
notebooks on any new ATMO cluster without installing additional software.
Using the hll function will be much faster than Spark's built-in
count distinct functionality when working with datasets larger than a few
GB. The package is also available on Databricks clusters.

Message us in #datapipeline on IRC if you have any questions or notice any
unexpected changes in behavior on ATMO clusters or jobs.

Bug tracking this work: https://bugzilla.mozilla.org/show_bug.cgi?id=1466936

[0] https://github.com/mozilla/spark-hyperloglog
[1] https://analysis.telemetry.mozilla.org/


More information about the Fx-data-dev mailing list