Odd notebook performance behavior

Mark Reid mreid at mozilla.com
Wed Mar 15 14:05:59 UTC 2017


This is not entirely unexpected when using the Dataset API. The
"sample=.01" parameter is applied at the level of files stored on
S3[1]. As pings come in, the ingestion pipeline batches them up by several
dimensions (including channel, build, fx version, etc.) and saves them to
S3 when a batch reaches a certain size or after a certain timeout.
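The flush-on-size-or-timeout behavior might look roughly like this (a
minimal sketch; the class, parameter names, and thresholds are hypothetical,
not the actual pipeline code):

```python
import time

class BatchAccumulator:
    """Sketch of the batching described above: pings accumulate per
    dimension key and are flushed to S3 when a batch grows large
    enough or gets too old. Names and defaults are illustrative."""

    def __init__(self, max_bytes=500_000_000, max_age_seconds=300, flush=print):
        self.max_bytes = max_bytes
        self.max_age = max_age_seconds
        self.flush = flush  # stand-in for the actual S3 upload
        self.batches = {}   # dimension key -> (records, total_bytes, created_at)

    def add(self, dimensions, record):
        # Uncommon dimension combinations (e.g. nightly builds) rarely
        # hit max_bytes, so they flush on timeout as small S3 objects.
        key = tuple(sorted(dimensions.items()))
        records, size, created = self.batches.get(key, ([], 0, time.time()))
        records.append(record)
        size += len(record)
        if size >= self.max_bytes or time.time() - created >= self.max_age:
            self.flush(key, records)  # would write one S3 object here
            self.batches.pop(key, None)
        else:
            self.batches[key] = (records, size, created)
```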

This generally results in a large number of small files (for uncommon
long-tail combinations of dimensions such as nightly builds) and a smaller
number of large files (such as release on the current version).

When loading the data back out of S3, the Dataset code batches these S3
objects into approximately equal sets by size, but if there are only a
few large files to read, they will unbalance some of the partitions. At a
1% sample, it's possible there were only a handful of large files, and
those became the bottleneck for completing the task.
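A hypothetical sketch of that size-based grouping (not the actual Dataset
code): greedily assign each S3 object, largest first, to the currently
smallest group, so groups come out approximately equal in total bytes.

```python
def group_by_size(files, num_groups):
    """files: list of (s3_key, size_in_bytes). Greedy balancing:
    largest object first, each placed into the smallest group so far.
    Illustrative only, not the moztelemetry implementation."""
    groups = [{"keys": [], "bytes": 0} for _ in range(num_groups)]
    for key, size in sorted(files, key=lambda f: f[1], reverse=True):
        smallest = min(groups, key=lambda g: g["bytes"])
        smallest["keys"].append(key)
        smallest["bytes"] += size
    return groups

# At a 1% sample the listing might contain a couple of huge objects and
# many tiny ones; the groups holding the huge objects dominate runtime
# no matter how the tiny ones are distributed.
files = [("release-a", 1000), ("release-b", 900)] + \
        [("nightly-%d" % i, 1) for i in range(10)]
groups = group_by_size(files, num_groups=4)
```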

Ideally the S3 objects would be better balanced size-wise, but since we
need to guarantee an upper bound on the latency before data becomes
available for processing, we must eventually flush accumulated data to
long-term storage even if we've only seen one record for a given set of
dimensions.

This definitely impacts cluster efficiency, and should improve as we
move towards more "direct to parquet" outputs[2]. That will let us
partition by fewer dimensions on S3 and instead take advantage of
Parquet's ability to efficiently scan and filter the data directly. The
good news is that the more data you read from S3, the more balanced the
reading becomes, so at least this particular problem is worst in the
small case, not the large case.

Thanks for the report!

Mark

[1]
https://github.com/mozilla/python_moztelemetry/blob/master/moztelemetry/dataset.py#L185
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1304412

On Tue, Mar 14, 2017 at 3:23 PM, Eric Rescorla <ekr at mozilla.com> wrote:

> Hi folks,
>
> You might be interested in the following notebook:
> https://gist.github.com/ekr/5dbd14316554c87ebecf49dec3c2b543
>
> The behavior I see in the last cell (un-named) is that it takes 2 minutes
> (there are about 750k records) but the progress indicator jumps up to
> 636 processes done and then grinds away until it hits 640. I don't know
> if this is too slow or not (it seems a bit slow) but this seems indicative
> that maybe we're not making good use of the cluster, so I thought I
> would mention it.
>
> -Ekr
>
>
>
>
> _______________________________________________
> fhr-dev mailing list
> fhr-dev at mozilla.org
> https://mail.mozilla.org/listinfo/fhr-dev
>
>