[Hindsight] Parquet example

Michael Trinkala mtrinkala at mozilla.com
Wed Feb 21 20:36:58 UTC 2018


External S3 uploader
https://gist.github.com/trink/399e8b923bcbc7095afba1ba0870d10a
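The uploader linked above is the actual external process; purely as a rough illustration (not Mozilla's implementation), any such process just has to walk the batch directory and ship each finished file. A minimal Python sketch with a pluggable upload function — all names here are hypothetical, and it assumes the writer only exposes closed files:

```python
import os

def find_batches(batch_dir):
    """Return finished parquet files under batch_dir, oldest first.
    Assumes (hypothetically) that only closed, upload-ready files
    carry the .parquet extension."""
    paths = []
    for root, _dirs, names in os.walk(batch_dir):
        for name in names:
            if name.endswith(".parquet"):
                paths.append(os.path.join(root, name))
    return sorted(paths, key=os.path.getmtime)

def upload_batches(batch_dir, upload_fn):
    """Upload every finished file, deleting each only after success.
    upload_fn(path) would wrap e.g. an `aws s3 cp` invocation or an
    S3 client call; on failure it raises and the file is kept for retry."""
    uploaded = []
    for path in find_batches(batch_dir):
        upload_fn(path)
        os.remove(path)  # remove only once the upload succeeded
        uploaded.append(path)
    return uploaded
```

A production uploader would additionally need retry/backoff and some guard against racing the writer; this sketch only shows the basic scan-upload-delete loop.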

Trink

On Wed, Feb 21, 2018 at 9:36 AM, Madhukar Thota <madhukar.thota at gmail.com>
wrote:

> Thanks. One last question: is the uploader process open source?
> If not, I will try to combine telemetry_s3.lua with s3_parquet for my
> use case.
>
>
> On Wednesday, February 21, 2018, Michael Trinkala <mtrinkala at mozilla.com>
> wrote:
>
>> - The process is outside of Hindsight
>> - Yes, that uploader works fine, but since we have an external process we
>> didn't add it to s3_parquet
>>
>> Trink
>>
>> On Wed, Feb 21, 2018 at 5:39 AM, Madhukar Thota <madhukar.thota at gmail.com
>> > wrote:
>>
>>> Thanks Michael.
>>>
>>> fyi: This writes to a local disk queue; we have a separate process that performs the actual S3 upload.
>>>
>>> Is this process part of Hindsight or some other process outside of
>>> Hindsight?
>>>
>>> Is it possible to use something like this with parquet:
>>> https://github.com/mozilla-services/data-pipeline/blob/master/hindsight/output/telemetry_s3.lua
>>>
>>> -Madhu
>>>
>>>
>>> On Tue, Feb 20, 2018 at 11:54 AM, Michael Trinkala <
>>> mtrinkala at mozilla.com> wrote:
>>>
>>>> Here is an example of how we write a Heka message to parquet. fyi: This writes to a local disk queue; we have a separate process that performs the actual S3 upload.
>>>>
>>>> -- -*- lua -*-
>>>> filename        = "s3_parquet.lua"
>>>> message_matcher = "Type == 'telemetry' && Logger == 'telemetry'"
>>>> preserve_data   = false
>>>> ticker_interval = 60
>>>>
>>>> parquet_schema_file = "<%= @heka_schema_path %>/telemetry/telemetry_payload_size.1.parquetmr.txt"
>>>>
>>>> metadata_group = nil
>>>> json_objects = nil
>>>> s3_path_dimensions  = {
>>>>     {name = "submission_date_s3", source = "Timestamp", dateformat = "%Y%m%d"},
>>>> }
>>>>
>>>> batch_dir           = "<%= @s3_buffer_dir_disk %>/telemetry-payload-size-parquet/v1"
>>>> max_writers         = 5
>>>> max_rowgroup_size   = 10000
>>>> max_file_size       = 1024 * 1024 * 300
>>>> max_file_age        = <%= @max_file_age %>
>>>> hive_compatible     = true
>>>>
>>>>
>>>> -- parquet schema
>>>> message telemetry_payload_size {
>>>>     required int64 Timestamp;
>>>>     required int64 size;
>>>>     required group Fields {
>>>>         required binary appBuildId (UTF8);
>>>>         required binary appUpdateChannel (UTF8);
>>>>         required binary docType (UTF8);
>>>>     }
>>>> }
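For reference, the s3_path_dimensions entry in the config above maps the message Timestamp (nanoseconds since the Unix epoch, per the Heka message convention) through a strftime pattern to produce the date partition value. A minimal Python sketch of that mapping — the helper name is made up for illustration:

```python
from datetime import datetime, timezone

def submission_date_s3(timestamp_ns):
    """Format a nanosecond epoch timestamp as the %Y%m%d partition value,
    using UTC, mirroring the s3_path_dimensions entry in the config."""
    ts = datetime.fromtimestamp(timestamp_ns // 10**9, tz=timezone.utc)
    return ts.strftime("%Y%m%d")
```

So a message stamped 2018-02-21 UTC would land under a submission_date_s3=20180221 path component (assuming hive_compatible = true yields key=value partition directories).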
>>>>
>>>>
>>>> On Thu, Feb 15, 2018 at 4:46 PM, Madhukar Thota <
>>>> madhukar.thota at gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Is there any example of sending syslog data from Kafka to S3 in
>>>>> parquet format using Hindsight?
>>>>>
>>>>> This is what I am trying to achieve:
>>>>>
>>>>> syslog --> hindsight --> Kafka --> hindsight --> s3 (parquet format).
>>>>>
>>>>> Thanks,
>>>>> Madhu
>>>>>
>>>>> _______________________________________________
>>>>> Hindsight mailing list
>>>>> Hindsight at mozilla.org
>>>>> https://mail.mozilla.org/listinfo/hindsight
>>>>>
>>>>>
>>>>
>>>
>>

