[Hindsight] Repeated messages in elastiseach output

Tomas Barton barton.tomas at gmail.com
Tue Nov 29 09:51:19 UTC 2016

Thanks Trink. What do you mean by #inputs?

I'm fetching messages from single topic which has 8 partitions. I've tried
several configurations while #processes <= #partitions

> If all of your input plugin configs are using a group.id of 'logs-0'
Yes, all processes uses the same configuration.

So far I was able to verify that using single consumer process per topic
works fine. Probably the issue is not on the side of output
(Elasticsearch). I wasn't able to find any duplicate uuid in the index
(which is strange, I'll try to investigate that further). However I've
noticed certain "avalanches" of messages probably caused by consumers'
restart - normally the throughput is pretty much constant in given time
window but during some period it was at least 10-100x higher than usual and
after cca 30 minutes data stopped to show up in Kibana (probably Elastic
was choking up).

I'm a bit suspicious about consumer's reset behavior (default, haven't
changed that yet):

auto.offset.reset = largest

I'm using a task scheduler (Mesos) for running consumer jobs. After restart
a consumer might be running on a different node and it's supposed to
continue where its predecessor left. How does kafka identify consumers?
I've noticed that broker is storing consumer's IP and client.id. Is it
possible that after consumer restart unflushed messages would be read again
by another consumer (or maybe even by multiple consumers)?

Is it possible to enforce some kind of consumer identification which would
ensure that e.g. consumer-1 would be considered as the same no matter which
IP it's running on?


On 28 November 2016 at 23:17, Michael Trinkala <mtrinkala at mozilla.com>

> Our production data warehouse loaders use a balanced consumer group
> without any message duplication.  If all of your input plugin configs are
> using a group.id of 'logs-0' and you have #inputs <= #topics then
> everything should be working fine.  I would start by dumping the Kafka
> consumer groups/offsets it thinks it is processing.  If you provide those
> dumps and the full configs there will be a better chance to diagnose the
> issue.
> Thank,
> Trink
> On Mon, Nov 28, 2016 at 1:28 AM, Tomas Barton <barton.tomas at gmail.com>
> wrote:
>> Hi,
>> I'm trying to configure Hindsight to pull messages from Kafka and store
>> then to Elasticsearch. I'm using Hindsight 0.12.7 and latest version of all
>> other modules.
>> Before using Hindsight the number of messages per day in given topic was
>> around 10M now the number of messages is at least doubled when 2 Hindsight
>> consumers are used.
>> Probably I misunderstood consumer group concept. Earlier I was using 1
>> consumer per topic partition. The configuration is pretty much default:
>> -- In balanced consumer group mode a consumer can only subscribe on
>> topics, not topics:partitions.
>> -- The partition syntax is only used for manual assignments (without
>> balanced consumer groups).
>> topics                  = {"logs"}
>> -- https://github.com/edenhill/librdkafka/blob/master/CONFIGURA
>> TION.md#global-configuration-properties
>> consumer_conf = {
>>     ["group.id"] = "logs-0", -- must always be provided (a single
>> consumer is considered a group of one
>>     -- in that case make this a unique identifier)
>>     ["message.max.bytes"] = output_limit,
>> }
>> -- https://github.com/edenhill/librdkafka/blob/master/CONFIGURA
>> TION.md#topic-configuration-properties
>> topic_conf = {
>>     -- ["auto.commit.enable"] = true, -- cannot be overridden
>>     -- ["offset.store.method"] = "broker, -- cannot be overridden
>> }
>> Now I have multiple consumers within the same consumer group, according
>> to Kafka documentation:
>> Kafka will deliver each message in the subscribed topics to one process
>>> in each consumer group.
>> So having all processes in one group seems to be a good idea when each
>> message is supposed to be stored just once. But in reality it looks like
>> each process in the same consumer groups is reading all the messages.
>> Btw. Kafka version is and each topic's partition seems to be
>> owned by some consumer.
>> Any idea what could be wrong?
>> Thanks,
>> Tomas
>> _______________________________________________
>> Hindsight mailing list
>> Hindsight at mozilla.org
>> https://mail.mozilla.org/listinfo/hindsight
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/hindsight/attachments/20161129/2e9d7d6d/attachment-0001.html>

More information about the Hindsight mailing list