[Hindsight] Repeated messages in Elasticsearch output

Michael Trinkala mtrinkala at mozilla.com
Tue Nov 29 17:58:57 UTC 2016

#inputs == number of HS input plugins in a single group that you have
reading from a topic.

It sounds more like your Kafka server configuration. Have you altered
any of the retention configuration values on the Kafka server? They can
affect how long offsets are stored after the consumer group goes away
(although in this context that should cause a loss of messages, not
duplicates). I would also look at the offset flush intervals and
timeouts, and make sure updates are being propagated, with the Kafka
consumer group command tool. At this point all offset management is
controlled by Kafka and the high-level consumer and is out of the
plugin's control, but you can enable Kafka consumer logging in the
sandbox (and run HS at log level 7) to get more insight into what is
happening.
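The checks above can be sketched as shell commands; the group name
'logs-0' comes from this thread, while the broker address and the
Hindsight config path are assumptions:

```shell
# Dump the group's committed offsets and lag while Hindsight is running;
# repeated runs should show the offsets advancing as commits are flushed.
# On older, ZooKeeper-based deployments use --zookeeper <host:2181>
# instead (exact option names vary across Kafka versions).
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
    --describe --group logs-0

# Restart Hindsight at log level 7 (debug) so the Kafka consumer logging
# enabled in the sandbox is actually emitted. The cfg path is an assumption.
hindsight hindsight.cfg 7
```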

We are successfully moving up to 500MM messages a day through consumer
groups, we have restarted some systems, and are not seeing the duplication
you describe.


On Tue, Nov 29, 2016 at 1:51 AM, Tomas Barton <barton.tomas at gmail.com> wrote:

> Thanks Trink. What do you mean by #inputs?
> I'm fetching messages from a single topic which has 8 partitions. I've
> tried several configurations with #processes <= #partitions.
> > If all of your input plugin configs are using a group.id of 'logs-0'
> Yes, all processes use the same configuration.
> So far I was able to verify that using a single consumer process per topic
> works fine. The issue is probably not on the side of the output
> (Elasticsearch); I wasn't able to find any duplicate uuid in the index
> (which is strange, I'll try to investigate that further). However, I've
> noticed certain "avalanches" of messages, probably caused by consumer
> restarts - normally the throughput is pretty much constant in a given time
> window, but during some periods it was at least 10-100x higher than usual,
> and after about 30 minutes data stopped showing up in Kibana (probably
> Elasticsearch was choking).
> I'm a bit suspicious about the consumer's reset behavior (default, I
> haven't changed it yet):
> auto.offset.reset = largest
> I'm using a task scheduler (Mesos) for running consumer jobs. After a
> restart a consumer might be running on a different node, and it's supposed
> to continue where its predecessor left off. How does Kafka identify
> consumers? I've noticed that the broker stores the consumer's IP and
> client.id. Is it possible that after a consumer restart unflushed messages
> would be read again by another consumer (or maybe even by multiple
> consumers)?
> Is it possible to enforce some kind of consumer identity which would
> ensure that e.g. consumer-1 is considered the same no matter which IP
> it's running on?
> Tomas
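On the identification question above: as far as I understand the balanced
consumer protocol, membership is keyed by group.id plus a broker-assigned
member id (derived from client.id and a generated suffix), not by IP, and
committed offsets belong to the group, so a restarted consumer on a
different node resumes from the group's last committed offset; anything
consumed but not yet committed is re-read (at-least-once delivery). The
live members and their client ids can be inspected with the group tool
(broker address is an assumption):

```shell
# The describe output includes CONSUMER-ID, HOST and CLIENT-ID columns,
# showing which member currently owns each partition of the topic.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
    --describe --group logs-0
```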
> On 28 November 2016 at 23:17, Michael Trinkala <mtrinkala at mozilla.com>
> wrote:
>> Our production data warehouse loaders use a balanced consumer group
>> without any message duplication.  If all of your input plugin configs are
>> using a group.id of 'logs-0' and you have #inputs <= #partitions then
>> everything should be working fine.  I would start by dumping the Kafka
>> consumer groups/offsets it thinks it is processing.  If you provide those
>> dumps and the full configs there will be a better chance to diagnose the
>> issue.
>> Thanks,
>> Trink
>> On Mon, Nov 28, 2016 at 1:28 AM, Tomas Barton <barton.tomas at gmail.com>
>> wrote:
>>> Hi,
>>> I'm trying to configure Hindsight to pull messages from Kafka and store
>>> them in Elasticsearch. I'm using Hindsight 0.12.7 and the latest versions
>>> of all other modules.
>>> Before using Hindsight the number of messages per day in the given topic
>>> was around 10M; now the number of messages is at least doubled when 2
>>> Hindsight consumers are used.
>>> I've probably misunderstood the consumer group concept. Earlier I was
>>> using 1 consumer per topic partition. The configuration is pretty much
>>> default:
>>> -- In balanced consumer group mode a consumer can only subscribe on
>>> topics, not topics:partitions.
>>> -- The partition syntax is only used for manual assignments (without
>>> balanced consumer groups).
>>> topics                  = {"logs"}
>>> -- https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md#global-configuration-properties
>>> consumer_conf = {
>>>     ["group.id"] = "logs-0", -- must always be provided (a single consumer
>>>     -- is considered a group of one; in that case make this a unique
>>>     -- identifier)
>>>     ["message.max.bytes"] = output_limit,
>>> }
>>> -- https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md#topic-configuration-properties
>>> topic_conf = {
>>>     -- ["auto.commit.enable"] = true, -- cannot be overridden
>>>     -- ["offset.store.method"] = "broker", -- cannot be overridden
>>> }
>>> Now I have multiple consumers within the same consumer group. According
>>> to the Kafka documentation:
>>>> Kafka will deliver each message in the subscribed topics to one process
>>>> in each consumer group.
>>> So having all processes in one group seems like a good idea when each
>>> message is supposed to be stored just once. But in reality it looks like
>>> each process in the same consumer group is reading all the messages.
>>> Btw. Kafka version is and each topic's partition seems to be
>>> owned by some consumer.
>>> Any idea what could be wrong?
>>> Thanks,
>>> Tomas
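One quick sanity check for the setup described above: confirm the topic's
partition count and look for a stray second consumer group (e.g. a
mistyped group.id), which would re-read every message. The topic name
'logs' is from the config above; the addresses are assumptions:

```shell
# Describe the topic to verify it really has the expected 8 partitions.
# Older Kafka releases take --zookeeper <host:2181> here instead of
# --bootstrap-server.
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic logs

# List all known consumer groups to spot an unexpected second group.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
```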
>>> _______________________________________________
>>> Hindsight mailing list
>>> Hindsight at mozilla.org
>>> https://mail.mozilla.org/listinfo/hindsight