Feedback requested on Services Metrics proposal

Rob Miller rmiller at mozilla.com
Fri Oct 14 12:04:07 PDT 2011


On 10/13/11 8:13 PM, Richard Newman wrote:
> (Randomly diving in to parts of this. Sorry!)
>
>> * Regarding timers, I really like the concept but wonder if it is ideal to release one timer event instead of 2 distinct "start" and "end" events, as they are produced. The reason I like distinct "point" events is that they are emitted immediately and thus can be consumed by near real-time monitors for more rapid reaction, potentially while an event is in-flight. They also have benefits for replaying a system's behavior directly from the log stream, without needing to muck with the message timeline. You also don't have to maintain as much state in the producer, just an ID so a downstream system can correlate the start and end events for a particular pair. The downside is downstream systems need to pair up the events and there is overhead of an extra event emitted. But, I think the benefits are compelling.
>
> I concur. To add to Greg's thought:
>
> If the timer's enclosed code doesn't terminate within a reasonable amount of time, logging explicit start and end pairs (or just the start!) is waaay more useful than waiting to log until the code finishes!
>
> The term we used at Tellme for aggregating raw log events into more structured data was "sessionizing". Start and end correlation was one part of this. I presume that our metrics infrastructure has some similar capability for arbitrary stream processing of events. As you approach this problem of generating, delivering, and storing these raw log messages, it's worth thinking about the inevitable analysis layer that goes on top.

This is a great point.  As described in the proposal, so far we've 
identified 3 concrete back ends:

- statsd (for counter and timer events)
- sentry (for errors)
- bagheera / hadoop (for everything else)

The idea was that hadoop gives us the ability to write arbitrary 
map-reduce jobs or Hive queries to do any sort of analysis that we want. 
And, as real world usage informs our needs, we might add more back 
ends, or we might decide to structure some of our messages in a way that 
makes the after-the-fact analysis we want to do a bit easier.

If anyone has ideas for additional back ends that we know we're likely 
to want, or for additional specific message types that might inform how 
we structure things, please let me know.

> The concept of a start and end is much more general than "time this block of code". What about logging the various events submitted during the J-PAKE flow, for example? No single block of code to time, and each part might want to be analyzed separately and in aggregate. From this I infer that analysis will end up having to do some kind of event interpretation regardless, so granular events emitted from `timer` aren't really much more costly than a single event.

Right.  And there's absolutely nothing preventing us from sending 
"start" and "end" events, and using them to do any kind of 
post-processing we want to do.  It's just that there's nothing 
particularly noteworthy about these events... they'd just be regular 
"metlog" calls, with type values of "start" and "finish", or similar.  I 
don't see a need for syntactic sugar.
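
To make that concrete, here is a rough Python sketch of "start" and 
"finish" sent as ordinary messages and paired up afterwards.  Everything 
in it is an assumption for illustration: the metlog() signature, the 
in-memory EVENTS list standing in for the transport, the payload field 
names, and the pair_events() helper are all hypothetical, not the 
proposed API.

```python
import time
import uuid

EVENTS = []  # stand-in for the real message transport (assumption)


def metlog(type, payload, timestamp=None):
    """Hypothetical generic logging call: record one message."""
    EVENTS.append({
        "type": type,
        "timestamp": time.time() if timestamp is None else timestamp,
        "payload": payload,
    })


def pair_events(events):
    """Correlate "start" and "finish" messages that share a payload id.

    Returns (durations, unfinished): a dict of id -> elapsed seconds for
    completed pairs, and the set of ids that started but never finished.
    """
    starts = {}
    durations = {}
    for ev in events:
        key = ev["payload"].get("id")
        if ev["type"] == "start":
            starts[key] = ev["timestamp"]
        elif ev["type"] == "finish" and key in starts:
            durations[key] = ev["timestamp"] - starts.pop(key)
    return durations, set(starts)


# Explicit timestamps keep the example deterministic.
flow_id = uuid.uuid4().hex
metlog("start", {"id": flow_id, "name": "jpake"}, timestamp=100.0)
metlog("start", {"id": "stuck", "name": "jpake"}, timestamp=101.0)
metlog("finish", {"id": flow_id, "name": "jpake"}, timestamp=100.25)

durations, unfinished = pair_events(EVENTS)
```

The "stuck" id ends up in unfinished, which is the case Richard 
describes: a start with no matching end is still visible downstream.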

"Time this piece of code" IS given some syntactic sugar, however, 
because a) it's something that is often requested, b) setting up 
threadsafe timers is non-trivial, and c) statsd explicitly supports 
messages that say "this activity took N ms to complete", and from those 
messages it will automatically generate 90th percentile, average, and 
lower and upper bounds for the duration value.
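
For illustration, the sugar could look something like the Python sketch 
below.  The timer() context manager, its send and clock parameters, and 
the format_timing() helper are hypothetical names, not the proposed API; 
the only thing taken from statsd itself is the "<bucket>:<value>|ms" 
line format it accepts for timers.  Keeping the start time in a local 
variable means concurrent callers never share timer state.

```python
import time
from contextlib import contextmanager


def format_timing(name, elapsed_ms):
    """Build a statsd timer line: '<bucket>:<value>|ms'."""
    return "%s:%d|ms" % (name, elapsed_ms)


@contextmanager
def timer(name, send, clock=time.monotonic):
    """Time the enclosed block and hand one statsd timer line to send.

    send is any callable that accepts the formatted line (assumption);
    clock is injectable so the timing can be tested deterministically.
    """
    start = clock()  # local to this call, so nothing is shared across threads
    try:
        yield
    finally:
        # Emit even if the block raises, so failures are still timed.
        elapsed_ms = int(round((clock() - start) * 1000))
        send(format_timing(name, elapsed_ms))


# Usage: collect the lines in a list instead of sending them over UDP.
sent = []
with timer("sync.storage.get", sent.append):
    time.sleep(0.01)
```

Note that the single message only appears once the block exits, which is 
the trade-off discussed above.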

All in all, I'd say we're in violent agreement.  :)

-r

