Event Streaming

Feb 3, 2018 12:22 · 1266 words · 6 minute read data engineering event stream

Do you know what happens in your product?

When you’re part of building a web product it’s easy to forget that your experience is not everyone’s experience. This becomes painfully obvious when users hit bugs that you’ve never hit, take screenshots of ridiculous things your product does that you had no idea were possible, or describe your product to you in a way that seems entirely foreign.

Working at a search engine for a while taught me two important things: I don’t use the internet anywhere close to the way the vast majority of people do, and capturing information about the actions users take, to the best of your ability, is one of the only ways to figure that out.

Infer or catalog?

Don’t bake in your inference

One common pattern I see in attempting to figure out what people do in a product is what I call the “a priori” approach. It is characterized by a group of people (usually engineers and product folks) sitting around and thinking up all of the possible user sequences. It results in things like:

Well, if the user clicks button A and then navigates over here to pane B and then leaves the page they probably saw element C and didn’t like it so they navigated away. So let’s just measure every time they leave the site after clicking A if B shows up and C is on the page.

If you’re a data person you probably have this face on right now: Say What?

The above scenario is a good discussion to be having but often results in the following code:

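import logging

logger = logging.getLogger(__name__)

# Fire async event tracking.
# The flags, user_name and the idle check below are placeholders for
# whatever state your product exposes.
if user_clicked_A and user_went_to_B:
    # Check if user abandoned page
    if user_does_not_do_anything_for_X_time():
        logger.info(
            "User %s didn't like C after clicking A and going to B", user_name
        )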

This is a bit oversimplified, but teams are so scared of multiple testing, p-hacking, [insert scary stats thing here] that they often only engage in analytics efforts if they are hyper-specified. The problem is that unless the real world conforms exactly to your hypothetical situation, the info you get from this log signal is likely to give you a false sense of confidence in the outcome.

By approaching the problem this way you end up baking inference into the tracking code itself rather than keeping your tracking code as close to a factual record as possible and letting inference happen later on. My assertion is that it is almost always better to atomically track factual events than to instrument your product to do inference for you.

This is more than just a stated preference for how to deal with the above scenario. It’s a suggestion to view your product, from the data perspective, as a series of discrete events that result in a time series of structured event data. Or, as I put it in a talk once:

Your users are a collection of log lines

My perspective has evolved over the years and I have more of a UX brain now than I did then, but the sentiment remains. You can check out that talk here:

Build a catalog instead

Thinking of your product as a stream of events allows you to trade the problem of performing all of your inference up front for the problem of scaling an event collection and query system that works for your business. I’d rather have the second problem. Progress towards such a system results in code that looks more like:

# Code that fires anytime a user clicks anything
def click_capture(context):
    response = tracking_library.track(
        user_id=context.user_id,
        action='click',
        value=context.click_element,
        extra_info=context.extra,
    )
    return response

# Code that fires anytime a user navigates to a page
def page_capture(context):
    response = tracking_library.track(
        user_id=context.user_id,
        action='page_view',
        value=context.page_element,
        extra_info=context.extra,
    )
    return response

This code doesn’t directly answer the question posed by the team at all, but combined with a way to query the resulting events it can do that and more. Plus, it’s general enough that it empowers question askers to trade complex a priori reasoning for post hoc effort in expressing their questions in a way that can be answered by the copious amounts of data the system collects.
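To make the post hoc part concrete, here is a rough sketch of how the team's original question might be asked of the collected events after the fact. The Event record, its field names, and the "last two events" definition of leaving are all my assumptions for illustration; your own schema and definition of abandonment will differ.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    # Hypothetical event record; fields loosely mirror the track() calls above.
    user_id: str
    action: str       # e.g. 'click' or 'page_view'
    value: str        # the element clicked or the page viewed
    timestamp: float  # seconds since some reference point


def users_who_left_after_a_then_b(events: Iterable[Event]) -> List[str]:
    """Return users whose final two events were: click on 'A', then view of 'B'."""
    by_user: Dict[str, List[Event]] = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        by_user.setdefault(event.user_id, []).append(event)

    matched = []
    for user_id, history in by_user.items():
        if len(history) < 2:
            continue
        second_last, last = history[-2], history[-1]
        clicked_a = (second_last.action, second_last.value) == ("click", "A")
        viewed_b = (last.action, last.value) == ("page_view", "B")
        if clicked_a and viewed_b:
            matched.append(user_id)
    return sorted(matched)


events = [
    Event("u1", "click", "A", 1.0),
    Event("u1", "page_view", "B", 2.0),
    Event("u2", "click", "A", 1.0),
    Event("u2", "page_view", "B", 2.0),
    Event("u2", "click", "D", 3.0),  # u2 kept going, so they did not leave
]
print(users_who_left_after_a_then_b(events))  # ['u1']
```

The point is that the inference (what counts as "leaving", which elements matter) lives in the query, so you can change your mind about it without redeploying any tracking code.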

Easier than you think

But Chris…I don’t have a team of a billion data engineers to create this infrastructure for me!

While it’d be great to have a large and capable data team instrumenting your product, verifying every nook and cranny of the data, reliably delivering it to you quickly, and creating simple but powerful ways to access it, that doesn’t mean you can’t start simple and iteratively improve in the absence of this nirvana.

Something plug and play like Google Analytics can get you started and at least let you start collecting a list of things you’d like to know about your product but don’t.

Once you commit to explicitly instrumenting your product you can either use a third-party service (like Segment) or roll your own.

In general you need four things:

A client library to send events
An endpoint that receives the events
A place to store those events
A way to query those events
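To show how small the first version of those four pieces can be, here is a toy sketch using only the Python standard library. Everything in it is an assumption for illustration: the function names, the table layout, and the use of in-memory SQLite in place of a real store. A real client would also send the payload over HTTP rather than calling the endpoint function directly.

```python
import json
import sqlite3
import time

# 3. A place to store events: an in-memory SQLite table (stand-in for a real store).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_id TEXT, action TEXT, value TEXT, ts REAL)")


def receive(payload: str) -> None:
    """2. The 'endpoint': parses a JSON payload and writes it to storage."""
    event = json.loads(payload)
    db.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        (event["user_id"], event["action"], event["value"], event["ts"]),
    )


def track(user_id: str, action: str, value: str) -> None:
    """1. The 'client library': serializes an event and hands it to the endpoint."""
    payload = json.dumps(
        {"user_id": user_id, "action": action, "value": value, "ts": time.time()}
    )
    receive(payload)  # a real client would POST this instead


def count_by_action() -> dict:
    """4. A way to query: number of stored events per action."""
    rows = db.execute("SELECT action, COUNT(*) FROM events GROUP BY action")
    return dict(rows.fetchall())


track("u1", "click", "signup_button")
track("u1", "page_view", "/pricing")
track("u2", "click", "signup_button")
print(count_by_action())
```

None of these pieces would survive real traffic, but each one can be swapped out on its own as the system grows.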

Often I see people try to repurpose logging for event tracking, and while you can do that, I’d encourage people to consider two distinct types of information capture in the product: raw logging vs. structured events. Logging is there to capture information about the state, function, and behavior of your system, while events are supposed to capture structured information about actions and information that directly relate to the user experience.
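A quick sketch of the difference, using the standard logging and json modules. The logger name, the log message, and the event fields are made up for illustration.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("checkout")

# Raw logging: free-form text about what the system itself is doing.
logger.info("Payment service responded in 212 ms")


def build_event(user_id: str, action: str, value: str, ts: float) -> str:
    """Structured event: fixed, machine-readable fields describing a user action."""
    return json.dumps(
        {"user_id": user_id, "action": action, "value": value, "ts": ts},
        sort_keys=True,
    )


print(build_event("u1", "click", "pay_button", 1517660520.0))
```

The log line is written for a human debugging the system. The event is written for a query engine, so every record has the same fields and you never have to parse prose to answer a question.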

Start early

Waiting until you have the exact system sorted out with the perfect schema and ideal workflow is a good way to avoid ever getting any positive benefit out of your system. An event stream is one of those things that, without someone experienced around, is hard to justify prior to building it. Your company won’t know what kind of questions it can ask about event data as they’ve likely never had access to that kind of thing on a regular basis.

Plus, the absence of an event stream combined with a priori reasoning can give a company a negative outlook on the value of tracking their product, as every question seems like a herculean effort to answer and analysis results tend to be unsatisfying.

In my experience, it’s worth taking a chance on creating such a system as it’s very easy to get going in the first place. Early on you will begin to ask fundamental questions about your product and the most basic ones like:

How many users did something with the product yesterday?

can drive more impactful ones like:

Given that a user signed up on a free trial 8 days ago and hasn’t engaged with features A or B but routinely is a top user of feature C, how likely is it this user will convert to a paid plan in the next 3 days?

Having the ability to answer the first question is table stakes for understanding your business and engineering needs/performance. The ability to answer the second could be a competitive differentiator for your company over the long haul.
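As a rough sketch of how little it takes to answer the first question once events exist, assume each event is just a (user_id, timestamp) pair and that "yesterday" means a UTC calendar day. Both of those are my assumptions, not a prescribed schema.

```python
from datetime import date, datetime, timedelta, timezone
from typing import Iterable, Tuple


def active_users_on(events: Iterable[Tuple[str, datetime]], day: date) -> int:
    """Count distinct users with at least one event on the given UTC day."""
    return len(
        {
            user_id
            for user_id, ts in events
            if ts.astimezone(timezone.utc).date() == day
        }
    )


today = date(2018, 2, 3)
yesterday = today - timedelta(days=1)
events = [
    ("u1", datetime(2018, 2, 2, 9, 0, tzinfo=timezone.utc)),
    ("u1", datetime(2018, 2, 2, 17, 30, tzinfo=timezone.utc)),
    ("u2", datetime(2018, 2, 2, 23, 59, tzinfo=timezone.utc)),
    ("u3", datetime(2018, 2, 3, 0, 1, tzinfo=timezone.utc)),
]
print(active_users_on(events, yesterday))  # 2
```

The second question needs far more than this (a model, for a start), but it is built from the same raw material.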

Storage is cheap. Start instrumenting your product today and trade the problem of scaling your data collection and querying it for the ability to have deep information about what it is that users do with your product. Trust me…they don’t use it like you think they do, and if you ask them they don’t have the ability, time, or incentive to explain it to you anyway.