Pipeline Developer Forum

Metrics, Logging and Data Collection
Thursday, Sep 12, 2019

We have touched on this topic briefly before in our Python at Scale and CI/CD forums, but it was nice to sit down and dig deeper into the strategies, technologies, and details of implementation.

Logging

This is as good a place to start as any: logging is accessible in nearly every programming language, so it is something almost every application and library does to some degree. That being said, in general we see two distinct logging sources in a studio environment: pipeline logging and service logging. You could also say that pipeline logs are client- or desktop-side logs, whereas service logs are generated by long-running remote services.

Our conversation on logging was largely a rapid-fire question and answer session, where we hashed out some best practices, or at least tips, for logging.
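None of the individual tips made it into these notes verbatim, but as a rough sketch of the kind of conventions that tend to come up for Python pipeline code (the `mystudio.rigging.export` logger name and the `do_export` helper are purely illustrative):

```python
import logging

# Pipeline modules and libraries take a named logger and attach only a
# NullHandler, so the host application stays in control of where output goes.
log = logging.getLogger("mystudio.rigging.export")  # illustrative name
log.addHandler(logging.NullHandler())

def do_export(path):
    """Stand-in for the real export logic; purely illustrative."""

def export_rig(path):
    # Lazy %-formatting: the message is only built if the record is emitted.
    log.info("Exporting rig to %s", path)
    try:
        do_export(path)
    except Exception:
        # log.exception records the full traceback at ERROR level.
        log.exception("Rig export failed for %s", path)
        raise

# Applications (not libraries) decide the format, level, and destination.
if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    export_rig("/tmp/hero_rig.ma")
```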

Error Reporting

The conversation on logging eventually turned into a more specific discussion on error capturing and reporting. Sentry came up quickly and repeatedly as an application worth checking out, but other products and in-house solutions exist around the same idea: automatically capturing errors from an application, and then reporting, organizing, monitoring, and tracking them without the user doing anything.

One of the unfortunate downsides of all of these systems is that they are largely based on the assumption that each application or service is one project that errors can be reported for. In our environment, however, this is rarely the case; we instead have a multitude of plugins and tools maintained by separate people or teams all running together in the same application. This means that the errors reported by a system like Sentry really need additional categorization and management to ensure that the right people see the right errors. That being said, collecting them is an excellent step one, and usually provides immediate and valuable information on the state of your runtime.
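As a hedged sketch only (the DSN placeholder, tag names, and helper are our own, not anything prescribed at the forum), one way to add that categorization with Sentry is to tag each event with the tool and team that own it:

```python
import sentry_sdk

# Capture uncaught exceptions from this process; the DSN is a placeholder.
sentry_sdk.init(dsn="https://publickey@example.ingest.sentry.io/0")

def report_tool_error(tool_name, owning_team, exc):
    """Hypothetical helper: attribute an error to the plugin that raised it,
    even though many plugins share one host application (Maya, Nuke, ...)."""
    with sentry_sdk.push_scope() as scope:
        scope.set_tag("tool", tool_name)
        scope.set_tag("team", owning_team)
        sentry_sdk.capture_exception(exc)
```

The `tool` and `team` tags can then drive whatever filtering, routing, or alerting rules make sense in the front end.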

The discussion on error reporting systems, though, also brought up an interesting discussion on what an error really is. It’s really easy to report uncaught exceptions automatically, but as any TD can tell you: there are an infinite number of “problems” that can arise which do not manifest as an exception. Given this, how do we collect the right information to diagnose and identify these problems? Can we even know what that set of relevant information is ahead of time?
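We didn’t land on an answer, but one common pattern is to treat these “soft” problems as reportable events with whatever context seems relevant; the thresholds and field names below are invented for illustration:

```python
import logging
import time

log = logging.getLogger("mystudio.publish")  # illustrative name

def check_publish_health(scene, mesh_count, started_at):
    """Nothing raised, but these are still problems worth surfacing."""
    duration = time.time() - started_at
    if mesh_count == 0:
        log.warning("Publish produced no meshes", extra={"scene": scene})
    if duration > 300:  # arbitrary five-minute threshold
        log.warning("Publish unusually slow: %.1fs", duration,
                    extra={"scene": scene, "mesh_count": mesh_count})
```

The hard part is exactly what the group raised: picking those thresholds and context fields before you know what the eventual problem will be.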

Metrics and Telemetry

Logging and error capturing both fall under this heading as well, but at this point we got a little deeper into our discussion on the data itself. Given a technology stack to collect information from the pipeline and services, what data do you actually collect? We had quite a polarized discussion on whether you should collect all the information that you can, or whether you should not collect any metric until you understand how it will be applied to a defined problem.

Without rehashing all the back and forth, we seemed to settle at least partially on an understanding that there are different types of data which you can collect. The following is not a formal definition, but tries to summarize the key use cases and scenarios that we could agree on:

Leaf Metrics

A leaf metric is one with important consequences regardless of its underlying cause or inputs. This information is not collected with the intention of diagnosing problems, but instead is collected to help identify when a problem might be occurring or when action needs to be taken.

A good example of this is application start-up time. Users can feel this when it changes, and being able to plot this over time will always be helpful because it gives you insight into the general usability of that application.
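As a hedged sketch (the metric name, the `statsd` client package, and a local StatsD agent are all assumptions), shipping start-up time as a leaf metric can be as small as:

```python
import time
import statsd  # assumes the `statsd` PyPI client and a StatsD agent on :8125

_started = time.perf_counter()  # module import marks "launch"

def report_startup_time():
    """Call once the main window is usable; the metric name is illustrative."""
    elapsed_ms = (time.perf_counter() - _started) * 1000.0
    statsd.StatsClient("localhost", 8125).timing("mytool.startup_ms", elapsed_ms)
```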

On the service side, a good example of this might be CPU utilization of a server. When a service is using too much CPU, you want to look at scaling out so that users are not affected, and this is entirely secondary to understanding why.
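The service-side equivalent might be a tiny sampler like the one below (`psutil` and `statsd` are real libraries; the gauge name and five-second interval are assumptions):

```python
import psutil  # third-party, commonly used for host-level metrics
import statsd  # assumes the `statsd` PyPI client and a StatsD agent on :8125

client = statsd.StatsClient("localhost", 8125)

while True:
    # cpu_percent(interval=5) blocks for five seconds, then returns overall CPU %.
    client.gauge("myservice.cpu_percent", psutil.cpu_percent(interval=5))
```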

Intermediate Metrics

On the other side of the coin, we start to get into domain-specific and arguably ambiguous metrics. These are the ones that come with a little bit of controversy, because they have a very real potential to be misinterpreted or misused when identifying or diagnosing problems in a system. All metrics have a complex set of input variables that determine their final value, but where leaf metrics have a clear and understood downstream impact, intermediate ones do not.

An example of this might be something like user idle time in an application window, where we measure how much time users spend in a given interface, and how much of that time they are not typing or using the mouse. It’s tempting to think, for example, that idle users are having difficulty understanding and using the interface, but maybe they just got up to go to the bathroom. Maybe the interface displays notes and it just so happens that this user gets longer notes or is a slow reader. We want to avoid drawing conclusions from this type of metric, because we can’t fully understand its true impact on the effectiveness and health of the system.
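For what it’s worth, collecting that kind of idle time often looks something like the Qt sketch below (the PySide2 classes are real; the 30-second threshold and the decision to watch a whole tool window are assumptions), which only underlines the point: the number cannot tell you why the user went quiet.

```python
from PySide2 import QtCore

class IdleTracker(QtCore.QObject):
    """Accumulates gaps with no key or mouse activity on whatever it watches."""

    IDLE_THRESHOLD_MS = 30 * 1000  # arbitrary: 30s of silence counts as idle

    def __init__(self, parent=None):
        super(IdleTracker, self).__init__(parent)
        self._clock = QtCore.QElapsedTimer()
        self._clock.start()
        self.idle_ms = 0

    def eventFilter(self, obj, event):
        if event.type() in (QtCore.QEvent.KeyPress,
                            QtCore.QEvent.MouseButtonPress,
                            QtCore.QEvent.MouseMove):
            gap = self._clock.restart()
            if gap > self.IDLE_THRESHOLD_MS:
                self.idle_ms += gap
        return False  # never swallow the event

# Usage in a Qt-based tool: tool_window.installEventFilter(IdleTracker(tool_window))
```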

The question, then, is: do you collect these intermediate metrics anyway? Does data rot with age? And this is where we don’t always agree. On the one hand you can never go back and collect data that you don’t have, but on the other hand you can so easily be misled by a piece of information that wasn’t collected in a carefully controlled environment with the purpose of exposing a specific aspect of a known problem or investigation. There is probably no one right answer to this, and we should approach each situation anew.

Maybe there is some kind of significant-digits rule for metrics, where at some point a value is too granular or specific, misrepresenting its accuracy or the stability of the context in which it was measured…
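Half-seriously, that might amount to nothing more than snapping values to a coarse bucket before storing them (the half-second bucket is entirely made up):

```python
def bucket_seconds(raw_seconds, bucket=0.5):
    """Round a duration to the nearest half second so it does not pretend to
    more precision than the measurement context supports."""
    return round(raw_seconds / bucket) * bucket

bucket_seconds(12.37)  # -> 12.5
```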

Knowledge is Power

Setting aside the intricacies for a minute, then, what would we love to know in a perfectly unambiguous world?

Accessibility

We touched on this briefly with errors, but the idea of data accessibility is really focused on ensuring that the right people have easy access to the right data at the right time and in a useful way. That is a lot of words, but it basically means that collecting the data is not enough. You can set up the best tech stack in the world and collect information from every corner of your pipeline, but unless people can effectively leverage that data to save time, solve problems, or improve workflows, it’s not worth a penny - and people aren’t going to use it if it’s too hard or they don’t know about it.

As a start to this, the discussion here was really around planning for the data. This means making sure that developers have put some thought into what they want to collect; how the collected data will be structured; and what the life-cycle of the data will be. Once this is approved and implemented, review it regularly and adjust the plan as necessary; make people responsible for the data that they collect. Without this, it’s easy to bloat a system with useless or badly configured metrics that are forgotten and never cleaned up. This also provides a base for documentation and convention, making the data more approachable to anyone who might have something to learn.
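What such a plan captures will vary from studio to studio, but a hedged sketch of the kind of record we had in mind (all field names hypothetical) might be:

```python
from dataclasses import dataclass

@dataclass
class MetricPlan:
    """Hypothetical 'contract' a developer files before a metric goes live."""
    name: str            # e.g. "maya.startup_ms"
    owner: str           # person or team responsible for the data
    description: str     # what it measures and why it exists
    retention_days: int  # when raw samples can be aged out or aggregated
    review_date: str     # when this plan itself should be revisited

plan = MetricPlan(
    name="maya.startup_ms",
    owner="pipeline-core",
    description="Wall-clock time from launch to first usable UI.",
    retention_days=365,
    review_date="2020-03-01",
)
```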

Once some good data is collected, a useful set of GUIs (graphical user interfaces), dashboards, or other tools to help people interact with the data is imperative. Many tools and web apps already exist, and should definitely be leveraged, but also don’t shy away from investing some time and energy into domain-specific interfaces for data that is used often and in context. For example, rig performance data is much more useful if it can be pulled up alongside a rig or rig selections right within Maya.
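Purely as an illustration (the metrics service, its `/rig-metrics` endpoint, and the response fields are all invented), pulling that data up in context inside Maya might look like:

```python
import json
from urllib.request import urlopen  # standard library; no extra dependencies

from maya import cmds  # only available inside Maya's Python interpreter

METRICS_URL = "http://metrics.example.com/rig-metrics"  # hypothetical service

def show_rig_metrics():
    """Print recent performance numbers for the currently selected rig nodes."""
    for node in cmds.ls(selection=True) or []:
        with urlopen("%s?rig=%s" % (METRICS_URL, node)) as response:
            data = json.load(response)
        print("%s: avg eval %.1f ms over %d scenes"
              % (node, data["avg_eval_ms"], data["scene_count"]))
```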
