Gain insight into our Observability Tools

Dominik Roos
Posted by Dominik Roos on 27 May, 2021
Observability SCION

In this series of blogs or “dev journal”, we will cover some issues, challenges, and of course, small victories that we encounter while developing tools, services and features for SCION. Hopefully, this will grant you insights into our development processes and let you in on some tips and tricks we picked up along the way.

In this first installment, we will give a short introduction to observability, what observability tools we are using as part of our development process, and how our tooling has evolved over time.

SCION: One network, many participants

At Anapaya, we are committed to developing and delivering new features for our SCION services while making the services more robust at the same time. We, as engineers, often face the challenges that a distributed system inherently brings along during development and testing. While, in SCION, we are not running any large distributed system inside our Autonomous Systems (ASes) in the traditional sense, we are running a substantial amount of ASes as part of our testing and production topologies. This leads to multiple different processes interacting with each other over the network. As a result, there is a collective state in the system that cannot be observed by inspecting one single process on its own. For us, this means that when developing, we quickly reach the limits of traditional debugging techniques like step-through debugging or printf() debugging. This makes it quintessential for us to use tools that grant us insights into the many concurrently running processes in a centralized way.

In other words, we need observability in the running system. In the following, we will describe what observability is and how a system can be instrumented to be observable.

Observability Tools to the rescue

The term observability originated in control theory. It is a measure of how well a system's internal state can be inferred from the externally observable outputs. Observability has a similar meaning in software. Essentially, it is the art of making the software's internal state observable such that questions about performance, process interactions and causes of failure can be answered. The underlying goal is to optimize and streamline any potential investigation process.

Observability has become an increasingly popular concept in recent years. With the rise of modern distributed applications, a suite of observability tools has been developed to support software engineers and operators to gain insight into these highly complex, multi-layered architectures. These insights are crucial to understanding causes of failure and performance bottlenecks. They are an invaluable tool for operators that respond to incidents, and they make it possible for software engineers to assess the most taxing bottlenecks and address them accordingly.

Traditionally, the observability space can be divided into "three pillars of observability":

The first pillar of observability is founded by metrics, which capture and describe quantitative data about running application processes. Metrics are a collection of statistics and aggregates that enable engineers to understand the overall behavior of the system and its usages. Metrics are typically collected and stored in time series which makes it easy to analyze and query them.

Some examples of questions that can be understood using the collected metrics are:

  • What is the average request rate?
  • What is the average request latency?
  • What is the distribution of the request latencies?
  • What is the current memory usage of a given process?
  • How many SCION paths are alive between two ASes?
  • How much traffic is sent per SCION interface?

Metrics are also fundamental to monitor the health of a system and create alerts when unexpected characteristics are observed. For example, metrics can demonstrate unusual request latencies, atypical resource usages, or increased error rates.

The second pillar of observability is logging, which captures discrete events and makes them accessible as text or structural data. A discrete event could for example be an incoming request on a web server or a failed database access. Logging helps engineers comprehend what events occurred when running an application. Where and how these logs are stored depends on the use case. Traditionally, logging events have been stored in log files on the local file system. Alternatively, log events can be aggregated centrally which makes them easier to access, search, and correlate. There are many different flavors of logging, ranging from quick print statements that are introduced in a debugging session, to more permanent logging statements that respect logging conventions.

The final pillar of observability is distributed tracing, which captures the flow of requests through different applications by capturing certain events and collecting them with the right metadata, correlating these events across the whole system. As a result, it is possible to track exactly what each individual application is doing while processing a given request. This makes it possible to ask and answer questions about individual interactions, contrary to metrics that operate on aggregates.

Our observability stack for development

In order to get a fast feedback loop that guides us when developing our software, we use a suite of tests that operates at different levels. From unit tests, which test individual code units, over integration tests, which tests whether different components integrate correctly with each other, to full-fledged quality assurance tests which operate in a simulated production environment.

As part of this test suite, we have a set of tests that operate on a lightweight docker-compose-based SCION topology. In this topology, we run all the control and data plane SCION services as docker containers. They are connected through the regular docker networking stack. This allows us to abstract away the more annoying details that we need to model for our full-fledged topologies. For example, we do not need to worry about the operating system of the individual hosts that run the services or how the hosts are connected physically in our docker-compose-based topologies.

This testing topology is so lightweight that it runs locally on the developer's machine and can be spun up in a couple of seconds. It is targeted at testing the interaction between running SCION services and analyzing their communication patterns. Some of our topologies run hundreds of SCION-related processes because they are composed of many SCION ASes that each run multiple different processes. With this sheer amount of moving parts, we need tools that help us aggregate information and allow us to quickly analyze where problems occur and what causes them.

Our current development observability stack is a combination of all three pillars that we described before. It is composed of the following tools:

  • Grafana is a data visualization tool and at the very center of our development observability stack. It allows us to look at all of the aggregated logs and metrics in one single place. Our grafana instance is provisioned with dashboards that mirror the ones deployed in production. This helps our developers get familiar with the tools that are available for debugging problems in production. It also helps our developers to identify what parts are missing, or what could be extended for better usability.
  • Prometheus is a monitoring and alerting toolkit. We use it to aggregate the metrics produced by our SCION services in the testing topologies. This is done by scraping all the metrics endpoints of the services and storing them in time series. This data is then exposed to grafana.
  • Loki is a log aggregation system. We use it to aggregate all the logs produced by our SCION services in the testing topologies. For this, we use the loki docker driver which collects all the logs from the docker containers in the testing topology, while keeping regular docker workflows intact. The native way of inspecting container logs via the "docker logs" command is unaffected by the plugin. This is especially useful because it still allows you to have a quick look at the logs of one particular service on the command line.
  • Jaeger is a distributed tracing platform that enables us to collect and analyze traces. We use the all-in-one image that bundles agent, collector, query and UI services in one. Jager is invaluable to visualize request patterns across the topology and quickly identify causes for failing requests.

The figure displays grafana with a provisioned dashboard on the left side. The dashboard displays graphs related to the SCION router. On the right side, the jaeger UI is displayed. It shows an overview of the last 100 traces from our end-to-end tests sorted by decreasing execution time.

 

We consciously decided to use the same tools that we use in production as part of our observability stack for development. While configuration and setup are slightly different between the docker-compose-based testing topology and actual production, the interactions with the different components are exactly the same. This gives us familiarity with the tools across all teams. The goal is to have developers that are familiar and proficient with the tools that we use in production when an incident is escalated to them. Thus they are able to swiftly jump into the investigation without learning the tools first.

The exception here is jaeger, which we currently do not deploy in live production. Distributed tracing has some drawbacks if multiple independent parties are involved, which our ASes are inherently. Also, the large amount of data that is produced by distributed tracing is a concern. During development, we do not have these limitations and distributed tracing has become a very effective tool in our toolkit.

Observability is constantly evolving

This combination of observability tools is very powerful for our different development workflows, but of course, this is not the set of tools we originally started out with. Our observability stack has grown with time and experience and we picked up a few tricks along the way which we share below.

In the beginning, our services exposed metrics mostly for the sake of our production deployment. Alerting and monitoring of the deployed services were based on these early metrics, but they were not integrated into the regular development workflow. The major problem here was that, while writing code, developers often did not consider what metrics could be useful. As a result, metrics were patched as an afterthought upon request from our operations team.

Today, we believe that developers actually have the best situational awareness to decide what metrics should be exposed by the component they are working on. They know exactly what the purpose of the component is, how it is implemented, and how it interacts with other components in the system. They also have a better understanding of what information will be crucial to understand the state of their component when investigating incidents. We made it our objective to think of metrics as first-class citizens during the development cycle. This is reflected in how we instrument code with metrics, and how the instrumentation pattern evolved over time. Unfortunately, this is beyond the scope of this blog post and will be touched on in a future installment. This paradigm shift also is reflected in how the testing topologies are set up for development. We added prometheus and grafana instances in our testing topologies to give our developers the same tools available in production. Thus, they can experience the metrics they expose in their component firsthand and early in the development cycle by spinning up the lightweight testing topology. This method proves to be effective, as the requests for missing metrics have gone down significantly.

We had a similar evolution for our approach to logging. Initially, we were missing an efficient way of correlating different log events with each other. We had some tooling to compose multiple log files into one single file by interleaving the log events according to their timestamps, but that addressed only a small part of the problem. It was still hard to assess what log events belong to the same execution trace. Both intra- and inter-process correlation took quite a bit of brainpower, intrinsic knowledge of the whole code base, and a lot of time. Correlating logs is the bread and butter of understanding the causal relationship of events that happen in a system. As such this is very important to understand the causes for a misbehaving system.

In an effort to make debugging easier, we introduced the concept of a debugging identifier to our log events. The debugging identifier is an 8 character long hex-encoded string that is attached to each log event that shares the same execution trace. For example, if an application handles a request from a client, each log event that is emitted while handling this specific request has the same debugging identifier attached to it. The identifier is chosen at random by the component that starts an execution trace and propagates it to all other components that are invoked during handling. In the client request example, this is done by the RPC stack, which chooses the identifier when it receives the incoming request. This debugging identifier allowed us to correlate all the log events in a single process, which was already a very important productivity boost. However, the debugging identifier did not propagate across process boundaries, making it hard to correlate log events across processes.

The missing piece to the puzzle was distributed tracing. We will explore the details of distributed tracing in a future blog post but the key observation is: to create a distributed trace across the different processes in the system, a trace identifier is passed along from one process to the next. Similar to the debugging identifier we described above, we attach the trace identifier to each log event that shares the same distributed trace. This makes it possible to now correlate log events across processes.

 

The figure displays grafana in exploration mode. On the left side, we see all the logs that are part of the requested trace identifier. On the right side, we see the whole execution trace across all services in a visual representation.

 

Here, the combination of jaeger, loki and grafana shine. We can filter log events with LogQL such that only events of the same trace are displayed. Furthermore, we can visually inspect the execution trace across the whole system with the click of a button.

This combination of jaeger, loki and grafana has been a great productivity boost. Gone are the days of manually combing through the individual log files and searching for clues about what might be wrong with the system. However, there are always opportunities to optimize and automate processes. For example, we are intrigued by the approach OpenTelemetry takes and might think about how this fits into our system. I'm excited to see where this journey leads us.

In this article, we have given a brief explanation of what observability is, what observability tools we use at Anapaya during development, and how our approach to observability has evolved over time. In the next blog post of this series, we will take a closer look at distributed tracing, and how we can leverage it during our development process.

Click here, if you want to learn more about the benefits of SCION technology. Or you can contact us directly. We're happy to talk, any time.