What is Distributed Tracing?

Distributed traces record the paths that requests take as they propagate through multi-service architectures, like microservice and serverless applications. Tracing is critical for observability in microservice architectures because it is the only form of telemetry that encodes the dependencies and relationships between the hundreds of components in modern architectures.

So, gathering high-quality tracing telemetry is of the utmost importance; and yet that’s a difficult thing to do without some sort of common approach, since distributed systems involve many different languages, frameworks, and infrastructure components. That’s why OpenTelemetry can help so much with distributed tracing: by providing a common set of APIs, SDKs, and wire protocols, OpenTelemetry gives organizations a single, well-supported integration surface for end-to-end distributed tracing telemetry.

Microservices and tracing

Microservices introduce significant challenges to tracing a request through an application, thanks to the distributed nature of microservices deployments. Consider a traditional monolithic application: with a code base centralized onto a single host, diagnosing a failure can be as simple as following a single stack trace. But, when an application consists of tens, hundreds, or thousands of services running across many hosts, it is no longer possible to rely on an individual trace. Instead, you need something that represents the entire request as it moves from service to service, component to component. Distributed tracing solves this problem, providing powerful capabilities such as anomaly detection, distributed profiling, workload modeling, and diagnosis of steady-state problems.

OpenTelemetry and tracing

Much of the terminology and mental models that we use to describe distributed tracing can trace their origin to systems such as Magpie, X-Trace, and Dapper. Dapper, in particular, has been highly influential to modern distributed tracing efforts, and many of the mental models and terminology that OpenTelemetry uses can trace their origin to that project. The goal of these distributed tracing efforts has been to profile requests as they move across service boundaries, generating high-quality data about those requests suitable for analysis.

Figure 1

The diagram in Figure 1 represents a sample trace. A trace is a collection of linked spans, which are named and timed operations that represent a unit of work in the request. A span that isn’t the child of any other span is the parent span, or root span, of the trace. The root span describes the end-to-end latency of the entire trace, with child spans representing sub-operations.

To put this in more concrete terms, consider the request flow of a system that you might encounter in the real world, a ridesharing app. When a user requests a ride, multiple actions begin to take place –– information is passed between services in order to authenticate and authorize the user, validate payment information, locate nearby drivers, and dispatch one of them to pick up the rider.

A simplified diagram of this system, and a trace of a request through it, appears in the following figure. As you can see, each operation generates a span to represent the work being done during its execution. These spans have implicit relationships (parent-child) both from the beginning of the entire request at the client, but also from individual services in the trace. Traces are composable in this way: a valid trace consists of valid sub-traces.

Figure 2

Spans

Each span in OpenTelemetry encapsulates several pieces of information, such as the name of the operation it represents, a start and end timestamp, events and attributes that occurred during the span, links to other spans, and the status of the operation. In Figure 1, the dashed lines connecting spans represent the context of the trace. The context (or trace context) contains several pieces of information that can be passed between functions inside a process or between processes over an RPC. In addition to the span context, identifiers that represent the parent trace and span, the context can contain other information about the process or request, like custom labels. As mentioned before, an important feature of spans is that they are able to encapsulate a host of information. Much of this information, such as the operation name and start/stop timestamps, is required — but some is optional. OpenTelemetry offers two data types, Attribute and Event, which are incredibly valuable as they help to contextualize what happens during the execution measured by a single span.

Attributes are key-value pairs that can be freely added to a span to help in analysis of the trace data. You can think of Attributes as data that you would like to eventually aggregate or use to filter your trace data, such as a customer identifier, process hostname, or anything else that fits your tracing needs. Events are time-stamped strings that can be attached to a span, with an optional set of Attributes that provide further description. OpenTelemetry additionally provides a set of semantic conventions of reserved attributes and events for operation or protocol specific information. Spans in OpenTelemetry are generated by the Tracer, an object that tracks the currently active span and allows you to create (or activate) new spans.

Tracer objects are configured with Propagator objects that support transferring the context across process boundaries. The exact mechanism for creating and registering a Tracer is dependent on your implementation and language, but you can generally expect there to be a global Tracer capable of providing a default tracer for your spans, and/or a Tracer provider capable of granting access to the tracer for your component. As spans are created and completed, the Tracer dispatches them to the OpenTelemetry SDK’s Exporter, which is responsible for sending your spans to a backend system for analysis.