Tracing

Resources

Telemetry data is indexed by service. In OpenTelemetry, services are described by resources, which are set when the OpenTelemetry SDK is initialized during program startup. We want our data to be normalized, so that we can compare apples to apples. OpenTelemetry defines a schema for the keys and values that describe common service resources such as hostname, region, and version. These standards are called Semantic Conventions, and are defined in the OpenTelemetry Specification.

We recommend that, at minimum, the following resources be applied to every service:

| Attribute | Description | Example | Required? |
|---|---|---|---|
| service.name | Logical name of the service. MUST be the same for all instances of horizontally scaled services. | shoppingcart | Yes |
| service.version | The version string of the service API or implementation as defined in Version Attributes. | semver:2.0.0 | No |
| host.hostname | Contains what the hostname command would return on the host machine. | server1.mydomain.com | No |
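One way to enforce the recommendation above is a small check at startup. The following is only a sketch: `check_resource_attributes` and the two key tuples are hypothetical names introduced for illustration and are not part of OpenTelemetry.

```python
# Hypothetical startup check; these names are not part of OpenTelemetry.
REQUIRED_KEYS = ("service.name",)
RECOMMENDED_KEYS = ("service.version", "host.hostname")


def check_resource_attributes(attributes):
    """Return (missing_required, missing_recommended) as two lists of keys."""
    missing_required = [key for key in REQUIRED_KEYS if not attributes.get(key)]
    missing_recommended = [key for key in RECOMMENDED_KEYS if not attributes.get(key)]
    return missing_required, missing_recommended


attributes = {"service.name": "shoppingcart", "service.version": "semver:2.0.0"}
missing_required, missing_recommended = check_resource_attributes(attributes)

# Fail fast on required keys; recommended keys could simply be logged instead.
if missing_required:
    raise ValueError(f"missing required resource attributes: {missing_required}")
```

Running a check like this before the SDK is initialized keeps a missing service.name from silently producing unlabeled telemetry.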

Python Configuration

In Python, resources are set on the TracerProvider at program startup.

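   # Example of setting resource attributes at program startup.
   import platform

   from opentelemetry import trace
   from opentelemetry.sdk.resources import Resource
   from opentelemetry.sdk.trace import TracerProvider

   def configure_tracing():
       # Resource.create merges these attributes with the SDK defaults.
       resource = Resource.create({
           "service.name": "service123",
           "service.version": "1.2.3",
           "host.hostname": platform.node(),
       })
       # The resource is fixed when the TracerProvider is constructed.
       trace.set_tracer_provider(TracerProvider(resource=resource))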

Semantic Conventions

Standardizing the format of your data is a critical part of using OpenTelemetry. OpenTelemetry provides a schema for describing common resources, so that backends can easily parse and identify relevant information.

It is important to understand these conventions when writing instrumentation, in order to normalize your data and increase its utility. The semantic conventions for resources can be found in the specification.

The following types of resources are currently defined:

Spans

OpenTelemetry comes with many instrumentation plugins for libraries and frameworks. The spans these plugins generate should provide enough detail to get started with tracing in production.

As great as that is, you will still want to add spans to your own application code, in order to break down larger operations and gain more detailed insight into where your application is spending its time.

When you create a new span to measure a subcomponent, that span is added to the current trace as the child of the current span, and then becomes the current span itself.

Tracer API

Accessing the tracer

In order to interact with traces, you must first acquire a handle to a Tracer.

By convention, Tracers are named after the component they are instrumenting; usually a library, a package, or a class.

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

Note that there is no need to "set" a tracer by name before getting it. The name you provide is to help identify which component generated which spans, and to potentially disable tracing for individual components.

We recommend calling get_tracer once per component during initialization and retaining a handle to the tracer, rather than calling get_tracer repeatedly.

Accessing the current span

Ideally, when tracing application code, spans are created and managed in the application framework.

Assuming that your application framework is supported, a trace will automatically be created for each request, and your application code will already be wrapped in a span, which can be used for adding application specific attributes and events.

To access the currently active span, call get_current_span:

from opentelemetry import trace

span = trace.get_current_span()

Setting a new current span

Let’s demonstrate creating a new span by example. Imagine you have an automated kitchen, and you want to time how long the robot chef takes to bake a cake. The naive way to do this would be to just start a span, call your method, then end the span:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Make a child span to measure how long it takes to bake a cake.
parent = trace.get_current_span()
cake_span = tracer.start_span("bake-cake", context=trace.set_span_in_context(parent))
chef.bake_cake()
cake_span.end()

The above example will work just fine, but with one big problem: the bake_cake method itself has no access to this new bake-cake span. That means there would be no way to add attributes and events to this span from within the bake_cake method. Even worse, get_current_span would return the parent of “bake-cake,” since that span is still set as current.

What should we do instead? Replace the current span with bake-cake. To do this, use start_as_current_span as a context manager around the call to bake_cake. Within the with block, get_current_span will return bake-cake.

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Replace the current active span with a new child span
with tracer.start_as_current_span("bake-cake"):
  # now returns the bake-cake span
  trace.get_current_span()
  # bake your cake!
  chef.bake_cake()

This pattern of wrapping method calls is important, because we always want application code to be able to assume that the current span is correct.

Attributes

When performing root cause analysis, span attributes are an important tool for pinpointing the source of performance issues.

Setting Attributes

Note that it is only possible to set attributes, not to get them.

Much like how resources are used to describe your services, attributes are used to describe your spans. Here is an example of setting attributes to correctly define an HTTP client request:

from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer(__name__)

span = tracer.start_span(
  "/project/:project-id/list",
  kind=SpanKind.CLIENT,
  attributes={
    "http.method": "GET",
    "http.flavor": "1.1",
    "http.url": "https://example.com:8080/project/123/list/?page=2",
    "net.peer.ip": "192.0.2.5",
    "http.status_code": 200,
    "http.status_text": "OK"
  },
)

# In addition to the standard attributes, custom attributes can be added as well.
span.set_attribute("list.page_number", 2)

# To avoid collisions, always namespace your attribute keys using dot notation.
span.set_attribute("project.id", 123)

# Attributes can be added to a span at any time before the span is finished.
span.end()

Conventions

Spans represent specific operations in and between systems. Many operations represent well-known protocols like HTTP or database calls. As with resources, OpenTelemetry defines a schema for the attributes which describe these common operations. These standards are called Semantic Conventions, and are defined in the OpenTelemetry Specification.

OpenTelemetry provides a schema for describing common attributes so that backends can easily parse and identify relevant information. It is important to understand these conventions when writing instrumentation, in order to normalize your data and increase its utility.

The following semantic conventions are defined for tracing:

  • General: General semantic attributes that may be used in describing different kinds of operations.
  • HTTP: Spans for HTTP client and server.
  • Database: Spans for SQL and NoSQL client calls.
  • RPC/RMI: Spans for remote procedure calls (e.g., gRPC).
  • Messaging: Spans for interaction with messaging systems (queues, publish/subscribe, etc.).
  • FaaS: Spans for Function as a Service (e.g., AWS Lambda).

Events

The finest-grained tracing tool is the event system.

Span events are a form of structured logging. Each event has a name, a timestamp, and a set of attributes. When events are added to a span, they inherit the span's context. This additional context allows events to be searched, filtered, and grouped by trace ID and other span attributes.

Span context is one of the key differences between distributed tracing and traditional logging.

Adding events

Events are automatically timestamped when they are added to a span. Timestamps can also be set manually if the events are being added after the fact.

For example, enqueuing an item might be recorded as an event.

from opentelemetry import trace

# Get the current span
span = trace.get_current_span()

# Perform the action
queue.enqueue(item)

# Record the action
span.add_event("enqueued item", {
    "item.id": item.id(),
    "queue.id": queue.id(),
    "queue.length": queue.length(),
})

Spans should be created for recording coarse-grained operations, and events should be created for recording fine-grained operations.

Recording exceptions

Many of the tracing conventions can apply to event attributes as well as span attributes. The most important event-specific convention is recording exceptions.

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

span = trace.get_current_span()

# record_exception converts the exception into a span event.
span.record_exception(exception)

# If the exception means the operation results in an
# error state, you can also use it to update the span status.
span.set_status(Status(StatusCode.ERROR, str(exception)))

Marking the span as an error is independent of recording exceptions. To mark the entire span as an error, and have it count against error rates, set the span status to ERROR.

StatusCode definitions can be found in the OpenTelemetry specification. Recent versions of the specification define only three status codes: UNSET, OK, and ERROR. Earlier releases exposed a larger set of codes (StatusCanonicalCode, with values such as UNKNOWN and INTERNAL); if you are on one of those releases and no status code directly maps to the type of error you are recording, set the status code to UNKNOWN for common errors, and INTERNAL for serious errors.