Distributed Tracing for Distributed Systems: Save Your Time & Company
By now, most of you have probably heard about microservices (or other distributed systems) and their benefits.
But building microservices or other distributed systems without proper monitoring and observability tools to measure performance and understand your app's behavior can be challenging.
On top of that, if there is a bottleneck in production, it is hard to identify which part of the system causes it.
For example, take a food-ordering application made up of several services, such as order, kitchen, and delivery services. When a user tries to place an order, the app gets stuck on a loading screen, which is frustrating for them. With distributed tracing, it is much easier to point out which service is slow and how slow it is.
With distributed tracing in place, a dashboard shows the exact flow of each transaction. Engineers can quickly pinpoint the bottleneck, which reduces the time spent on diagnosis. This helps keep the user experience smooth and protects the company's reputation and revenue, for example by avoiding abandoned carts. Having distributed tracing is like having a detective for your app.
What is a Distributed System?
Before continuing, let's get a basic understanding of what a distributed system is.
A distributed system is a system that consists of multiple components that are integrated and work together to achieve a common goal.
Take the food-ordering application I mentioned before: placing an order involves multiple services/apps, such as the order, kitchen, delivery, and user services (this can differ for every company and use case).
As another example, imagine you have a monolithic e-commerce app that sends an email to the user for every successful transaction, and the email is sent by a separate app that your monolith calls asynchronously. This also counts as a distributed system.
What is Tracing and Distributed Tracing?
Another concept you need to understand is tracing and distributed tracing. So what is the difference, and what is the purpose?
Tracing is a part of monitoring or observing a system: it follows a request or transaction through the system and can be used to understand the system's performance and behavior.
Distributed tracing is the same as tracing, but performed on a distributed system.
The main difference is that with distributed tracing enabled, a request/transaction that involves multiple services/components is still counted as one trace. This means you get the end-to-end performance and behavior of a specific request/flow, which is useful for debugging and analyzing a distributed system.
How does it work?
Let's start with some terminology in tracing.
Trace: a trace represents a single transaction or workflow in your system or distributed system. A trace consists of multiple spans. To produce traces in Go with OpenTelemetry, you first set up a tracer provider, like this.
// Set up the trace provider.
tracerProvider, err := newTraceProvider()
if err != nil {
	handleErr(err)
	return
}
shutdownFuncs = append(shutdownFuncs, tracerProvider.Shutdown)
otel.SetTracerProvider(tracerProvider)
Span: a span represents a single operation within a trace. A span has a start time and an end time. In my experience, a span usually maps to a function, so we create a span for every function whose running time we want to measure. This is how you create a span in Go with manual instrumentation.
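func (o OrderHandler) placeOrderHandler(w http.ResponseWriter, r *http.Request) {
	// Start a span for this handler and end it when the handler returns.
	ctx, span := tracer.Start(r.Context(), "placeOrderHandler")
	defer span.End()

	if r.Method != http.MethodPost {
		// ... (the rest of the handler is omitted in this excerpt)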
Trace Context: the trace context carries the information that identifies a trace (such as the trace ID and the current span ID) and ties spans together. In distributed tracing, we propagate this context to other services, and this is what makes distributed tracing work.
Here is how we propagate the trace context in the HTTP headers when calling another service directly.
r, err := http.NewRequest("POST", url, bytes.NewBuffer(jsonData))
if err != nil {
	return err
}
r.Header.Set("Content-Type", "application/json")

// Inject the trace context (and baggage) from ctx into the outgoing headers.
propagator := propagation.NewCompositeTextMapPropagator(
	propagation.TraceContext{},
	propagation.Baggage{},
)
propagator.Inject(ctx, propagation.HeaderCarrier(r.Header))

client := &http.Client{}
resp, err := client.Do(r)
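With the W3C Trace Context format, the value that travels in the `traceparent` header has four dash-separated hex fields: version, trace ID, parent span ID, and flags. The sketch below parses such a value using only the standard library, to show what the receiving service gets. `TraceParent` and `parseTraceParent` are hypothetical names, and the validation is simplified (version `00` layout only). In a real service you would normally let the propagator's `Extract` do this instead of parsing by hand.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// TraceParent holds the four fields of a W3C traceparent header value.
type TraceParent struct {
	Version string
	TraceID string
	SpanID  string
	Flags   string
}

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')) {
			return false
		}
	}
	return true
}

// parseTraceParent splits "version-traceid-spanid-flags" and checks field sizes.
func parseTraceParent(h string) (TraceParent, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 {
		return TraceParent{}, errors.New("traceparent: expected 4 fields")
	}
	tp := TraceParent{Version: parts[0], TraceID: parts[1], SpanID: parts[2], Flags: parts[3]}
	if !isLowerHex(tp.Version, 2) || !isLowerHex(tp.TraceID, 32) ||
		!isLowerHex(tp.SpanID, 16) || !isLowerHex(tp.Flags, 2) {
		return TraceParent{}, errors.New("traceparent: malformed field")
	}
	return tp, nil
}

// Sampled reports whether the lowest flag bit (the "sampled" flag) is set.
func (tp TraceParent) Sampled() bool {
	flags, err := strconv.ParseUint(tp.Flags, 16, 8)
	return err == nil && flags&1 == 1
}

func main() {
	tp, err := parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("trace ID:", tp.TraceID)
	fmt.Println("parent span ID:", tp.SpanID)
	fmt.Println("sampled:", tp.Sampled())
}
```

The receiving service starts its own spans with the same trace ID and uses the incoming span ID as the parent, which is how spans from different services end up in one trace.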
Here is how we propagate the trace context in the AMQP headers when we use a message broker (RabbitMQ in this case) for communication between services.
propagator := propagation.NewCompositeTextMapPropagator(
	propagation.TraceContext{},
	propagation.Baggage{},
)
headersCarrier := amqp.Table{}
propagator.Inject(ctx, AMQPHeaderCarrier{headersCarrier})
ch.Publish("", QueueName, false, false, amqp.Publishing{
	ContentType: "application/json",
	Body:        body,
	Headers:     headersCarrier,
})
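The `AMQPHeaderCarrier` type used above is not defined in this excerpt. One plausible shape is a small wrapper that provides the three methods a text-map carrier needs: `Get`, `Set`, and `Keys`. The sketch below is an assumption about how such a type could look, not the author's actual implementation. To keep it runnable without external modules, a local `Table` type stands in for `amqp.Table`, which is also a `map[string]interface{}`.

```go
package main

import (
	"fmt"
	"sort"
)

// Table stands in for amqp.Table so this sketch compiles on its own.
type Table map[string]interface{}

// AMQPHeaderCarrier adapts AMQP message headers to the Get/Set/Keys
// method set that a text-map propagator reads from and writes to.
// The field name is hypothetical.
type AMQPHeaderCarrier struct {
	headers Table
}

// Get returns the header value for key, or "" if it is missing or not a string.
func (c AMQPHeaderCarrier) Get(key string) string {
	if s, ok := c.headers[key].(string); ok {
		return s
	}
	return ""
}

// Set stores a header value. Maps are reference types, so the write is
// visible through the original Table that was wrapped.
func (c AMQPHeaderCarrier) Set(key, value string) {
	c.headers[key] = value
}

// Keys lists all header names in sorted order.
func (c AMQPHeaderCarrier) Keys() []string {
	keys := make([]string, 0, len(c.headers))
	for k := range c.headers {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	headers := Table{}
	carrier := AMQPHeaderCarrier{headers}

	// A propagator's Inject would call Set like this.
	carrier.Set("traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")

	// On the consumer side, Extract would call Get and Keys.
	fmt.Println(carrier.Keys())
	fmt.Println(carrier.Get("traceparent"))
}
```

On the consumer side, the same carrier would wrap the headers of the received message and be passed to the propagator's `Extract`, so the consumer's spans join the producer's trace.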
Instrumentation: this means adding code to application components to generate spans and propagate them. It can be done automatically or manually (the examples above use the manual way, which gives us more control). Instrumentation captures timing and contextual data. We can even attach the payload when an error happens, which makes debugging take less effort. For example, this marks a span as failed and records the error on it.
span.SetStatus(codes.Error, errorMessage)
span.RecordError(errors.New(errorMessage))
Distributed Tracing System: this is the infrastructure that collects, stores, and analyzes distributed traces. It provides tools for visualizing traces, identifying performance bottlenecks, and understanding the behavior of distributed systems. Examples are Jaeger and Elastic APM.
Sampling: this is the process of deciding which traces to record and which to discard, usually to limit overhead and storage cost.
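One common way to sample is to keep a fixed fraction of traces and to base the decision on the trace ID, so that every service makes the same decision for the same trace. The sketch below illustrates that idea with the standard library only. The function name `shouldSample` and the exact arithmetic are assumptions for illustration, not any particular library's algorithm.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// shouldSample keeps roughly `ratio` of all traces (0.0 to 1.0).
// The decision depends only on the trace ID, so it is deterministic:
// the same trace ID always gives the same answer.
func shouldSample(traceIDHex string, ratio float64) (bool, error) {
	id, err := hex.DecodeString(traceIDHex)
	if err != nil || len(id) != 16 {
		return false, fmt.Errorf("invalid trace ID %q", traceIDHex)
	}
	if ratio >= 1 {
		return true, nil
	}
	if ratio <= 0 {
		return false, nil
	}
	// Read the last 8 bytes as a number in [0, 2^63) and keep the trace
	// if it falls below ratio * 2^63.
	x := binary.BigEndian.Uint64(id[8:16]) >> 1
	threshold := uint64(ratio * (1 << 63))
	return x < threshold, nil
}

func main() {
	traceID := "4bf92f3577b34da6a3ce929d0e0e4736"
	for _, ratio := range []float64{0.1, 0.5, 0.7, 1.0} {
		keep, err := shouldSample(traceID, ratio)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("ratio=%.1f keep=%v\n", ratio, keep)
	}
}
```

A lower ratio means less overhead and storage but a higher chance that the one slow request you care about was not recorded, so the right value depends on your traffic.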
Contextual Data: this is additional information that you can commonly attach to a span, such as logs, metrics, tags, etc.
Conclusions
- Distributed tracing is very helpful for measuring the performance and understanding the behavior of your app.
- Having distributed tracing in your distributed system is like having a detective: when something bad happens, debugging takes less effort, and engineers can quickly figure out what is happening and fix it.
- The main difference between tracing and distributed tracing is that we propagate the trace context to other services.
You can find the full implementations in these repositories: