Distributed Tracing with OpenTelemetry
Distributed tracing in Flyte provides visibility into the execution flow across multiple services, such as the Executor, Admin, and Data Proxy. By integrating with OpenTelemetry (OTEL), Flyte allows you to track requests as they move through the system, helping identify bottlenecks and debug failures in complex workflows.
Configuring Exporters
Flyte supports several OpenTelemetry exporters, including Jaeger, OTLP (gRPC and HTTP), and a file-based exporter for local debugging. You configure these via the otel section in your Flyte configuration.
The otel.type setting determines which exporter is used. Supported values are noop, file, jaeger, otlpgrpc, and otlphttp.
otel:
type: "otlpgrpc"
otlpgrpc:
endpoint: "http://otel-collector.tracing.svc.cluster.local:4317"
sampler:
parentSampler: "traceid"
traceIdRatio: 0.1
The configuration is defined in flytestdlib/otelutils/config.go within the Config struct. Each exporter has its own configuration block:
- Jaeger: Uses the
otel.jaeger.endpoint(default:http://localhost:14268/api/traces). - OTLP gRPC: Uses the
otel.otlpgrpc.endpoint(default:http://localhost:4317). - OTLP HTTP: Uses the
otel.otlphttp.endpoint(default:http://localhost:4318/v1/traces). - File: Writes traces to a local file specified by
otel.file.filename(default:/tmp/trace.txt).
Service Initialization
Flyte services initialize their tracing and metrics providers during startup using otelutils.RegisterProvidersWithContext. This function sets up the global tracer and meter providers for a specific service name.
In executor/setup.go, the initialization looks like this:
otelCfg := otelutils.GetConfig()
if err := otelutils.RegisterProvidersWithContext(ctx, otelServiceName, otelCfg); err != nil {
return fmt.Errorf("registering otel providers: %w", err)
}
Internally, RegisterProvidersWithContext (found in flytestdlib/otelutils/factory.go) creates the appropriate exporter based on the configuration and stores the resulting TracerProvider in a global map keyed by the service name. This allows different components within the same process to use distinct providers if necessary.
Propagating Traces via ConnectRPC
Flyte uses ConnectRPC for communication between many of its services. To ensure trace context is propagated across these network boundaries, Flyte utilizes otelconnect interceptors.
When creating a client, you pass the interceptor to the client constructor. For example, in executor/setup.go, the EventsProxyServiceClient is configured with an OTEL interceptor:
otelInterceptor, err := otelconnect.NewInterceptor(
otelconnect.WithTracerProvider(otelutils.GetTracerProvider(otelServiceName)),
otelconnect.WithMeterProvider(otelutils.GetMeterProvider(otelServiceName)),
otelconnect.WithoutServerPeerAttributes(),
)
eventsClient := workflowconnect.NewEventsProxyServiceClient(
http.DefaultClient,
eventsServiceURL,
connect.WithInterceptors(otelInterceptor),
)
This interceptor automatically extracts trace headers from incoming requests and injects them into outgoing requests, maintaining the distributed trace across service calls.
Manual Instrumentation
For internal logic that isn't covered by RPC interceptors, you can create manual spans using otelutils.NewSpan. This helper function is designed to integrate with Flyte's logging system by automatically injecting log fields from the context as span attributes.
In flytestdlib/storage/cached_rawstore.go, manual spans are used to track storage operations:
func (s *cachedRawStore) Head(ctx context.Context, reference storage.DataReference) (storage.Metadata, error) {
ctx, span := otelutils.NewSpan(ctx, otelutils.BlobstoreClientTracer, "flytestdlib.storage.cachedRawStore/Head")
defer span.End()
return s.RawStore.Head(ctx, reference)
}
The NewSpan function (in flytestdlib/otelutils/factory.go) iterates through log fields retrieved via contextutils.GetLogFields(ctx) and adds them to the span. This ensures that if your context already contains metadata like project, domain, or workflow_id, those values are automatically searchable in your tracing backend.
Kubernetes API Visibility
Flyte interacts heavily with the Kubernetes API. To gain visibility into these interactions, Flyte provides wrappers for the controller-runtime client and cache.
The otelutils.WrapK8sClient and otelutils.WrapK8sCache functions (defined in flytestdlib/otelutils/k8s.go) wrap standard Kubernetes clients to create spans for every Get, List, Create, Update, and Delete operation.
// Example of wrapping a K8s client
kubeClient = otelutils.WrapK8sClient(kubeClient)
These spans are tagged with the k8s-client tracer name and include the specific operation path (e.g., controller-runtime.pkg.client.Client/Get), allowing you to see exactly how much time is spent waiting on the Kubernetes API server.
Sampling Strategies
To manage the volume of trace data, Flyte supports two sampling strategies configured via otel.sampler.parentSampler:
always: Every request is sampled. This is useful for development or low-traffic environments.traceid: Uses probabilistic sampling based on a ratio. The ratio is controlled byotel.sampler.traceIdRatio.
Both strategies are wrapped in a ParentBased sampler. This means that if an upstream service (like a proxy or another Flyte service) has already decided to sample a request, Flyte will respect that decision regardless of its local ratio.