Skip to main content

Storage Observability and Metrics

Flyte provides a comprehensive observability layer for its storage operations, ensuring that data movement between compute resources and object stores (like S3, GCS, or Azure Blob Storage) is transparent and measurable. This instrumentation is primarily implemented in the flytestdlib/storage package and is designed to help operators identify performance bottlenecks, such as high latency in cloud provider APIs or excessive serialization overhead.

Metrics Architecture

The storage observability system is centered around the dataStoreMetrics struct, defined in flytestdlib/storage/rawstores.go. This struct acts as a central hub, aggregating specialized metrics for different layers of the storage stack:

type dataStoreMetrics struct {
cacheMetrics *cacheMetrics
protoMetrics *protoMetrics
copyMetrics *copyMetrics
stowMetrics *stowMetrics
}

When a DataStore is initialized via NewDataStore or NewDataStoreWithContext, it requires a promutils.Scope. This scope is used to prefix all exported metrics, allowing Flyte to distinguish between storage metrics from different components like the executor, dataproxy, or flytecopilot.

Backend Operations (Stow Metrics)

The stowMetrics class in flytestdlib/storage/stow_store.go monitors the interaction with the underlying cloud storage provider. It uses both labeled.StopWatch and labeled.HistogramStopWatch to track latency.

  • StopWatch: Provides a summary of latency (typically in milliseconds).
  • HistogramStopWatch: Provides a distribution of latencies, which is essential for identifying tail latency (p99) issues in cloud environments.

Key metrics include:

  • head: Latency and failures for metadata lookups.
  • read_open: Time to first byte when opening a stream for reading.
  • write: Time to complete a PUT operation.
  • list: Latency for directory/prefix listing.

These metrics use labeled counters, which allow Flyte to attach additional context such as the failure_type (e.g., distinguishing between a 404 Not Found and a 403 Forbidden).

Metadata Caching (Cache Metrics)

Flyte can be configured to use an in-memory metadata cache (powered by freecache) to reduce the number of expensive Head calls to cloud providers. The cacheMetrics in flytestdlib/storage/cached_rawstore.go track the effectiveness of this cache:

  • cache_hit / cache_miss: Counters to calculate the hit ratio.
  • cache_write_error: Tracks failures when attempting to update the local cache.
  • fetch_latency: A StopWatch measuring how long it takes to fetch data from the remote store when a cache miss occurs.

Constraint: The underlying freecache implementation has a size limit where objects larger than 1/1024 of the total max_size_mbs will not be cached. This behavior is transparent to the user but can be observed through higher miss rates for large objects.

Serialization Overhead (Proto Metrics)

Since Flyte frequently stores structured data as Protobuf messages, the protoMetrics in flytestdlib/storage/protobuf_store.go track the overhead of serialization:

  • marshal_time / unmarshal_time: Latency of converting Go structs to/from wire format.
  • marshal_failure / unmarshal_failure: Counters for serialization errors.

This allows developers to distinguish between a slow network (tracked by stowMetrics) and a CPU-bound serialization bottleneck.

Data Transfer (Copy Metrics)

The copyMetrics in flytestdlib/storage/copy_impl.go monitor the CopyRaw operation. A unique aspect of this implementation is ComputeLengthLatency.

In Flyte, if a ReadCloser from a source does not implement the io.Seeker interface, the copyImpl must read the entire content into memory to calculate the length before writing to the destination (a requirement for backends like S3). The time spent performing this buffering is tracked separately to highlight the cost of copying between incompatible storage interfaces.

Configuration and Initialization

Metrics are enabled by passing a valid Prometheus scope during the creation of the DataStore. In a typical Flyte deployment, this happens during the setup of the executor or other services:

func ExampleNewDataStore() {
testScope := promutils.NewTestScope()
ctx := context.Background()

// The scope "exp_new" will prefix all storage metrics
store, err := NewDataStore(&Config{
Type: TypeMemory,
}, testScope.NewSubScope("exp_new"))

if err != nil {
// handle error
}

// Operations on 'store' will now increment Prometheus counters
}

Failure Tracking Tradeoffs

Flyte makes a deliberate distinction in its failure metrics. For example, copyMetrics includes WriteFailureUnrelatedToCache. This design choice ensures that transient errors in the metadata cache do not pollute the metrics for the core storage logic. If a write to the primary storage succeeds but the subsequent cache update fails, it is not counted as a storage failure, preserving the accuracy of reliability SLIs for the underlying cloud provider.

Metrics Reference Table

Metric NameTypeDescription
bad_keyCounterIncorrectly formatted storage references.
headStopWatch/HistogramLatency of metadata (HEAD) requests.
read_openStopWatch/HistogramLatency to open a read stream.
writeStopWatch/HistogramLatency of object writes (PUT).
cache_hitCounterSuccessful metadata cache lookups.
marshal_timeStopWatchTime spent serializing Protobuf messages.
lengthStopWatchTime spent buffering data to compute content length.