Cache Key Generation and Metadata

Flyte implements a sophisticated caching mechanism that ensures task execution results are reused efficiently while maintaining strict data integrity. This system relies on the generation of deterministic cache keys and the preservation of execution metadata to track the provenance of cached artifacts.

The Anatomy of a Cache Key

In Flyte, a cache key is not a simple hash of a task name. Instead, it is a composite identifier that encapsulates the task's identity, its interface signature, and the specific input values used during execution. This logic is primarily encapsulated in the catalog.Key struct found in flyteplugins/go/tasks/pluginmachinery/catalog/client.go.

type Key struct {
	Identifier           *core.Identifier
	CacheVersion         string
	CacheIgnoreInputVars []string
	TypedInterface       *core.TypedInterface
	InputReader          io.InputReader
	CacheKey             string
}

The final string representation of a cache key is generated by concatenating four distinct hashes, as seen in the buildCacheKey function within flyteplugins/go/tasks/pluginmachinery/catalog/cache_service/client.go:

Identifier Hash: Generated via catalog.HashIdentifierExceptVersion. Flyte intentionally excludes the version of the task or launch plan from this hash. This design choice allows for cache hits across different versions of the same resource, provided the interface and inputs remain identical.
Interface Signature Hash: Generated via generateInterfaceSignatureHash. This ensures that if a task's input or output types change, the cache key will change, preventing type mismatches when retrieving cached data.
Inputs Hash: Generated via hashInputs. This is a hash of the actual literal values passed to the task.
Cache Version: A user-defined string (often provided in the task decorator) that allows developers to manually invalidate the cache by changing the version string.

Input Hashing and Optimization

The hashing of inputs is performed by catalog.HashLiteralMap in flyteplugins/go/tasks/pluginmachinery/catalog/hashing.go. A notable optimization in this process is the "hashify" step. If a core.Literal already contains a pre-computed hash, Flyte uses that hash instead of re-processing the entire value.

func hashify(literal *core.Literal) *core.Literal {
	if literal.GetHash() != "" {
		return &core.Literal{
			Hash: literal.GetHash(),
		}
	}
    // Recursive cases for collections and maps...
}

Additionally, the CacheIgnoreInputVars field in the Key struct allows specific input variables to be excluded from the hash calculation. This is useful for inputs that do not affect the task's output, such as logging levels or non-deterministic metadata.

Execution Metadata and Provenance

While the cache key identifies what was executed, the catalog.Metadata struct captures the who, where, and when. This metadata is stored alongside the cached output in the catalog to provide a clear audit trail of the data's origin.

type Metadata struct {
	WorkflowExecutionIdentifier *core.WorkflowExecutionIdentifier
	NodeExecutionIdentifier     *core.NodeExecutionIdentifier
	TaskExecutionIdentifier     *core.TaskExecutionIdentifier
	CreatedAt                   *timestamppb.Timestamp
}

When Flyte populates the cache after a successful execution, it maps these identifiers to a metadata map. In flyteplugins/go/tasks/pluginmachinery/catalog/cache_service/client.go, the newCacheMetadata function demonstrates how execution details like the project, domain, and specific task attempt are preserved:

cacheMetadata.KeyMap = &cacheservicepb.KeyMapMetadata{
    Values: map[string]string{
        TaskVersionKey:     metadata.TaskExecutionIdentifier.GetTaskId().GetVersion(),
        ExecProjectKey:     metadata.TaskExecutionIdentifier.GetNodeExecutionId().GetExecutionId().GetProject(),
        ExecDomainKey:      metadata.TaskExecutionIdentifier.GetNodeExecutionId().GetExecutionId().GetDomain(),
        ExecNameKey:        metadata.TaskExecutionIdentifier.GetNodeExecutionId().GetExecutionId().GetName(),
        ExecNodeIDKey:      metadata.TaskExecutionIdentifier.GetNodeExecutionId().GetNodeId(),
        ExecTaskAttemptKey: fmt.Sprintf("%d", metadata.TaskExecutionIdentifier.GetRetryAttempt()),
    },
}

This metadata allows the Flyte UI and CLI to show users exactly which execution produced a specific cached result, enhancing transparency in complex workflows.

Asynchronous Cache Persistence

To minimize the impact on task execution latency, Flyte performs cache writes asynchronously. The WriterWorkItem struct in flyteplugins/go/tasks/pluginmachinery/catalog/writer_processor.go encapsulates all the necessary components for a cache store operation: the Key, the OutputReader (containing the actual data), and the Metadata.

type WriterWorkItem struct {
	key      Key
	data     io.OutputReader
	metadata Metadata
}

These work items are processed by the WriterProcessor within an indexed work queue. This architecture allows the Flyte executor to signal that a task is complete as soon as its outputs are uploaded to storage, while the catalog registration happens in the background.

Configuration and Constraints

Flyte provides several configuration options to tune cache behavior via CacheKeyConfig and ReaderWorkqueueConfig/WriterWorkqueueConfig.

EnforceExecutionProjectDomain: When enabled, this ensures that cache keys are computed based on the project and domain of the current execution, even if the task being executed is defined in a different project. This prevents cross-project cache pollution in multi-tenant environments.
MaxCacheAge: Enforced client-side during cache lookups. If a cached entry is older than this duration, Flyte treats it as a cache miss, ensuring that stale data is eventually refreshed.
Workqueue Retries: Both the reader and writer queues support configurable retries (defaulting to 3) to handle transient failures in the catalog service.

By combining deterministic hashing with rich provenance metadata and asynchronous processing, Flyte provides a robust caching layer that balances performance with the strict requirements of reproducible data science workflows.

The Anatomy of a Cache Key​

Input Hashing and Optimization​

Execution Metadata and Provenance​

Asynchronous Cache Persistence​

Configuration and Constraints​

The Anatomy of a Cache Key

Input Hashing and Optimization

Execution Metadata and Provenance

Asynchronous Cache Persistence

Configuration and Constraints