Cache Key Generation and Metadata
Flyte implements a sophisticated caching mechanism that ensures task execution results are reused efficiently while maintaining strict data integrity. This system relies on the generation of deterministic cache keys and the preservation of execution metadata to track the provenance of cached artifacts.
The Anatomy of a Cache Key
In Flyte, a cache key is not a simple hash of a task name. Instead, it is a composite identifier that encapsulates the task's identity, its interface signature, and the specific input values used during execution. This logic is primarily encapsulated in the catalog.Key struct found in flyteplugins/go/tasks/pluginmachinery/catalog/client.go.
type Key struct {
Identifier *core.Identifier
CacheVersion string
CacheIgnoreInputVars []string
TypedInterface *core.TypedInterface
InputReader io.InputReader
CacheKey string
}
The final string representation of a cache key is generated by concatenating four distinct hashes, as seen in the buildCacheKey function within flyteplugins/go/tasks/pluginmachinery/catalog/cache_service/client.go:
- Identifier Hash: Generated via
catalog.HashIdentifierExceptVersion. Flyte intentionally excludes the version of the task or launch plan from this hash. This design choice allows for cache hits across different versions of the same resource, provided the interface and inputs remain identical. - Interface Signature Hash: Generated via
generateInterfaceSignatureHash. This ensures that if a task's input or output types change, the cache key will change, preventing type mismatches when retrieving cached data. - Inputs Hash: Generated via
hashInputs. This is a hash of the actual literal values passed to the task. - Cache Version: A user-defined string (often provided in the task decorator) that allows developers to manually invalidate the cache by changing the version string.
Input Hashing and Optimization
The hashing of inputs is performed by catalog.HashLiteralMap in flyteplugins/go/tasks/pluginmachinery/catalog/hashing.go. A notable optimization in this process is the "hashify" step. If a core.Literal already contains a pre-computed hash, Flyte uses that hash instead of re-processing the entire value.
func hashify(literal *core.Literal) *core.Literal {
if literal.GetHash() != "" {
return &core.Literal{
Hash: literal.GetHash(),
}
}
// Recursive cases for collections and maps...
}
Additionally, the CacheIgnoreInputVars field in the Key struct allows specific input variables to be excluded from the hash calculation. This is useful for inputs that do not affect the task's output, such as logging levels or non-deterministic metadata.
Execution Metadata and Provenance
While the cache key identifies what was executed, the catalog.Metadata struct captures the who, where, and when. This metadata is stored alongside the cached output in the catalog to provide a clear audit trail of the data's origin.
type Metadata struct {
WorkflowExecutionIdentifier *core.WorkflowExecutionIdentifier
NodeExecutionIdentifier *core.NodeExecutionIdentifier
TaskExecutionIdentifier *core.TaskExecutionIdentifier
CreatedAt *timestamppb.Timestamp
}
When Flyte populates the cache after a successful execution, it maps these identifiers to a metadata map. In flyteplugins/go/tasks/pluginmachinery/catalog/cache_service/client.go, the newCacheMetadata function demonstrates how execution details like the project, domain, and specific task attempt are preserved:
cacheMetadata.KeyMap = &cacheservicepb.KeyMapMetadata{
Values: map[string]string{
TaskVersionKey: metadata.TaskExecutionIdentifier.GetTaskId().GetVersion(),
ExecProjectKey: metadata.TaskExecutionIdentifier.GetNodeExecutionId().GetExecutionId().GetProject(),
ExecDomainKey: metadata.TaskExecutionIdentifier.GetNodeExecutionId().GetExecutionId().GetDomain(),
ExecNameKey: metadata.TaskExecutionIdentifier.GetNodeExecutionId().GetExecutionId().GetName(),
ExecNodeIDKey: metadata.TaskExecutionIdentifier.GetNodeExecutionId().GetNodeId(),
ExecTaskAttemptKey: fmt.Sprintf("%d", metadata.TaskExecutionIdentifier.GetRetryAttempt()),
},
}
This metadata allows the Flyte UI and CLI to show users exactly which execution produced a specific cached result, enhancing transparency in complex workflows.
Asynchronous Cache Persistence
To minimize the impact on task execution latency, Flyte performs cache writes asynchronously. The WriterWorkItem struct in flyteplugins/go/tasks/pluginmachinery/catalog/writer_processor.go encapsulates all the necessary components for a cache store operation: the Key, the OutputReader (containing the actual data), and the Metadata.
type WriterWorkItem struct {
key Key
data io.OutputReader
metadata Metadata
}
These work items are processed by the WriterProcessor within an indexed work queue. This architecture allows the Flyte executor to signal that a task is complete as soon as its outputs are uploaded to storage, while the catalog registration happens in the background.
Configuration and Constraints
Flyte provides several configuration options to tune cache behavior via CacheKeyConfig and ReaderWorkqueueConfig/WriterWorkqueueConfig.
- EnforceExecutionProjectDomain: When enabled, this ensures that cache keys are computed based on the project and domain of the current execution, even if the task being executed is defined in a different project. This prevents cross-project cache pollution in multi-tenant environments.
- MaxCacheAge: Enforced client-side during cache lookups. If a cached entry is older than this duration, Flyte treats it as a cache miss, ensuring that stale data is eventually refreshed.
- Workqueue Retries: Both the reader and writer queues support configurable retries (defaulting to 3) to handle transient failures in the catalog service.
By combining deterministic hashing with rich provenance metadata and asynchronous processing, Flyte provides a robust caching layer that balances performance with the strict requirements of reproducible data science workflows.