Internal Architecture of the Cache Service
The Flyte Cache Service is structured into two primary layers: a thin transport-agnostic service layer and a core orchestration manager. This separation ensures that the logic for cache lifecycle management—such as concurrent population coordination and metadata merging—remains decoupled from the specific RPC protocols used for communication.
The Service Layer: API and Validation
The CacheService class in cache_service/service/service.go serves as the entry point for all external requests. It implements the ConnectRPC service interface and is responsible for translating transport-specific identifiers into internal scoped keys.
Request Validation
To ensure data integrity, the service layer utilizes a validatableRequest interface. Every incoming protobuf message that implements a Validate() method is checked before any processing occurs:
type validatableRequest interface {
Validate() error
}
func validateRequest(msg validatableRequest) error {
if err := msg.Validate(); err != nil {
return connect.NewError(connect.CodeInvalidArgument, err)
}
return nil
}
Key Scoping
Flyte scopes cache entries by project and domain to prevent key collisions across different organizational boundaries. The scopedKey function in cache_service/service/service.go generates these internal keys:
func scopedKey(key string, id *cacheservicev2.Identifier) string {
return fmt.Sprintf("%s-%s-%s", id.GetProject(), id.GetDomain(), key)
}
The Manager: Core Orchestration
The Manager class in cache_service/manager/manager.go is the "brain" of the service. It coordinates between the API layer and the underlying persistence repositories (CachedOutputRepo and ReservationRepo).
Cache Entry Lifecycle
When a cache entry is retrieved, the Manager returns a CacheEntry DTO, which encapsulates the location of the data and its associated metadata:
type CacheEntry struct {
OutputURI string
Metadata *cacheservicepb.Metadata
}
A critical design choice in Flyte is that the Cache Service only persists the OutputURI and metadata. The actual data payload resides in external object storage (like S3 or GCS). This keeps the cache service lightweight and avoids the overhead of proxying large data blobs.
Metadata Merging
When updating a cache entry via Put, the Manager performs a merge operation. It preserves the original CreatedAt timestamp while updating the LastUpdatedAt field, ensuring that the history of the cache entry is maintained even as its content is overwritten.
The Reservation System
To prevent the "thundering herd" problem—where multiple workers attempt to compute and cache the same result simultaneously—Flyte implements a reservation system within the Manager.
Coordination via Heartbeats
The GetOrExtendReservation method allows a worker to claim a "lock" on a cache key. If a reservation already exists and is held by a different active owner, the Manager returns the current holder's information, signaling to the caller that they should wait.
The system uses a heartbeat mechanism to detect failed workers. A reservation is considered expired only after a grace period, calculated as:
ExpiresAt = Now + (HeartbeatInterval * heartbeatGracePeriodMultiplier)
The heartbeatGracePeriodMultiplier (defaulting to 3) provides a buffer for network jitter or temporary worker stalls.
reservation := &models.Reservation{
Key: reservationKey,
OwnerID: request.GetOwnerId(),
HeartbeatSeconds: int64(heartbeat.Seconds()),
ExpiresAt: now.Add(heartbeat * time.Duration(m.heartbeatGracePeriodMultiplier)),
}
Idempotent Release
Workers release their reservations via ReleaseReservation once they have successfully populated the cache. The implementation is designed to be idempotent; if a reservation is missing (perhaps already released or expired), the service treats the request as successful to simplify client-side cleanup logic.
Design Tradeoffs and Constraints
The architecture of the Flyte Cache Service reflects several specific design tradeoffs:
- Decoupled Storage: By storing only URIs, Flyte avoids the complexity of managing a high-throughput blob store. However, this places the responsibility of data lifecycle management on the client or external storage policies. For instance, the
Deletemethod inManagerexplicitly does not remove the underlying object-storage blob. - Serialized Population: The reservation system prioritizes reducing redundant computation over absolute low-latency access during a cache miss. This is ideal for heavy Flyte tasks where the cost of re-computation far outweighs the cost of a few coordination RPCs.
- Thin Service Layer: The
CacheServiceis intentionally thin, delegating all policy decisions (like when a reservation is "claimable") to theManager. This makes the core logic easier to test in isolation without mocking the entire RPC stack.