Maintenance and Resource Cleanup
Flyte manages the lifecycle of TaskAction resources by automatically cleaning up terminal records and providing detailed observability into the controller's performance. This ensures the Kubernetes API server remains performant even under high execution volumes.
Configuring the Garbage Collector
The GarbageCollector in Flyte periodically scans for TaskAction resources that have reached a terminal state (Succeeded, Failed, or Aborted) and deletes them once they exceed a configured Time-to-Live (TTL).
Enable and Configure TTL
To enable the garbage collector, you must provide a non-zero interval and a positive maxTTL when initializing the GarbageCollector in the executor setup.
// Example initialization (typically found in executor/setup.go)
gc := controller.NewGarbageCollector(
mgr.GetClient(),
30 * time.Minute, // interval: how often the GC runs
1 * time.Hour, // maxTTL: how long to keep terminal TaskActions
)
// Add to controller-runtime manager
if err := mgr.Add(gc); err != nil {
return err
}
The GarbageCollector implements the manager.Runnable interface, allowing it to start automatically with the controller manager.
How Discovery Works
The GarbageCollector identifies candidates for deletion using specific Kubernetes labels stamped by the TaskActionReconciler. It uses a paginated list approach to minimize memory pressure.
// From executor/pkg/controller/garbage_collector.go
func (gc *GarbageCollector) collect(ctx context.Context) error {
cutoff := time.Now().UTC().Add(-gc.maxTTL).Format(labelTimeFormat)
listOpts := []client.ListOption{
client.MatchingLabels{LabelTerminationStatus: LabelValueTerminated},
client.HasLabels{LabelCompletedTime},
client.Limit(gcPageSize), // Default is 500
}
// ... lists and deletes expired TaskActions ...
}
The discovery relies on these constants defined in executor/pkg/controller/taskaction_controller.go:
flyte.org/termination-status: Set toterminatedwhen a task finishes.flyte.org/completed-time: The UTC timestamp of completion using the format2006-01-02.15-04.
Because the labelTimeFormat uses minute precision and is lexicographically ordered, the GarbageCollector can efficiently compare strings to determine if a resource has expired:
if completedTime < cutoff {
if err := gc.client.Delete(ctx, ta); err != nil {
// handle error
}
}
Monitoring Execution Metrics
Flyte uses OpenTelemetry (OTel) to track the health and throughput of the TaskAction controller. These metrics are defined in the taskActionMetrics struct and registered during executor startup.
Key Metrics
| Metric Name | Type | Description | Labels |
|---|---|---|---|
taskaction.active | Gauge | Number of TaskAction CRDs currently in the system. | phase (e.g., Succeeded, Running) |
taskaction.crd.size_bytes | Histogram | Serialized JSON size of the TaskAction CRD. | N/A |
taskaction.k8s.duration | Histogram | Latency of Kubernetes API operations. | op (get, update, status_update), error (bool) |
Monitoring API Performance
The taskaction.k8s.duration metric is particularly useful for identifying bottlenecks in the Kubernetes API server or network latency. It is recorded inline during reconciliation:
// From executor/pkg/controller/metrics.go
func (m *taskActionMetrics) recordK8sOp(ctx context.Context, op string, start time.Time, err error) {
if m == nil || m.crdOpDuration == nil {
return
}
m.crdOpDuration.Record(ctx, float64(time.Since(start).Microseconds())/1000.0,
metric.WithAttributes(
attribute.String("op", op),
attribute.Bool("error", err != nil),
),
)
}
Tracking Task Phases
The taskaction.active gauge provides a real-time view of the distribution of tasks across different plugin phases. To keep this metric efficient, Flyte reads directly from the controller's informer cache indexer, avoiding expensive deep-copies of the CRDs.
// From executor/pkg/controller/metrics.go
func countByPhase(items []*flyteorgv1.TaskAction) map[string]int64 {
counts := make(map[string]int64, 8)
for _, ta := range items {
phase := ta.Status.PluginPhase
if phase == "" {
phase = "Unknown"
}
counts[phase]++
}
return counts
}
Operational Tips
Disabling Garbage Collection
If you need to preserve terminal TaskAction resources indefinitely for debugging, set the interval to 0. This prevents the GarbageCollector loop from starting.
Handling Large CRDs
If you observe high values for taskaction.crd.size_bytes, it may indicate that task metadata or plugin-specific state is becoming excessively large. This can lead to increased taskaction.k8s.duration during update and status_update operations.
TTL Precision
The GarbageCollector uses minute-level precision for its cutoff calculation. If you require sub-minute cleanup, the current implementation will not support it due to the labelTimeFormat (2006-01-02.15-04) used for indexing.