Skip to main content

Maintenance and Resource Cleanup

Flyte manages the lifecycle of TaskAction resources by automatically cleaning up terminal records and providing detailed observability into the controller's performance. This ensures the Kubernetes API server remains performant even under high execution volumes.

Configuring the Garbage Collector

The GarbageCollector in Flyte periodically scans for TaskAction resources that have reached a terminal state (Succeeded, Failed, or Aborted) and deletes them once they exceed a configured Time-to-Live (TTL).

Enable and Configure TTL

To enable the garbage collector, you must provide a non-zero interval and a positive maxTTL when initializing the GarbageCollector in the executor setup.

// Example initialization (typically found in executor/setup.go)
gc := controller.NewGarbageCollector(
mgr.GetClient(),
30 * time.Minute, // interval: how often the GC runs
1 * time.Hour, // maxTTL: how long to keep terminal TaskActions
)

// Add to controller-runtime manager
if err := mgr.Add(gc); err != nil {
return err
}

The GarbageCollector implements the manager.Runnable interface, allowing it to start automatically with the controller manager.

How Discovery Works

The GarbageCollector identifies candidates for deletion using specific Kubernetes labels stamped by the TaskActionReconciler. It uses a paginated list approach to minimize memory pressure.

// From executor/pkg/controller/garbage_collector.go
func (gc *GarbageCollector) collect(ctx context.Context) error {
cutoff := time.Now().UTC().Add(-gc.maxTTL).Format(labelTimeFormat)

listOpts := []client.ListOption{
client.MatchingLabels{LabelTerminationStatus: LabelValueTerminated},
client.HasLabels{LabelCompletedTime},
client.Limit(gcPageSize), // Default is 500
}
// ... lists and deletes expired TaskActions ...
}

The discovery relies on these constants defined in executor/pkg/controller/taskaction_controller.go:

  • flyte.org/termination-status: Set to terminated when a task finishes.
  • flyte.org/completed-time: The UTC timestamp of completion using the format 2006-01-02.15-04.

Because the labelTimeFormat uses minute precision and is lexicographically ordered, the GarbageCollector can efficiently compare strings to determine if a resource has expired:

if completedTime < cutoff {
if err := gc.client.Delete(ctx, ta); err != nil {
// handle error
}
}

Monitoring Execution Metrics

Flyte uses OpenTelemetry (OTel) to track the health and throughput of the TaskAction controller. These metrics are defined in the taskActionMetrics struct and registered during executor startup.

Key Metrics

Metric NameTypeDescriptionLabels
taskaction.activeGaugeNumber of TaskAction CRDs currently in the system.phase (e.g., Succeeded, Running)
taskaction.crd.size_bytesHistogramSerialized JSON size of the TaskAction CRD.N/A
taskaction.k8s.durationHistogramLatency of Kubernetes API operations.op (get, update, status_update), error (bool)

Monitoring API Performance

The taskaction.k8s.duration metric is particularly useful for identifying bottlenecks in the Kubernetes API server or network latency. It is recorded inline during reconciliation:

// From executor/pkg/controller/metrics.go
func (m *taskActionMetrics) recordK8sOp(ctx context.Context, op string, start time.Time, err error) {
if m == nil || m.crdOpDuration == nil {
return
}
m.crdOpDuration.Record(ctx, float64(time.Since(start).Microseconds())/1000.0,
metric.WithAttributes(
attribute.String("op", op),
attribute.Bool("error", err != nil),
),
)
}

Tracking Task Phases

The taskaction.active gauge provides a real-time view of the distribution of tasks across different plugin phases. To keep this metric efficient, Flyte reads directly from the controller's informer cache indexer, avoiding expensive deep-copies of the CRDs.

// From executor/pkg/controller/metrics.go
func countByPhase(items []*flyteorgv1.TaskAction) map[string]int64 {
counts := make(map[string]int64, 8)
for _, ta := range items {
phase := ta.Status.PluginPhase
if phase == "" {
phase = "Unknown"
}
counts[phase]++
}
return counts
}

Operational Tips

Disabling Garbage Collection

If you need to preserve terminal TaskAction resources indefinitely for debugging, set the interval to 0. This prevents the GarbageCollector loop from starting.

Handling Large CRDs

If you observe high values for taskaction.crd.size_bytes, it may indicate that task metadata or plugin-specific state is becoming excessively large. This can lead to increased taskaction.k8s.duration during update and status_update operations.

TTL Precision

The GarbageCollector uses minute-level precision for its cutoff calculation. If you require sub-minute cleanup, the current implementation will not support it due to the labelTimeFormat (2006-01-02.15-04) used for indexing.