Overview
Flyte is an open-source orchestrator designed to manage the lifecycle of machine learning and data processing workflows. It provides a structured way to define, execute, and track complex pipelines across distributed infrastructure, ensuring reliability through built-in retries, state management, and observability.
The Problem: Distributed Chaos
Building ML and data pipelines often involves stitching together disparate tools (Spark, Ray, Kubernetes pods) while handling transient failures, data lineage, and resource scaling. Without a central orchestrator, developers end up writing "glue code" for retries, state persistence, and monitoring, which is brittle and hard to maintain.
Flyte solves this by providing a unified API and a robust execution engine that abstracts away the underlying infrastructure, allowing engineers to focus on business logic rather than distributed systems plumbing.
Core Concepts
- Run: A single execution of a workflow or task. In Flyte's internal model, a Run is simply a "root action" that can spawn child actions.
- Action: The fundamental unit of work. Actions can be individual tasks (like running a container) or complex structures (like a sub-workflow or a conditional branch).
- Attempt: Every action can have multiple attempts. Flyte automatically tracks these attempts, recording their logs, phase transitions (e.g., Queued -> Running -> Succeeded), and any errors.
- Plugin: The execution logic for an action. Flyte uses plugins to interface with different backends, such as Kubernetes Pods, AWS Batch, or specialized engines like Spark and Ray.
- TaskAction CRD: A Kubernetes Custom Resource that acts as the bridge between the Flyte control plane and the execution cluster.
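The relationship between runs, actions, and attempts described above can be sketched as a small data model. This is only an illustration: the type names, fields, and phase constants below are hypothetical and are not taken from Flyte's actual schema or IDL.

```go
package main

import "fmt"

// Phase is a hypothetical enum; the names mirror the example transitions in the text.
type Phase string

const (
	PhaseQueued    Phase = "Queued"
	PhaseRunning   Phase = "Running"
	PhaseSucceeded Phase = "Succeeded"
	PhaseFailed    Phase = "Failed"
)

// Attempt records one try at executing an action (illustrative fields only).
type Attempt struct {
	Number int
	Phases []Phase // ordered phase transitions for this attempt
	Err    string  // empty unless the attempt failed
}

// Action is the unit of work. A Run is modeled as an Action with no parent.
type Action struct {
	Name     string
	Parent   *Action
	Children []*Action
	Attempts []*Attempt
}

// NewRun creates a root action, which is what the text calls a Run.
func NewRun(name string) *Action { return &Action{Name: name} }

// Spawn attaches a child action to its parent.
func (a *Action) Spawn(name string) *Action {
	child := &Action{Name: name, Parent: a}
	a.Children = append(a.Children, child)
	return child
}

// IsRun reports whether the action is a root action.
func (a *Action) IsRun() bool { return a.Parent == nil }

// StartAttempt appends a new attempt that begins in the Queued phase.
func (a *Action) StartAttempt() *Attempt {
	at := &Attempt{Number: len(a.Attempts) + 1, Phases: []Phase{PhaseQueued}}
	a.Attempts = append(a.Attempts, at)
	return at
}

// Transition records a phase change on the attempt.
func (at *Attempt) Transition(p Phase) { at.Phases = append(at.Phases, p) }

// LatestPhase returns the last phase of the most recent attempt, or "" if none exist.
func (a *Action) LatestPhase() Phase {
	if len(a.Attempts) == 0 {
		return ""
	}
	last := a.Attempts[len(a.Attempts)-1]
	return last.Phases[len(last.Phases)-1]
}

func main() {
	run := NewRun("r12345")
	task := run.Spawn("train-model")

	// First attempt fails, second attempt succeeds.
	first := task.StartAttempt()
	first.Transition(PhaseRunning)
	first.Transition(PhaseFailed)
	first.Err = "transient failure"

	second := task.StartAttempt()
	second.Transition(PhaseRunning)
	second.Transition(PhaseSucceeded)

	fmt.Println(run.IsRun(), task.IsRun(), len(task.Attempts), task.LatestPhase())
}
```

Modeling a Run as an Action without a parent means the same queries (list children, list attempts, read the latest phase) work at every level of the tree.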
How it Works: The Architecture
Flyte 2 follows a decoupled architecture where state management and execution are handled by specialized services:
- Runs Service: The source of truth. It manages the database (PostgreSQL or SQLite) and exposes a gRPC/Connect API for creating and monitoring runs.
- Manager (Executor): A Kubernetes controller that watches for TaskAction resources. When a new action is enqueued, the Manager resolves the appropriate plugin and handles the actual execution (e.g., launching a Pod).
- Events Proxy: As actions progress in the cluster, the Manager sends real-time events back to the Runs Service to update the global state.
- Offloaded Storage: Large inputs and outputs are stored in object storage (S3/GCS/Minio), with only the metadata and URIs persisted in the database.
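The Manager's role (resolve a plugin, execute, report events) can be illustrated with a minimal sketch. The `Plugin` interface, `Manager` struct, task type strings, and `Event` shape are all hypothetical names chosen for this example; the real controller works against TaskAction resources and Flyte's own plugin machinery, and a buffered channel stands in for the events path to the Runs Service.

```go
package main

import "fmt"

// Plugin is a hypothetical interface for backend-specific execution logic.
type Plugin interface {
	Execute(actionName string) error
}

// podPlugin pretends to launch a Kubernetes Pod; it always succeeds here.
type podPlugin struct{}

func (podPlugin) Execute(actionName string) error { return nil }

// Event is a simplified phase update sent back toward the Runs Service.
type Event struct {
	Action string
	Phase  string
}

// Manager maps task types to plugins and emits events as actions progress.
type Manager struct {
	plugins map[string]Plugin
	events  chan<- Event
}

func NewManager(events chan<- Event) *Manager {
	return &Manager{plugins: map[string]Plugin{}, events: events}
}

// Register associates a task type with the plugin that executes it.
func (m *Manager) Register(taskType string, p Plugin) { m.plugins[taskType] = p }

// Handle resolves the plugin for the task type, runs it, and reports each phase.
func (m *Manager) Handle(actionName, taskType string) error {
	p, ok := m.plugins[taskType]
	if !ok {
		m.events <- Event{Action: actionName, Phase: "Failed"}
		return fmt.Errorf("no plugin registered for task type %q", taskType)
	}
	m.events <- Event{Action: actionName, Phase: "Running"}
	if err := p.Execute(actionName); err != nil {
		m.events <- Event{Action: actionName, Phase: "Failed"}
		return err
	}
	m.events <- Event{Action: actionName, Phase: "Succeeded"}
	return nil
}

func main() {
	events := make(chan Event, 8) // buffered so Handle never blocks in this demo
	m := NewManager(events)
	m.Register("container", podPlugin{})

	if err := m.Handle("a0", "container"); err != nil {
		fmt.Println("error:", err)
	}
	close(events)
	for e := range events {
		fmt.Printf("%s -> %s\n", e.Action, e.Phase)
	}
}
```

Keeping plugin lookup behind a single map makes it cheap to add a backend: registering a new task type does not touch the dispatch or event-reporting code.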
Use Cases
Creating a Containerized Run
You can trigger a run by providing a TaskSpec that defines the container image and command to execute.
package main

import (
	"context"
	"log"
	"net/http"

	// Assumed Connect module path; adjust if the repository pins a different one.
	"connectrpc.com/connect"

	"github.com/flyteorg/flyte/v2/gen/go/flyteidl2/common"
	"github.com/flyteorg/flyte/v2/gen/go/flyteidl2/core"
	"github.com/flyteorg/flyte/v2/gen/go/flyteidl2/task"
	"github.com/flyteorg/flyte/v2/gen/go/flyteidl2/workflow"
	"github.com/flyteorg/flyte/v2/gen/go/flyteidl2/workflow/workflowconnect"
)

func main() {
	ctx := context.Background()

	// Initialize the client against a locally running Runs Service
	client := workflowconnect.NewRunServiceClient(http.DefaultClient, "http://localhost:8090")

	// Define the run: a single container task in the given project and domain
	req := &workflow.CreateRunRequest{
		Id: &workflow.CreateRunRequest_ProjectId{
			ProjectId: &common.ProjectIdentifier{Name: "my-project", Domain: "development"},
		},
		Task: &workflow.CreateRunRequest_TaskSpec{
			TaskSpec: &task.TaskSpec{
				TaskTemplate: &core.TaskTemplate{
					Target: &core.TaskTemplate_Container{
						Container: &core.Container{
							Image: "python:3.9",
							Args:  []string{"python", "-c", "print('Hello Flyte')"},
						},
					},
				},
			},
		},
	}

	// Create the run and surface any transport or server error
	resp, err := client.CreateRun(ctx, connect.NewRequest(req))
	if err != nil {
		log.Fatalf("create run: %v", err)
	}
	log.Printf("created run: %v", resp.Msg)
}
Monitoring Action Progress
Since Flyte tracks every attempt and phase transition, you can query the status of specific actions within a run.
# List actions for a specific run using the CLI
runs-client list-actions --project my-project --domain development --run-name r12345
When to Use Flyte
- Use Flyte when you need to run complex, multi-step ML pipelines that require high reliability, resource isolation, and a clear audit trail of every execution.
- Use Flyte when you want to provide a "Platform as a Service" experience for data scientists, allowing them to run code on K8s without managing YAML.
- Do NOT use Flyte when you have simple, short-lived scripts that don't require retries or distributed execution. A simple cron job or CI/CD runner might be sufficient.
Stack Compatibility
- Primary Language: Go (Services and Manager)
- Runtime Environment: Kubernetes (required for the Manager/Executor)
- Database: PostgreSQL (Production) or SQLite (Local Development)
- Storage: S3, GCS, or any S3-compatible storage (Minio)
- Protocols: gRPC and Connect (HTTP/1.1 or HTTP/2)
Getting Started Pointers
- Explore the Core API Definitions (v1) in the flyteidl2 package.
- Set up a local development environment using the Sandbox Utilities (a k3d-based cluster).
- Check the manager/ directory for the Kubernetes Controller implementation.
Limitations & Assumptions
- Kubernetes Dependency: The execution engine is tightly coupled with Kubernetes. While the Runs Service can run anywhere, the Manager requires a K8s cluster.
- Eventually Consistent Events: Action updates are reported via an events proxy; there may be a slight delay between a pod finishing and the Runs Service reflecting the final state.
- IDL-First: Changes to the core data model require updating Protocol Buffer definitions and regenerating code across multiple languages.
FAQ
What is the difference between a Run and an Action? In Flyte 2, they are the same underlying model. A "Run" is simply the top-level (root) Action of an execution.
How does Flyte handle data between tasks? Flyte uses "Offloaded Storage." Inputs and outputs are written to a configured object store (like S3), and the URIs are passed between actions.
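To make the URI-passing idea concrete, here is a minimal sketch in which an in-memory map stands in for the object store. `ObjectStore`, `ActionRecord`, `RunAction`, and the `s3://example-bucket/...` paths are hypothetical and exist only for this illustration; Flyte's real storage layout and record schema are not shown here.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// ObjectStore is a hypothetical in-memory stand-in for S3, GCS, or Minio.
type ObjectStore struct {
	mu    sync.Mutex
	blobs map[string][]byte
}

func NewObjectStore() *ObjectStore {
	return &ObjectStore{blobs: map[string][]byte{}}
}

// Put stores a copy of data under the given URI.
func (s *ObjectStore) Put(uri string, data []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.blobs[uri] = append([]byte(nil), data...)
}

// Get returns a copy of the blob at uri, or an error if it does not exist.
func (s *ObjectStore) Get(uri string) ([]byte, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	data, ok := s.blobs[uri]
	if !ok {
		return nil, fmt.Errorf("object not found: %s", uri)
	}
	return append([]byte(nil), data...), nil
}

// ActionRecord is what a database row would hold: metadata and URIs, not payloads.
type ActionRecord struct {
	Name      string
	InputURI  string
	OutputURI string
}

// RunAction reads the input by URI, applies fn, and offloads the result,
// recording only the output URI on the record.
func RunAction(store *ObjectStore, rec *ActionRecord, fn func([]byte) []byte) error {
	in, err := store.Get(rec.InputURI)
	if err != nil {
		return err
	}
	rec.OutputURI = fmt.Sprintf("s3://example-bucket/%s/outputs", rec.Name)
	store.Put(rec.OutputURI, fn(in))
	return nil
}

func main() {
	store := NewObjectStore()
	store.Put("s3://example-bucket/raw", []byte("hello"))

	upper := func(b []byte) []byte { return []byte(strings.ToUpper(string(b))) }

	extract := &ActionRecord{Name: "extract", InputURI: "s3://example-bucket/raw"}
	if err := RunAction(store, extract, upper); err != nil {
		fmt.Println("error:", err)
		return
	}

	// The downstream action receives only the upstream output URI.
	load := &ActionRecord{Name: "load", InputURI: extract.OutputURI}
	if err := RunAction(store, load, upper); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(load.InputURI, load.OutputURI)
}
```

Passing references instead of payloads keeps database rows small regardless of how large the inputs and outputs grow.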
Can I run Flyte without Kubernetes? The Runs Service (API/DB) can run as a standalone binary, but the Manager (which actually executes the tasks) requires Kubernetes to manage the lifecycle of pods and other resources.
Does Flyte support local development? Yes, the project includes a devbox that spins up a local k3d cluster and uses SQLite by default, making it easy to test the full stack on a laptop.