Overview

Flyte is an open-source orchestrator designed to manage the lifecycle of machine learning and data processing workflows. It provides a structured way to define, execute, and track complex pipelines across distributed infrastructure, ensuring reliability through built-in retries, state management, and observability.

The Problem: Distributed Chaos

Building ML and data pipelines often involves stitching together disparate tools (Spark, Ray, Kubernetes pods) while handling transient failures, data lineage, and resource scaling. Without a central orchestrator, developers end up writing "glue code" for retries, state persistence, and monitoring, which is brittle and hard to maintain.

Flyte solves this by providing a unified API and a robust execution engine that abstracts away the underlying infrastructure, allowing engineers to focus on business logic rather than distributed systems plumbing.

Core Concepts

Run: A single execution of a workflow or task. In Flyte's internal model, a Run is simply a "root action" that can spawn child actions.
Action: The fundamental unit of work. Actions can be individual tasks (like running a container) or complex structures (like a sub-workflow or a conditional branch).
Attempt: Every action can have multiple attempts. Flyte automatically tracks these attempts, recording their logs, phase transitions (e.g., Queued -> Running -> Succeeded), and any errors.
Plugin: The execution logic for an action. Flyte uses plugins to interface with different backends, such as Kubernetes Pods, AWS Batch, or specialized engines like Spark and Ray.
TaskAction CRD: A Kubernetes Custom Resource that acts as the bridge between the Flyte control plane and the execution cluster.

How it Works: The Architecture

Flyte 2 follows a decoupled architecture where state management and execution are handled by specialized services:

Runs Service: The source of truth. It manages the database (PostgreSQL or SQLite) and exposes a gRPC/Connect API for creating and monitoring runs.
Manager (Executor): A Kubernetes controller that watches for TaskAction resources. When a new action is enqueued, the Manager resolves the appropriate plugin and handles the actual execution (e.g., launching a Pod).
Events Proxy: As actions progress in the cluster, the Manager sends real-time events back to the Runs Service to update the global state.
Offloaded Storage: Large inputs and outputs are stored in object storage (S3/GCS/Minio), with only the metadata and URIs persisted in the database.

Use Cases

Creating a Containerized Run

You can trigger a run by providing a TaskSpec that defines the container image and command to execute.

import (
    "github.com/flyteorg/flyte/v2/gen/go/flyteidl2/common"
    "github.com/flyteorg/flyte/v2/gen/go/flyteidl2/core"
    "github.com/flyteorg/flyte/v2/gen/go/flyteidl2/task"
    "github.com/flyteorg/flyte/v2/gen/go/flyteidl2/workflow"
    "github.com/flyteorg/flyte/v2/gen/go/flyteidl2/workflow/workflowconnect"
)

// Initialize the client
client := workflowconnect.NewRunServiceClient(http.DefaultClient, "http://localhost:8090")

// Define and create the run
req := &workflow.CreateRunRequest{
    Id: &workflow.CreateRunRequest_ProjectId{
        ProjectId: &common.ProjectIdentifier{Name: "my-project", Domain: "development"},
    },
    Task: &workflow.CreateRunRequest_TaskSpec{
        TaskSpec: &task.TaskSpec{
            TaskTemplate: &core.TaskTemplate{
                Target: &core.TaskTemplate_Container{
                    Container: &core.Container{
                        Image: "python:3.9",
                        Args:  []string{"python", "-c", "print('Hello Flyte')"},
                    },
                },
            },
        },
    },
}
resp, err := client.CreateRun(ctx, connect.NewRequest(req))

Monitoring Action Progress

Since Flyte tracks every attempt and phase transition, you can query the status of specific actions within a run.

# List actions for a specific run using the CLI
runs-client list-actions --project my-project --domain development --run-name r12345

When to Use Flyte

Use Flyte when you need to run complex, multi-step ML pipelines that require high reliability, resource isolation, and a clear audit trail of every execution.
Use Flyte when you want to provide a "Platform as a Service" experience for data scientists, allowing them to run code on K8s without managing YAML.
Do NOT use Flyte when you have simple, short-lived scripts that don't require retries or distributed execution. A simple cron job or CI/CD runner might be sufficient.

Stack Compatibility

Primary Language: Go (Services and Manager)
Runtime Environment: Kubernetes (required for the Manager/Executor)
Database: PostgreSQL (Production) or SQLite (Local Development)
Storage: S3, GCS, or any S3-compatible storage (Minio)
Protocols: gRPC and Connect (HTTP/1.1 or HTTP/2)

Getting Started Pointers

Explore the Core API Definitions (v1) in the flyteidl2 package.
Set up a local development environment using the Sandbox Utilities (a k3d-based cluster).
Check the manager/ directory for the Kubernetes Controller implementation.

Limitations & Assumptions

Kubernetes Dependency: The execution engine is tightly coupled with Kubernetes. While the Runs Service can run anywhere, the Manager requires a K8s cluster.
Eventually Consistent Events: Action updates are reported via an events proxy; there may be a slight delay between a pod finishing and the Runs Service reflecting the final state.
IDL-First: Changes to the core data model require updating Protocol Buffer definitions and regenerating code across multiple languages.

FAQ

What is the difference between a Run and an Action? In Flyte 2, they are the same underlying model. A "Run" is simply the top-level (root) Action of an execution.

How does Flyte handle data between tasks? Flyte uses "Offloaded Storage." Inputs and outputs are written to a configured object store (like S3), and the URIs are passed between actions.

Can I run Flyte without Kubernetes? The Runs Service (API/DB) can run as a standalone binary, but the Manager (which actually executes the tasks) requires Kubernetes to manage the lifecycle of pods and other resources.

Does Flyte support local development? Yes, the project includes a devbox that spins up a local k3d cluster and uses SQLite by default, making it easy to test the full stack on a laptop.

The Problem: Distributed Chaos​

Core Concepts​

How it Works: The Architecture​

Use Cases​

Creating a Containerized Run​

Monitoring Action Progress​

When to Use Flyte​

Stack Compatibility​

Getting Started Pointers​

Limitations & Assumptions​

FAQ​