Constructing Data References
In Flyte, data is addressed using DataReference, a string-based type that represents URIs for various storage backends like S3, GCS, Azure Blob Storage, or local filesystems. To ensure that paths are constructed consistently across these different environments, Flyte provides the ReferenceConstructor interface and its default implementation, URLPathConstructor.
This tutorial walks you through building and managing storage paths using these tools.
Prerequisites
To follow this tutorial, you need the flytestdlib package available in your Go environment:
go get github.com/flyteorg/flyte/v2/flytestdlib
Step 1: Initialize the URLPathConstructor
The URLPathConstructor is the standard tool for joining path components. It treats DataReference strings as URLs and ensures that nested keys are resolved correctly, handling separators automatically.
package main
import (
	"context"
	"fmt"
	"github.com/flyteorg/flyte/v2/flytestdlib/storage"
)
func main() {
	ctx := context.Background()
	constructor := storage.NewURLPathConstructor()
	base := storage.DataReference("s3://my-bucket/outputs")
	// Construct a new reference by appending nested keys to the base
	ref, err := constructor.ConstructReference(ctx, base, "project", "task_id", "metadata.json")
	if err != nil {
		panic(err)
	}
	fmt.Println(ref)
	// Output: s3://my-bucket/outputs/project/task_id/metadata.json
}
The ConstructReference method is designed to be "slash-safe." It ensures the base reference ends with a / before resolving nested keys, preventing the common error where the last component of a base path is accidentally dropped during URL resolution.
Step 2: Build Remote Paths from Local Files
A common pattern in Flyte (used in components like flytecopilot) is uploading a local directory structure to a remote store. You can achieve this by converting local relative paths into keys for the ReferenceConstructor.
import (
	"context"
	"path/filepath"
	"strings"
	"github.com/flyteorg/flyte/v2/flytestdlib/storage"
)
func BuildRemotePath(ctx context.Context, store storage.ReferenceConstructor, remoteBase storage.DataReference, localRoot, filePath string) (storage.DataReference, error) {
	// Get the relative path from the root (e.g., "logs/error.txt")
	rel, err := filepath.Rel(localRoot, filePath)
	if err != nil {
		return "", err
	}
	// Convert local OS separators to URL slashes and split into keys
	keys := strings.Split(filepath.ToSlash(rel), "/")
	// Construct the final remote DataReference
	return store.ConstructReference(ctx, remoteBase, keys...)
}
This approach ensures that regardless of whether your code runs on Windows or Linux, the resulting DataReference in S3 or GCS uses the standard / separator.
Step 3: Create Deterministic Task Output Paths
Flyte plugins often need to generate deterministic paths for task outputs based on execution metadata. This ensures that retries or different nodes don't overwrite each other's data unless intended.
import (
	"context"
	"strconv"
	"github.com/flyteorg/flyte/v2/flytestdlib/storage"
)
// Example inspired by flyteplugins/go/tasks/pluginmachinery/ioutils/raw_output_path.go
func GetTaskOutputPath(ctx context.Context, store storage.ReferenceConstructor, prefix storage.DataReference, project, domain, name, nodeID string, retryAttempt int) (storage.DataReference, error) {
	return store.ConstructReference(ctx, prefix,
		project,
		domain,
		name,
		nodeID,
		strconv.Itoa(retryAttempt),
		"outputs.pb",
	)
}
Passing each metadata field as a separate nested key leaves the joining logic to the ReferenceConstructor, which produces a clean, hierarchical layout in your storage bucket.
Step 4: Inspect and Split Data References
Once you have a DataReference, you may need to extract specific components like the bucket name or the object key for storage-specific operations. The Split() method handles this, including special logic for complex schemes like Azure ADLS Gen2 (abfs://).
import (
	"fmt"
	"github.com/flyteorg/flyte/v2/flytestdlib/storage"
)
// InspectReference prints the scheme, container, and key of a DataReference.
func InspectReference(ref storage.DataReference) {
	scheme, container, key, err := ref.Split()
	if err != nil {
		fmt.Printf("Error splitting: %v\n", err)
		return
	}
	fmt.Printf("Scheme: %s\n", scheme)       // e.g., "s3"
	fmt.Printf("Container: %s\n", container) // e.g., "my-bucket"
	fmt.Printf("Key: %s\n", key)             // e.g., "outputs/metadata.json"
}
For Azure abfs schemes, Split() correctly identifies the filesystem (container) from the userinfo part of the URL (e.g., abfs://container@account.dfs.core.windows.net/), which standard URL parsers might otherwise miss.
Summary
By using DataReference and ReferenceConstructor, you ensure that:
- Paths are consistent: Separators are handled correctly across different operating systems and storage providers.
- Logic is reusable: The same path-building code works for S3, GCS, and local development.
- Metadata is accessible: You can easily decompose URIs into their constituent parts for backend-specific API calls.
For most use cases, you will interact with these through a DataStore object, which embeds a ReferenceConstructor and provides higher-level methods for reading and writing data.