atdata

A loose federation of distributed, typed datasets built on WebDataset

The Challenge

Machine learning datasets are everywhere—training data, validation sets, embeddings, features, model outputs. Yet working with them often means:

  • Runtime surprises: Discovering a field is missing or has the wrong type during training
  • Copy-paste schemas: Redefining the same sample structure across notebooks and scripts
  • Storage silos: Data stuck in one location, invisible to collaborators
  • Discovery friction: No standard way to find datasets across teams or organizations

atdata solves these problems with a simple idea: typed, serializable samples that flow seamlessly from local development to team storage to federated sharing.

What is atdata?

atdata is a Python library that combines:

Typed Samples

Define dataclass-based sample types with automatic msgpack serialization. Catch schema errors at definition time, not training time.
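A rough sketch of the idea, using only the standard library (atdata's actual @packable machinery handles msgpack and NDArray fields; the `unpack` helper below is purely illustrative): fields declared once on a dataclass round-trip through a plain dict, and a schema mismatch fails immediately rather than mid-training.

```python
from dataclasses import asdict, dataclass, fields

@dataclass
class LabelSample:
    label: str
    confidence: float

def unpack(cls, payload: dict):
    # Validate field names up front so a schema mismatch fails
    # here, not deep inside a training loop.
    expected = {f.name for f in fields(cls)}
    if set(payload) != expected:
        raise TypeError(f"schema mismatch: got {set(payload)}, want {expected}")
    return cls(**payload)

sample = LabelSample(label="cat", confidence=0.95)
packed = asdict(sample)                 # {'label': 'cat', 'confidence': 0.95}
restored = unpack(LabelSample, packed)  # round-trips cleanly
```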

Efficient Storage

Built on WebDataset’s proven tar-based format. Stream large datasets without downloading everything first.
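The tar layout is what makes streaming possible, and the mechanics can be seen with the standard library alone. This sketch (independent of atdata's own readers) writes two members to an in-memory tar and reads them back sequentially with `tarfile`'s pipe mode, which never seeks and so works over a network stream:

```python
import io
import tarfile

# Build a tiny tar archive in memory (stand-in for a shard on disk or S3).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for key in ("sample_000000", "sample_000001"):
        data = f"payload for {key}".encode()
        info = tarfile.TarInfo(name=f"{key}.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# Stream members one at a time: mode "r|" treats the archive as a
# pipe, so samples can be consumed without downloading everything first.
buf.seek(0)
names = []
with tarfile.open(fileobj=buf, mode="r|") as tar:
    for member in tar:
        names.append(member.name)
```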

Lens Transformations

View datasets through different schemas without duplicating data. Perfect for feature extraction, schema migration, and multi-task learning.
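The term comes from functional programming: a lens pairs a *get* (project a sample into another schema) with a *put* (merge changes back into the original). A minimal dict-based sketch of the concept, not atdata's actual lens API:

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

S = TypeVar("S")  # source sample type
V = TypeVar("V")  # view sample type

@dataclass
class Lens(Generic[S, V]):
    get: Callable[[S], V]     # project source -> view
    put: Callable[[S, V], S]  # merge an updated view back into the source

# View a full sample through a label-only schema.
label_lens = Lens(
    get=lambda s: {"label": s["label"]},
    put=lambda s, v: {**s, **v},
)

full = {"image": b"\x00" * 8, "label": "cat", "confidence": 0.95}
view = label_lens.get(full)                     # label-only view
fixed = label_lens.put(full, {"label": "dog"})  # relabel without copying image
```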

Batch Aggregation

Automatic numpy stacking for NDArray fields. No more manual collation code—just iterate and train.
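What automatic stacking replaces, sketched by hand with numpy (atdata does the equivalent per NDArray field so you never write this yourself):

```python
import numpy as np

# Three samples, each with an array field and a scalar field.
samples = [
    {"image": np.zeros((4, 4), dtype=np.float32), "label": "cat"},
    {"image": np.ones((4, 4), dtype=np.float32), "label": "dog"},
    {"image": np.full((4, 4), 2, dtype=np.float32), "label": "cat"},
]

# Manual collation: stack array fields, collect scalar fields into lists.
images = np.stack([s["image"] for s in samples])  # shape (3, 4, 4)
labels = [s["label"] for s in samples]            # ['cat', 'dog', 'cat']
```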

Team Storage

Redis + S3 backend for shared dataset indexes. Publish schemas, track versions, and enable team discovery.

ATProto Federation

Publish datasets to the decentralized AT Protocol network. Enable cross-organization discovery without centralized infrastructure.

The Architecture

atdata provides a three-layer progression for your datasets:

┌─────────────────────────────────────────────────────────────┐
│  Federation: ATProto Atmosphere                             │
│  Decentralized discovery, cross-org sharing                 │
└─────────────────────────────────────────────────────────────┘
                              ↑ promote
┌─────────────────────────────────────────────────────────────┐
│  Team Storage: Redis + S3                                   │
│  Shared index, versioned schemas, S3 data                   │
└─────────────────────────────────────────────────────────────┘
                              ↑ insert
┌─────────────────────────────────────────────────────────────┐
│  Local Development                                          │
│  Typed samples, WebDataset files, fast iteration            │
└─────────────────────────────────────────────────────────────┘

Start local, scale to your team, and optionally share with the world—all with the same sample types and consistent APIs.

Installation

pip install atdata

# With ATProto support
pip install atdata[atmosphere]

Quick Example

1. Define a Sample Type

The @packable decorator creates a serializable dataclass:

import numpy as np
from numpy.typing import NDArray
import atdata

@atdata.packable
class ImageSample:
    image: NDArray      # Automatically handled as bytes
    label: str
    confidence: float

2. Create and Write Samples

Use WebDataset’s standard TarWriter:

import webdataset as wds

samples = [
    ImageSample(
        image=np.random.rand(224, 224, 3).astype(np.float32),
        label="cat",
        confidence=0.95,
    )
    for _ in range(100)
]

with wds.TarWriter("data-000000.tar") as sink:
    for i, sample in enumerate(samples):
        sink.write({**sample.as_wds, "__key__": f"sample_{i:06d}"})

3. Load and Iterate with Type Safety

The generic Dataset[T] provides typed access:

dataset = atdata.Dataset[ImageSample]("data-000000.tar")

for batch in dataset.shuffled(batch_size=32):
    images = batch.image      # numpy array (32, 224, 224, 3)
    labels = batch.label      # list of 32 strings
    confs = batch.confidence  # list of 32 floats

Scaling Up

Team Storage with Redis + S3

When you’re ready to share with your team:

from atdata.local import LocalIndex, S3DataStore

# Connect to team infrastructure
store = S3DataStore(
    credentials={"AWS_ENDPOINT": "http://localhost:9000", ...},
    bucket="team-datasets",
)
index = LocalIndex(data_store=store)

# Publish schema for consistency
index.publish_schema(ImageSample, version="1.0.0")

# Insert dataset (writes to S3, indexes in Redis)
dataset = atdata.Dataset[ImageSample]("data.tar")
entry = index.insert_dataset(dataset, name="training-images-v1")

# Team members can now discover and load
# ds = atdata.load_dataset("@local/training-images-v1", index=index)

Federation with ATProto

For public or cross-organization sharing:

from atdata.atmosphere import AtmosphereClient, AtmosphereIndex, PDSBlobStore
from atdata.promote import promote_to_atmosphere

# Authenticate with your ATProto identity
client = AtmosphereClient()
client.login("handle.bsky.social", "app-password")

# Option 1: Promote existing local dataset
entry = index.get_dataset("training-images-v1")
at_uri = promote_to_atmosphere(entry, index, client)

# Option 2: Publish directly with blob storage
store = PDSBlobStore(client)
atm_index = AtmosphereIndex(client, data_store=store)
# schema_uri: AT-URI of a previously published schema record
atm_index.insert_dataset(dataset, name="public-images", schema_ref=schema_uri)

HuggingFace-Style Loading

For convenient access to datasets:

from atdata import load_dataset

# Load from local files
ds = load_dataset("path/to/data-{000000..000009}.tar")

# Load with split detection
ds_dict = load_dataset("path/to/data/")
train_ds = ds_dict["train"]
test_ds = ds_dict["test"]

# Load from index
ds = load_dataset("@local/my-dataset", index=index)
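The {000000..000009} pattern above is WebDataset-style brace notation for a numeric range of shards. A small sketch of how such a pattern expands, assuming a single zero-padded range per pattern (the `expand_shards` helper is illustrative, not part of atdata):

```python
import re

def expand_shards(pattern: str) -> list[str]:
    # Expand one "{lo..hi}" range, preserving zero padding.
    m = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if m is None:
        return [pattern]
    lo, hi = m.group(1), m.group(2)
    width = len(lo)
    return [
        pattern[:m.start()] + str(i).zfill(width) + pattern[m.end():]
        for i in range(int(lo), int(hi) + 1)
    ]

shards = expand_shards("path/to/data-{000000..000002}.tar")
```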

Why atdata?

Need                           Solution
─────────────────────────────  ────────────────────────────────────────────────
Type-safe samples              @packable decorator, PackableSample base class
Efficient large-scale storage  WebDataset tar format, streaming iteration
Schema flexibility             Lens transformations, DictSample for exploration
Team collaboration             Redis index, S3 data store, schema registry
Public sharing                 ATProto federation, content-addressable CIDs
Multiple backends              Protocol abstractions (AbstractIndex, DataSource)

Next Steps

Getting Started

New to atdata? Start with the Quick Start Tutorial to learn the basics of typed samples and datasets.