DataSource

DataSource()

Protocol for data sources that provide streams to Dataset.

A DataSource abstracts over different ways of accessing dataset shards:

- URLSource: Standard WebDataset-compatible URLs (http, https, pipe, gs, etc.)
- S3Source: S3-compatible storage with explicit credentials
- BlobSource: ATProto blob references (future)

The key method is shards(), which yields (identifier, stream) pairs. These are fed directly to WebDataset’s tar_file_expander, bypassing URL resolution entirely. This enables:

- Private S3 repos with credentials
- Custom endpoints (Cloudflare R2, MinIO)
- ATProto blob streaming
- Any other source that can provide file-like objects

Examples

>>> source = S3Source(
...     bucket="my-bucket",
...     keys=["data-000.tar", "data-001.tar"],
...     endpoint="https://r2.example.com",
...     credentials=creds,
... )
>>> ds = Dataset[MySample](source)
>>> for sample in ds.ordered():
...     print(sample)
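
Because any object that can hand out file-like objects qualifies, a custom source can be quite small. The following is a minimal sketch, not part of the library: InMemorySource and make_tar are hypothetical names, and the member shapes (a shards property plus list_shards and open_shard) are assumed from the descriptions on this page. It keeps tar archives in memory and serves them as byte streams.

```python
import io
import tarfile
from typing import IO, Iterator


def make_tar(files: dict[str, bytes]) -> bytes:
    """Build an in-memory tar archive from a name -> payload mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


class InMemorySource:
    """Hypothetical source that serves shards from a dict of tar bytes."""

    def __init__(self, blobs: dict[str, bytes]) -> None:
        self._blobs = blobs

    @property
    def shards(self) -> Iterator[tuple[str, IO[bytes]]]:
        # Lazily yield (identifier, stream) pairs, one shard at a time.
        for shard_id in self._blobs:
            yield shard_id, self.open_shard(shard_id)

    def list_shards(self) -> list[str]:
        # Identifiers only; no stream is opened here.
        return list(self._blobs)

    def open_shard(self, shard_id: str) -> IO[bytes]:
        # The dict lookup raises KeyError for an unknown identifier.
        return io.BytesIO(self._blobs[shard_id])


source = InMemorySource(
    {"data-000.tar": make_tar({"sample0.txt": b"hello"})}
)
for shard_id, stream in source.shards:
    with tarfile.open(fileobj=stream) as tar:
        print(shard_id, tar.getnames())
```

Whether Dataset accepts such an object directly depends on how the library checks the protocol; the sketch only illustrates the three members described here.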

Attributes

Name Description
shards Lazily yield (identifier, stream) pairs for each shard.

Methods

Name Description
list_shards Get list of shard identifiers without opening streams.
open_shard Open a single shard by its identifier.

list_shards

DataSource.list_shards()

Get list of shard identifiers without opening streams.

Used for metadata queries like counting shards without actually streaming data. Implementations should return identifiers that match what shards would yield.

Returns

Name Type Description
list[str] List of shard identifier strings.
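
As an illustration of the metadata use case, the sketch below counts shards without opening any stream and then checks that the identifiers agree with what shards yields. CountingSource is a hypothetical stand-in written for this example; its opened counter exists only to make the behavior visible.

```python
import io
from typing import IO, Iterator


class CountingSource:
    """Hypothetical source that records how many streams it has opened."""

    def __init__(self, blobs: dict[str, bytes]) -> None:
        self._blobs = blobs
        self.opened = 0

    @property
    def shards(self) -> Iterator[tuple[str, IO[bytes]]]:
        for shard_id in self._blobs:
            yield shard_id, self.open_shard(shard_id)

    def list_shards(self) -> list[str]:
        # Pure metadata: returns identifiers without touching the data.
        return list(self._blobs)

    def open_shard(self, shard_id: str) -> IO[bytes]:
        self.opened += 1
        return io.BytesIO(self._blobs[shard_id])


source = CountingSource({"data-000.tar": b"", "data-001.tar": b""})

# Counting shards opens nothing.
num_shards = len(source.list_shards())
opened_after_listing = source.opened

# Iterating shards opens each stream and yields the same identifiers.
yielded_ids = [shard_id for shard_id, _stream in source.shards]
ids_match = yielded_ids == source.list_shards()
```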

open_shard

DataSource.open_shard(shard_id)

Open a single shard by its identifier.

This method enables random access to individual shards, which is required for PyTorch DataLoader worker splitting. Each worker opens only its assigned shards rather than iterating all shards.

Parameters

Name Type Description Default
shard_id str Shard identifier, as returned by list_shards(). required

Returns

Name Type Description
IO[bytes] File-like stream for reading the shard.

Raises

Name Type Description
KeyError If shard_id is not one of the identifiers returned by list_shards().
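
The worker-splitting pattern can be sketched as follows. This is one possible scheme (round-robin assignment by index), not necessarily the one the library uses; DictSource, shards_for_worker, and open_worker_shards are hypothetical names. Inside a real PyTorch DataLoader worker, the worker index and worker count would typically come from torch.utils.data.get_worker_info(); here they are passed in explicitly so the example runs without PyTorch.

```python
import io
from typing import IO, Iterator


class DictSource:
    """Hypothetical source backed by a dict of shard bytes."""

    def __init__(self, blobs: dict[str, bytes]) -> None:
        self._blobs = blobs

    def list_shards(self) -> list[str]:
        return list(self._blobs)

    def open_shard(self, shard_id: str) -> IO[bytes]:
        # Raises KeyError when shard_id is unknown.
        return io.BytesIO(self._blobs[shard_id])


def shards_for_worker(
    shard_ids: list[str], worker_id: int, num_workers: int
) -> list[str]:
    """Round-robin split: worker k takes shards k, k + n, k + 2n, ..."""
    return shard_ids[worker_id::num_workers]


def open_worker_shards(
    source: DictSource, worker_id: int, num_workers: int
) -> Iterator[tuple[str, IO[bytes]]]:
    """Open only the shards assigned to this worker."""
    assigned = shards_for_worker(source.list_shards(), worker_id, num_workers)
    for shard_id in assigned:
        yield shard_id, source.open_shard(shard_id)


source = DictSource({f"data-{i:03d}.tar": b"" for i in range(5)})
for worker_id in range(2):
    ids = [sid for sid, _ in open_worker_shards(source, worker_id, 2)]
    print(worker_id, ids)
```

Every shard is assigned to exactly one worker, so no shard is read twice and none is skipped.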