DataSource
DataSource()
Protocol for data sources that provide streams to Dataset.
A DataSource abstracts over different ways of accessing dataset shards:
- URLSource: Standard WebDataset-compatible URLs (http, https, pipe, gs, etc.)
- S3Source: S3-compatible storage with explicit credentials
- BlobSource: ATProto blob references (future)
The key method is shards(), which yields (identifier, stream) pairs. These are fed directly to WebDataset’s tar_file_expander, bypassing URL resolution entirely. This enables:
- Private S3 repos with credentials
- Custom endpoints (Cloudflare R2, MinIO)
- ATProto blob streaming
- Any other source that can provide file-like objects
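To make the shape of the protocol concrete, here is a minimal sketch of a custom source that serves shards from memory. `InMemorySource` is a hypothetical class, not part of the library, and the sketch assumes that `shards` is exposed as a property (it is listed under Attributes on this page); if the implementation you target defines it as a plain method, drop the decorator.

```python
import io
from typing import IO, Iterator


class InMemorySource:
    """Hypothetical source that serves shards from a dict of raw bytes."""

    def __init__(self, blobs: dict[str, bytes]) -> None:
        # Maps shard identifier -> shard contents (e.g. tar bytes).
        self._blobs = dict(blobs)

    @property
    def shards(self) -> Iterator[tuple[str, IO[bytes]]]:
        # Lazily yield (identifier, stream) pairs, opening one shard at a time.
        for shard_id in self.list_shards():
            yield shard_id, self.open_shard(shard_id)

    def list_shards(self) -> list[str]:
        # Identifiers only; no streams are opened here.
        return list(self._blobs)

    def open_shard(self, shard_id: str) -> IO[bytes]:
        # Random access to a single shard; unknown identifiers raise KeyError.
        if shard_id not in self._blobs:
            raise KeyError(shard_id)
        return io.BytesIO(self._blobs[shard_id])
```

Any object with these three members and file-like streams should fit the description above; whether Dataset needs anything beyond them is not stated here.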
Examples
>>> source = S3Source(
... bucket="my-bucket",
... keys=["data-000.tar", "data-001.tar"],
... endpoint="https://r2.example.com",
... credentials=creds,
... )
>>> ds = Dataset[MySample](source)
>>> for sample in ds.ordered():
... print(sample)
Attributes
| Name | Description |
|---|---|
| shards | Lazily yield (identifier, stream) pairs for each shard. |
Methods
| Name | Description |
|---|---|
| list_shards | Get list of shard identifiers without opening streams. |
| open_shard | Open a single shard by its identifier. |
list_shards
DataSource.list_shards()
Get list of shard identifiers without opening streams.
Used for metadata queries like counting shards without actually streaming data. Implementations should return identifiers that match what shards would yield.
Returns
| Name | Type | Description |
|---|---|---|
| | list[str] | List of shard identifier strings. |
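The requirement that list_shards agree with what shards yields can be checked mechanically. The sketch below is one way to do that in a test; `identifiers_match` and `StubSource` are hypothetical names, and `shards` is again assumed to be a property. The check opens every shard, so it belongs in a test suite rather than in a metadata query.

```python
import io
from typing import Any


def identifiers_match(source: Any) -> bool:
    """Return True if list_shards() agrees with the identifiers shards yields."""
    listed = source.list_shards()
    yielded = []
    for shard_id, stream in source.shards:
        yielded.append(shard_id)
        stream.close()  # every shard gets opened here, so close each stream
    return listed == yielded


class StubSource:
    """Hypothetical two-shard source used only to exercise the check."""

    def list_shards(self) -> list[str]:
        return ["data-000.tar", "data-001.tar"]

    @property
    def shards(self):
        for shard_id in self.list_shards():
            yield shard_id, io.BytesIO(b"")


source = StubSource()
print(len(source.list_shards()))  # shard count, without opening any stream
print(identifiers_match(source))
```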
open_shard
DataSource.open_shard(shard_id)
Open a single shard by its identifier.
This method enables random access to individual shards, which is required for PyTorch DataLoader worker splitting. Each worker opens only its assigned shards rather than iterating all shards.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| shard_id | str | Shard identifier, as returned by list_shards(). | required |
Returns
| Name | Type | Description |
|---|---|---|
| | IO[bytes] | File-like stream for reading the shard. |
Raises
| Name | Type | Description |
|---|---|---|
| | KeyError | If shard_id is not among the identifiers returned by list_shards(). |
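As an illustration of how list_shards and open_shard combine for worker splitting, the sketch below assigns shards to workers round-robin and opens only the assigned ones. The round-robin strategy, `shards_for_worker`, and `StubSource` are assumptions made for this example and are not necessarily what Dataset does internally. In a PyTorch worker process, `worker_id` and `num_workers` would typically come from `torch.utils.data.get_worker_info()`.

```python
import io
from typing import IO, Any, Iterator


def shards_for_worker(
    source: Any, worker_id: int, num_workers: int
) -> Iterator[tuple[str, IO[bytes]]]:
    """Yield (identifier, stream) pairs for the shards assigned to one worker."""
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in range(num_workers)")
    # Round-robin assignment: worker k takes shards k, k + n, k + 2n, ...
    for shard_id in source.list_shards()[worker_id::num_workers]:
        yield shard_id, source.open_shard(shard_id)


class StubSource:
    """Hypothetical source that records which shards were opened."""

    def __init__(self, shard_ids: list[str]) -> None:
        self._ids = list(shard_ids)
        self.opened: list[str] = []

    def list_shards(self) -> list[str]:
        return list(self._ids)

    def open_shard(self, shard_id: str) -> IO[bytes]:
        if shard_id not in self._ids:
            raise KeyError(shard_id)
        self.opened.append(shard_id)
        return io.BytesIO(b"")


source = StubSource([f"data-{i:03d}.tar" for i in range(5)])
for shard_id, stream in shards_for_worker(source, worker_id=1, num_workers=2):
    stream.close()
print(source.opened)  # only the shards assigned to worker 1 were opened
```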