AbstractDataStore
AbstractDataStore()
Protocol for data storage operations.
This protocol abstracts over different storage backends for dataset data:
- S3DataStore: S3-compatible object storage
- PDSBlobStore: ATProto PDS blob storage (future)
The separation of index (metadata) from data store (actual files) allows flexible deployment: local index with S3 storage, atmosphere index with S3 storage, or atmosphere index with PDS blobs.
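As a rough illustration of what conforming to this protocol could look like, the sketch below defines a local-directory backend with the three documented methods. `DataStoreLike` is a stand-in written here with `typing.Protocol` so the example runs on its own; it is not the real atdata class. `LocalDirStore` and its treatment of the dataset as an iterable of pre-serialized shard payloads are assumptions made for the example.

```python
from pathlib import Path
from typing import Any, Iterable, Protocol, runtime_checkable


@runtime_checkable
class DataStoreLike(Protocol):
    """Stand-in mirroring the three documented methods (not the real atdata class)."""

    def write_shards(self, ds: Any, *, prefix: str, **kwargs: Any) -> list[str]: ...
    def read_url(self, url: str) -> str: ...
    def supports_streaming(self) -> bool: ...


class LocalDirStore:
    """Hypothetical backend that writes each shard as a file under a root directory."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def write_shards(self, ds: Iterable[bytes], *, prefix: str, **kwargs: Any) -> list[str]:
        # Assumption: `ds` yields already-serialized shard payloads, one per shard.
        target = self.root / prefix
        target.mkdir(parents=True, exist_ok=True)
        urls = []
        for i, payload in enumerate(ds):
            path = target / f"shard-{i:06d}.tar"
            path.write_bytes(payload)
            # resolve() makes the path absolute, which as_uri() requires.
            urls.append(path.resolve().as_uri())
        return urls

    def read_url(self, url: str) -> str:
        # Local file URLs need no signing or rewriting.
        return url

    def supports_streaming(self) -> bool:
        # Local files can be read incrementally.
        return True
```

Because the stand-in protocol is structural, `LocalDirStore` satisfies it without inheriting from anything, which is what lets an index be paired with whichever store a deployment needs.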
Examples
>>> store = S3DataStore(credentials, bucket="my-bucket")
>>> urls = store.write_shards(dataset, prefix="training/v1")
>>> print(urls)
['s3://my-bucket/training/v1/shard-000000.tar', ...]
Methods
| Name | Description |
|---|---|
| read_url | Resolve a storage URL for reading. |
| supports_streaming | Whether this store supports streaming reads. |
| write_shards | Write dataset shards to storage. |
read_url
AbstractDataStore.read_url(url)
Resolve a storage URL for reading.
Some storage backends may need to transform URLs (e.g., signing S3 URLs or resolving blob references). This method returns a URL that can be used directly with WebDataset.
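To make the idea of URL transformation concrete, here is a minimal sketch of one possible rewrite: turning an `s3://bucket/key` URL into a path-style HTTPS URL against a given endpoint. The function name `resolve_s3_url` and the endpoint value are hypothetical, and this is not necessarily what S3DataStore does; a real backend might sign the URL instead.

```python
from urllib.parse import urlsplit


def resolve_s3_url(url: str, endpoint: str) -> str:
    """Rewrite s3://bucket/key to <endpoint>/bucket/key; pass other URLs through.

    Hypothetical helper for illustration only; no signing is performed.
    """
    parts = urlsplit(url)
    if parts.scheme != "s3":
        # Already directly readable (e.g. https:// or file://).
        return url
    # For s3:// URLs, netloc is the bucket and path is "/" + object key.
    return f"{endpoint.rstrip('/')}/{parts.netloc}{parts.path}"
```

A `read_url` implementation could delegate to a helper like this and return the result unchanged for URLs that need no rewriting.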
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| url | str | Storage URL to resolve. | required |
Returns
| Name | Type | Description |
|---|---|---|
| | str | WebDataset-compatible URL for reading. |
supports_streaming
AbstractDataStore.supports_streaming()
Whether this store supports streaming reads.
Returns
| Name | Type | Description |
|---|---|---|
| | bool | True if the store supports efficient streaming (like S3), False if data must be fully downloaded first. |
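One way a caller might use this flag is to pick a read strategy up front. The sketch below is an assumption about usage, not part of the documented API: `StreamCapable`, the two stub stores, and `choose_read_mode` are all hypothetical names defined only for this example.

```python
from typing import Protocol


class StreamCapable(Protocol):
    """Stand-in for the documented method; not the real atdata class."""

    def supports_streaming(self) -> bool: ...


class StreamingStore:
    """Stub store that reports streaming support (as an S3-like backend might)."""

    def supports_streaming(self) -> bool:
        return True


class DownloadOnlyStore:
    """Stub store whose data must be fully downloaded before reading."""

    def supports_streaming(self) -> bool:
        return False


def choose_read_mode(store: StreamCapable) -> str:
    """Return "stream" or "download" depending on the store's capability."""
    return "stream" if store.supports_streaming() else "download"
```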
write_shards
AbstractDataStore.write_shards(ds, *, prefix, **kwargs)
Write dataset shards to storage.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| ds | Dataset | The Dataset to write. | required |
| prefix | str | Path prefix for the shards (e.g., 'datasets/mnist/v1'). | required |
| **kwargs | | Backend-specific options (e.g., maxcount for shard size). | {} |
Returns
| Name | Type | Description |
|---|---|---|
| | list[str] | List of URLs for the written shards, suitable for use with WebDataset or atdata.Dataset(). |
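The interaction between `prefix` and a shard-size option can be sketched as follows. This assumes `maxcount` means the maximum number of samples per shard and that shards are named `shard-NNNNNN.tar` under the prefix, as in the class-level example; the helper `plan_shards` is hypothetical and only computes the layout without writing anything.

```python
def plan_shards(
    num_samples: int, *, prefix: str, maxcount: int = 1000
) -> list[tuple[str, range]]:
    """Map each shard key to the range of sample indices it would hold.

    Hypothetical helper: assumes `maxcount` caps the samples per shard.
    """
    if maxcount <= 0:
        raise ValueError("maxcount must be positive")
    plan = []
    for shard_idx, start in enumerate(range(0, num_samples, maxcount)):
        stop = min(start + maxcount, num_samples)
        plan.append((f"{prefix}/shard-{shard_idx:06d}.tar", range(start, stop)))
    return plan
```

Under these assumptions, 2500 samples with `maxcount=1000` would be laid out as three shards, the last one holding the remaining 500 samples.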