AbstractDataStore

AbstractDataStore()

Protocol for data storage operations.

This protocol abstracts over different storage backends for dataset data:

- S3DataStore: S3-compatible object storage
- PDSBlobStore: ATProto PDS blob storage (future)

The separation of index (metadata) from data store (actual files) allows flexible deployment: local index with S3 storage, atmosphere index with S3 storage, or atmosphere index with PDS blobs.
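As a structural sketch of the interface, assuming only the three documented methods (the `InMemoryStore` backend below is a toy for illustration, not one of the real backends):

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class AbstractDataStore(Protocol):
    """Sketch of the documented protocol; method names match this page."""

    def write_shards(self, ds, *, prefix: str, **kwargs) -> list[str]: ...
    def read_url(self, url: str) -> str: ...
    def supports_streaming(self) -> bool: ...


class InMemoryStore:
    """Toy backend kept in a dict; real backends are S3DataStore and,
    in future, PDSBlobStore."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def write_shards(self, ds, *, prefix: str, **kwargs) -> list[str]:
        url = f"mem://{prefix}/shard-000000.tar"
        self._blobs[url] = b""  # a real backend writes the shard bytes here
        return [url]

    def read_url(self, url: str) -> str:
        return url  # nothing to sign or resolve for the toy backend

    def supports_streaming(self) -> bool:
        return False


store = InMemoryStore()
isinstance(store, AbstractDataStore)  # structural check via runtime_checkable
```

Because the protocol is `runtime_checkable`, any object with these three methods satisfies an `isinstance` check, regardless of inheritance.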

Examples

>>> store = S3DataStore(credentials, bucket="my-bucket")
>>> urls = store.write_shards(dataset, prefix="training/v1")
>>> print(urls)
['s3://my-bucket/training/v1/shard-000000.tar', ...]

Methods

| Name | Description |
|------|-------------|
| read_url | Resolve a storage URL for reading. |
| supports_streaming | Whether this store supports streaming reads. |
| write_shards | Write dataset shards to storage. |

read_url

AbstractDataStore.read_url(url)

Resolve a storage URL for reading.

Some storage backends may need to transform URLs (e.g., signing S3 URLs or resolving blob references). This method returns a URL that can be used directly with WebDataset.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| url | str | Storage URL to resolve. | required |

Returns

| Name | Type | Description |
|------|------|-------------|
| | str | WebDataset-compatible URL for reading. |
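A minimal sketch of why this transformation step exists, using a hypothetical signing backend (the HTTPS URL layout and the signature query string are stand-ins, not the real S3DataStore logic):

```python
from urllib.parse import urlsplit


class SigningStore:
    """Toy backend: rewrites raw s3:// URLs into HTTPS URLs carrying a
    (fake) signature, so WebDataset can fetch them directly. Anything
    that is not an s3:// URL is assumed to already be readable."""

    def read_url(self, url: str) -> str:
        parts = urlsplit(url)
        if parts.scheme != "s3":
            return url  # already directly readable
        bucket, key = parts.netloc, parts.path.lstrip("/")
        # placeholder signature, not real SigV4 signing
        return f"https://{bucket}.s3.amazonaws.com/{key}?X-Amz-Signature=demo"


store = SigningStore()
store.read_url("s3://my-bucket/training/v1/shard-000000.tar")
```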

supports_streaming

AbstractDataStore.supports_streaming()

Whether this store supports streaming reads.

Returns

| Name | Type | Description |
|------|------|-------------|
| | bool | True if the store supports efficient streaming (like S3), False if data must be fully downloaded first. |
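A caller can branch on this flag to decide between streaming a shard and materializing it locally first. The helper below is hypothetical (neither `open_for_reading` nor its download path is part of this API):

```python
import tempfile
import urllib.request


def open_for_reading(store, url: str) -> str:
    """Illustrative caller-side helper: hand WebDataset a streamable URL
    when the backend allows it, otherwise download the shard fully."""
    resolved = store.read_url(url)
    if store.supports_streaming():
        return resolved  # e.g. S3: the tar can be read incrementally
    # non-streaming backends: fetch the whole shard to a temp file first
    local = tempfile.NamedTemporaryFile(suffix=".tar", delete=False)
    with urllib.request.urlopen(resolved) as src:
        local.write(src.read())
    local.close()
    return local.name
```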

write_shards

AbstractDataStore.write_shards(ds, *, prefix, **kwargs)

Write dataset shards to storage.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| ds | Dataset | The Dataset to write. | required |
| prefix | str | Path prefix for the shards (e.g., 'datasets/mnist/v1'). | required |
| `**kwargs` | | Backend-specific options (e.g., maxcount for shard size). | {} |

Returns

| Name | Type | Description |
|------|------|-------------|
| | list[str] | List of URLs for the written shards, suitable for use with WebDataset or atdata.Dataset(). |
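The contract (split samples into tar shards, return one URL per shard) can be illustrated with a toy in-memory backend. Everything here is an assumption for illustration: the `mem://` scheme, the dict of blobs, and the sample format of `(key, bytes)` pairs; real backends differ in where the bytes land, not in this shape.

```python
import io
import tarfile


class LocalTarStore:
    """Toy store: packs samples into tar shards of at most `maxcount`
    entries each and returns one mem:// URL per shard."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def write_shards(self, ds, *, prefix: str, maxcount: int = 2, **kwargs) -> list[str]:
        samples = list(ds)  # here: (key, payload-bytes) pairs
        urls = []
        for i in range(0, len(samples), maxcount):
            buf = io.BytesIO()
            with tarfile.open(fileobj=buf, mode="w") as tar:
                for key, payload in samples[i:i + maxcount]:
                    info = tarfile.TarInfo(name=f"{key}.txt")
                    info.size = len(payload)
                    tar.addfile(info, io.BytesIO(payload))
            url = f"mem://{prefix}/shard-{i // maxcount:06d}.tar"
            self.blobs[url] = buf.getvalue()
            urls.append(url)
        return urls


store = LocalTarStore()
urls = store.write_shards(
    [("a", b"1"), ("b", b"2"), ("c", b"3")], prefix="datasets/mnist/v1"
)
# 3 samples with maxcount=2 -> two shards:
# ['mem://datasets/mnist/v1/shard-000000.tar',
#  'mem://datasets/mnist/v1/shard-000001.tar']
```

Passing a backend-specific option such as maxcount through `**kwargs` keeps the protocol signature uniform while letting each store tune sharding.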