DatasetLoader

atmosphere.DatasetLoader(client)

Loads dataset records from ATProto.

This class fetches dataset index records and can create Dataset objects from them. Loading a dataset requires the Python class that corresponds to the record's sample type.

Examples

>>> client = AtmosphereClient()
>>> loader = DatasetLoader(client)
>>>
>>> # List available datasets
>>> datasets = loader.list_all()
>>> for ds in datasets:
...     print(ds["name"], ds["schemaRef"])
>>>
>>> # Get a specific dataset record
>>> record = loader.get("at://did:plc:abc/ac.foundation.dataset.record/xyz")

Methods

Name              Description
get               Fetch a dataset record by AT URI.
get_blob_urls     Get fetchable URLs for blob-stored dataset shards.
get_blobs         Get the blob references from a dataset record.
get_metadata      Get the metadata from a dataset record.
get_storage_type  Get the storage type of a dataset record.
get_urls          Get the WebDataset URLs from a dataset record.
list_all          List dataset records from a repository.
to_dataset        Create a Dataset object from an ATProto record.

get

atmosphere.DatasetLoader.get(uri)

Fetch a dataset record by AT URI.

Parameters

Name Type Description Default
uri str | AtUri The AT URI of the dataset record. required

Returns

Name Type Description
dict The dataset record as a dictionary.

Raises

Name Type Description
ValueError If the record is not a dataset record.
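
An AT URI such as the one in the class example has three parts: the repository authority, the collection NSID, and the record key. The sketch below splits a URI into those parts and checks the collection. It is illustrative only: `parse_at_uri`, `looks_like_dataset_uri`, and `DATASET_COLLECTION` are hypothetical names, the collection NSID is copied from the example URI on this page, and the library may validate records differently (for example by inspecting the fetched record rather than the URI).

```python
from typing import NamedTuple

# Collection NSID copied from the example URI on this page. Treating it as
# the only valid dataset collection is an assumption.
DATASET_COLLECTION = "ac.foundation.dataset.record"


class ParsedAtUri(NamedTuple):
    authority: str   # DID or handle of the repository
    collection: str  # NSID of the record collection
    rkey: str        # record key within the collection


def parse_at_uri(uri: str) -> ParsedAtUri:
    """Split an AT URI of the form at://<authority>/<collection>/<rkey>."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected at://authority/collection/rkey, got {uri!r}")
    return ParsedAtUri(*parts)


def looks_like_dataset_uri(uri: str) -> bool:
    """Return True if the URI points into the dataset record collection."""
    return parse_at_uri(uri).collection == DATASET_COLLECTION
```

A check like this can reject an obviously wrong URI before any network request is made.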

get_blob_urls

atmosphere.DatasetLoader.get_blob_urls(uri)

Get fetchable URLs for blob-stored dataset shards.

This resolves the PDS endpoint and constructs URLs that can be used to fetch the blob data directly.

Parameters

Name Type Description Default
uri str | AtUri The AT URI of the dataset record. required

Returns

Name Type Description
list[str] List of URLs for fetching the blob data.

Raises

Name Type Description
ValueError If the storage type is not "blobs" or the PDS endpoint cannot be resolved.
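
To make the URL construction concrete, here is a minimal sketch of how a blob URL could be built once the PDS endpoint is known. It assumes the standard ATProto `com.atproto.sync.getBlob` XRPC endpoint and the usual JSON blob encoding, where the CID sits under `ref["$link"]`. The helper names `blob_url` and `blob_urls` are hypothetical, and the library's actual implementation may differ.

```python
from urllib.parse import urlencode


def blob_url(pds_endpoint: str, did: str, cid: str) -> str:
    """Build a com.atproto.sync.getBlob URL for a single blob."""
    query = urlencode({"did": did, "cid": cid})
    return f"{pds_endpoint.rstrip('/')}/xrpc/com.atproto.sync.getBlob?{query}"


def blob_urls(pds_endpoint: str, did: str, blobs: list[dict]) -> list[str]:
    """Build one URL per blob reference, reading the CID from ref["$link"]."""
    return [blob_url(pds_endpoint, did, blob["ref"]["$link"]) for blob in blobs]
```

Percent-encoding the DID keeps its colons from being misread as URL syntax.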

get_blobs

atmosphere.DatasetLoader.get_blobs(uri)

Get the blob references from a dataset record.

Parameters

Name Type Description Default
uri str | AtUri The AT URI of the dataset record. required

Returns

Name Type Description
list[dict] List of blob reference dicts with keys: $type, ref, mimeType, size.

Raises

Name Type Description
ValueError If the storage type is not blobs.
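
The returned reference dicts can be inspected without downloading any data. The sketch below collects the CIDs and the total size in bytes. `summarize_blobs` is a hypothetical helper, and reading the CID from `ref["$link"]` assumes the usual ATProto JSON blob encoding.

```python
def summarize_blobs(blobs: list[dict]) -> dict:
    """Collect CIDs and the total byte size from blob reference dicts."""
    required = {"$type", "ref", "mimeType", "size"}
    for blob in blobs:
        missing = required - set(blob)
        if missing:
            raise ValueError(f"blob reference missing keys: {sorted(missing)}")
    return {
        # Assumes each ref is encoded as {"$link": "<cid>"}.
        "cids": [blob["ref"]["$link"] for blob in blobs],
        "total_bytes": sum(blob["size"] for blob in blobs),
    }
```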

get_metadata

atmosphere.DatasetLoader.get_metadata(uri)

Get the metadata from a dataset record.

Parameters

Name Type Description Default
uri str | AtUri The AT URI of the dataset record. required

Returns

Name Type Description
Optional[dict] The metadata dictionary, or None if the record has no metadata.

get_storage_type

atmosphere.DatasetLoader.get_storage_type(uri)

Get the storage type of a dataset record.

Parameters

Name Type Description Default
uri str | AtUri The AT URI of the dataset record. required

Returns

Name Type Description
str Either "external" or "blobs".

Raises

Name Type Description
ValueError If the storage type is unknown.
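
Because `get_urls` and `get_blob_urls` each raise ValueError for the other storage type, callers may want to branch on the storage type first. The sketch below does that using only the methods documented on this page. `resolve_shard_urls` is a hypothetical helper, and `StubLoader` is a stand-in so the example runs without a network connection; neither is part of the library.

```python
def resolve_shard_urls(loader, uri: str) -> list[str]:
    """Return shard URLs for a record, whichever storage type it uses."""
    storage = loader.get_storage_type(uri)
    if storage == "external":
        return loader.get_urls(uri)
    if storage == "blobs":
        return loader.get_blob_urls(uri)
    raise ValueError(f"unexpected storage type: {storage!r}")


class StubLoader:
    """Hypothetical stand-in for DatasetLoader so the sketch runs offline."""

    def __init__(self, storage: str, urls: list[str]):
        self._storage = storage
        self._urls = urls

    def get_storage_type(self, uri: str) -> str:
        return self._storage

    def get_urls(self, uri: str) -> list[str]:
        if self._storage != "external":
            raise ValueError("storage type is not external")
        return self._urls

    def get_blob_urls(self, uri: str) -> list[str]:
        if self._storage != "blobs":
            raise ValueError("storage type is not blobs")
        return self._urls
```

With a real loader the call would be `resolve_shard_urls(loader, uri)`.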

get_urls

atmosphere.DatasetLoader.get_urls(uri)

Get the WebDataset URLs from a dataset record.

Parameters

Name Type Description Default
uri str | AtUri The AT URI of the dataset record. required

Returns

Name Type Description
list[str] List of WebDataset URLs.

Raises

Name Type Description
ValueError If the storage type is not "external".

list_all

atmosphere.DatasetLoader.list_all(repo=None, limit=100)

List dataset records from a repository.

Parameters

Name Type Description Default
repo Optional[str] The DID of the repository. Defaults to the authenticated user's repository. None
limit int Maximum number of records to return. 100

Returns

Name Type Description
list[dict] List of dataset records.
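
The returned list is plain Python data, so it can be post-processed directly. As an illustration, the sketch below groups dataset names by schema reference. It assumes each record carries the `name` and `schemaRef` keys used in the class example above; `group_by_schema` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_schema(records: list[dict]) -> dict[str, list[str]]:
    """Map each schemaRef to the names of the datasets that reference it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for record in records:
        groups[record["schemaRef"]].append(record["name"])
    return dict(groups)
```

With a real loader this could be called as `group_by_schema(loader.list_all(limit=50))`.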

to_dataset

atmosphere.DatasetLoader.to_dataset(uri, sample_type)

Create a Dataset object from an ATProto record.

This method creates a Dataset instance from a published record. You must provide the sample type class, which should match the schema referenced by the record.

Supports both external URL storage and ATProto blob storage.

Parameters

Name Type Description Default
uri str | AtUri The AT URI of the dataset record. required
sample_type Type[ST] The Python class for the sample type. required

Returns

Name Type Description
Dataset[ST] A Dataset instance configured from the record.

Raises

Name Type Description
ValueError If no storage URLs can be resolved.

Examples

>>> # uri is the AT URI of a published dataset record; MySampleType is your sample class
>>> loader = DatasetLoader(client)
>>> dataset = loader.to_dataset(uri, MySampleType)
>>> for batch in dataset.shuffled(batch_size=32):
...     process(batch)
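
Since the sample type must match the schema referenced by the record, one possible pattern is to keep a mapping from schema references to Python classes and look the class up before loading. The sketch below is an assumption-laden illustration: `load_with_registry` and the registry are hypothetical, the `schemaRef` key is taken from the class example above, and `StubLoader` stands in for a real `DatasetLoader` so the code runs offline.

```python
def load_with_registry(loader, uri: str, registry: dict[str, type]):
    """Look up the sample class for a record's schemaRef, then load it."""
    record = loader.get(uri)
    schema_ref = record["schemaRef"]
    try:
        sample_type = registry[schema_ref]
    except KeyError:
        raise ValueError(f"no sample class registered for {schema_ref!r}") from None
    return loader.to_dataset(uri, sample_type)


class StubLoader:
    """Hypothetical stand-in for DatasetLoader so the sketch runs offline."""

    def __init__(self, record: dict):
        self._record = record

    def get(self, uri: str) -> dict:
        return self._record

    def to_dataset(self, uri: str, sample_type: type):
        # A real loader returns a Dataset; the stub just echoes its inputs.
        return (uri, sample_type)
```

Failing early with a ValueError when no class is registered avoids loading a dataset with a mismatched sample type.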