DatasetLoader
atmosphere.DatasetLoader(client)
Loads dataset records from ATProto.
This class fetches dataset index records and can create Dataset objects from them. Creating a Dataset from a record requires the Python class that corresponds to the record's sample type.
Examples
>>> client = AtmosphereClient()
>>> loader = DatasetLoader(client)
>>>
>>> # List available datasets
>>> datasets = loader.list_all()
>>> for ds in datasets:
...     print(ds["name"], ds["schemaRef"])
>>>
>>> # Get a specific dataset record
>>> record = loader.get("at://did:plc:abc/ac.foundation.dataset.record/xyz")
Methods
get
Fetch a dataset record by AT URI.
get_blob_urls
Get fetchable URLs for blob-stored dataset shards.
get_blobs
Get the blob references from a dataset record.
get_metadata
Get the metadata from a dataset record.
get_storage_type
Get the storage type of a dataset record.
get_urls
Get the WebDataset URLs from a dataset record.
list_all
List dataset records from a repository.
to_dataset
Create a Dataset object from an ATProto record.
get
atmosphere.DatasetLoader.get(uri)
Fetch a dataset record by AT URI.
Parameters
uri
str | AtUri
The AT URI of the dataset record.
required
Returns
dict
The dataset record as a dictionary.
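Every method on this class takes an AT URI of the form at://<authority>/<collection>/<rkey>. The following is a minimal sketch of how such a URI breaks down, using only the standard library. The names AtUriParts and split_at_uri are hypothetical helpers for illustration and are not part of atmosphere.

```python
from typing import NamedTuple


class AtUriParts(NamedTuple):
    """Hypothetical container for the three parts of a record AT URI."""
    authority: str   # DID (or handle) of the repository
    collection: str  # NSID of the record collection
    rkey: str        # record key within the collection


def split_at_uri(uri: str) -> AtUriParts:
    """Split 'at://<authority>/<collection>/<rkey>' into its parts."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected at://authority/collection/rkey, got {uri!r}")
    return AtUriParts(*parts)


parts = split_at_uri("at://did:plc:abc/ac.foundation.dataset.record/xyz")
print(parts.authority, parts.collection, parts.rkey)
```

The authority component is the value that list_all accepts as repo when it is a DID.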
get_blob_urls
atmosphere.DatasetLoader.get_blob_urls(uri)
Get fetchable URLs for blob-stored dataset shards.
This resolves the PDS endpoint and constructs URLs that can be used to fetch the blob data directly.
Parameters
uri
str | AtUri
The AT URI of the dataset record.
required
Returns
list[str]
List of URLs for fetching the blob data.
Raises
ValueError
If the storage type is not blobs or the PDS cannot be resolved.
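To illustrate what a fetchable blob URL can look like, the sketch below builds one against the standard ATProto com.atproto.sync.getBlob XRPC endpoint, which takes a did and a cid query parameter. Whether get_blob_urls builds its URLs exactly this way is an assumption. The function name blob_url, the PDS host, and the CID value are placeholders.

```python
from urllib.parse import urlencode


def blob_url(pds_endpoint: str, did: str, cid: str) -> str:
    """Build a download URL for one blob on a PDS.

    Assumption: the blob is served through the standard
    com.atproto.sync.getBlob endpoint of the resolved PDS.
    """
    query = urlencode({"did": did, "cid": cid})
    return f"{pds_endpoint.rstrip('/')}/xrpc/com.atproto.sync.getBlob?{query}"


# Placeholder values for illustration only.
url = blob_url("https://pds.example.com/", "did:plc:abc", "bafkreiexample")
print(url)
```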
get_blobs
atmosphere.DatasetLoader.get_blobs(uri)
Get the blob references from a dataset record.
Parameters
uri
str | AtUri
The AT URI of the dataset record.
required
Returns
list[dict]
List of blob reference dicts with keys: $type, ref, mimeType, size.
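As a sketch of working with these blob references, the helper below collects each blob's CID and the total shard size. It assumes the ref value is either the JSON form {"$link": "<cid>"} used by ATProto or an object whose string form is the CID. The helper name summarize_blobs and the sample data are illustrative only.

```python
def summarize_blobs(blobs: list[dict]) -> tuple[list[str], int]:
    """Return (CIDs, total size in bytes) for a list of blob reference dicts."""
    cids: list[str] = []
    total_size = 0
    for blob in blobs:
        ref = blob["ref"]
        # Assumption: ref is {"$link": cid} in JSON form, otherwise a CID-like object.
        cid = ref["$link"] if isinstance(ref, dict) else str(ref)
        cids.append(cid)
        total_size += int(blob["size"])
    return cids, total_size


# Made-up blob references with the documented keys.
sample_blobs = [
    {"$type": "blob", "ref": {"$link": "bafkreiaaa"}, "mimeType": "application/x-tar", "size": 1024},
    {"$type": "blob", "ref": {"$link": "bafkreibbb"}, "mimeType": "application/x-tar", "size": 2048},
]
cids, total_size = summarize_blobs(sample_blobs)
print(cids, total_size)
```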
get_storage_type
atmosphere.DatasetLoader.get_storage_type(uri)
Get the storage type of a dataset record.
Parameters
uri
str | AtUri
The AT URI of the dataset record.
required
Returns
str
Either “external” or “blobs”.
get_urls
atmosphere.DatasetLoader.get_urls(uri)
Get the WebDataset URLs from a dataset record.
Parameters
uri
str | AtUri
The AT URI of the dataset record.
required
Raises
ValueError
If the storage type is not external URLs.
list_all
atmosphere.DatasetLoader.list_all(repo=None, limit=100)
List dataset records from a repository.
Parameters
repo
Optional[str]
The DID of the repository. Defaults to authenticated user.
None
limit
int
Maximum number of records to return.
100
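The class-level example reads "name" and "schemaRef" from each listed record. Assuming list_all returns a list of dicts with those keys (its return type is not documented here), a listing can be narrowed to one schema as sketched below. The helper name datasets_with_schema and the sample records are made up for illustration.

```python
def datasets_with_schema(records: list[dict], schema_ref: str) -> list[str]:
    """Return the names of records whose schemaRef equals schema_ref."""
    return [r["name"] for r in records if r.get("schemaRef") == schema_ref]


# Made-up records shaped like the entries printed in the class example.
sample_records = [
    {"name": "train-v1", "schemaRef": "at://did:plc:abc/example.schema/image"},
    {"name": "audio-v1", "schemaRef": "at://did:plc:abc/example.schema/audio"},
    {"name": "train-v2", "schemaRef": "at://did:plc:abc/example.schema/image"},
]
names = datasets_with_schema(sample_records, "at://did:plc:abc/example.schema/image")
print(names)
```

With a real loader, the records argument would be the result of loader.list_all().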
to_dataset
atmosphere.DatasetLoader.to_dataset(uri, sample_type)
Create a Dataset object from an ATProto record.
This method creates a Dataset instance from a published record. You must provide the sample type class, which should match the schema referenced by the record.
Supports both external URL storage and ATProto blob storage.
Parameters
uri
str | AtUri
The AT URI of the dataset record.
required
sample_type
Type[ST]
The Python class for the sample type.
required
Returns
Dataset[ST]
A Dataset instance configured from the record.
Examples
>>> loader = DatasetLoader(client)
>>> dataset = loader.to_dataset(uri, MySampleType)
>>> for batch in dataset.shuffled(batch_size=32):
...     process(batch)