DatasetPublisher

atmosphere.DatasetPublisher(client)

Publishes dataset index records to ATProto.

This class creates dataset records that reference a schema and point to data held either in external storage (WebDataset URLs) or in ATProto blobs.

Examples

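>>> # Import path assumed from the atmosphere.DatasetPublisher name above.
>>> import atdata
>>> from atdata.atmosphere import AtmosphereClient, DatasetPublisher
>>>
>>> # MySample is a sample type defined elsewhere.
>>> dataset = atdata.Dataset[MySample]("s3://bucket/data-{000000..000009}.tar")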
>>>
>>> client = AtmosphereClient()
>>> client.login("handle", "password")
>>>
>>> publisher = DatasetPublisher(client)
>>> uri = publisher.publish(
...     dataset,
...     name="My Training Data",
...     description="Training data for my model",
...     tags=["computer-vision", "training"],
... )

Methods

| Name | Description |
|---|---|
| publish | Publish a dataset index record to ATProto. |
| publish_with_blobs | Publish a dataset with data stored as ATProto blobs. |
| publish_with_urls | Publish a dataset record with explicit URLs. |

publish

atmosphere.DatasetPublisher.publish(
    dataset,
    *,
    name,
    schema_uri=None,
    description=None,
    tags=None,
    license=None,
    auto_publish_schema=True,
    schema_version='1.0.0',
    rkey=None,
)

Publish a dataset index record to ATProto.

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| dataset | Dataset[ST] | The Dataset to publish. | required |
| name | str | Human-readable dataset name. | required |
| schema_uri | Optional[str] | AT URI of the schema record. If omitted and auto_publish_schema is True, the schema is published automatically first. | None |
| description | Optional[str] | Human-readable description. | None |
| tags | Optional[list[str]] | Searchable tags for discovery. | None |
| license | Optional[str] | SPDX license identifier (e.g., ‘MIT’, ‘Apache-2.0’). | None |
| auto_publish_schema | bool | If True and schema_uri is not provided, automatically publish the schema first. | True |
| schema_version | str | Version for the auto-published schema. | '1.0.0' |
| rkey | Optional[str] | Optional explicit record key. | None |

Returns

| Name | Type | Description |
|---|---|---|
| | AtUri | The AT URI of the created dataset record. |
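
The returned URI identifies the new record. As a minimal sketch, assuming the AtUri converts to the standard at://authority/collection/rkey string form, its parts can be separated with plain string handling. The helper split_at_uri, the AtUriParts type, and the example DID and collection name are hypothetical and not part of the library.

```python
from typing import NamedTuple


class AtUriParts(NamedTuple):
    authority: str   # DID or handle of the repository owner
    collection: str  # NSID of the record collection
    rkey: str        # record key within that collection


def split_at_uri(uri: str) -> AtUriParts:
    """Split an at:// record URI into authority, collection, and rkey."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected at://authority/collection/rkey, got {uri!r}")
    return AtUriParts(*parts)


# Hypothetical URI; the DID and collection name are placeholders.
parts = split_at_uri("at://did:plc:abc123/com.example.dataset/my-dataset")
print(parts.rkey)  # my-dataset
```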

Raises

| Name | Type | Description |
|---|---|---|
| | ValueError | If schema_uri is not provided and auto_publish_schema is False. |
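
The interaction between schema_uri and auto_publish_schema can be illustrated with a small wrapper that always publishes against an existing schema. This is a sketch under stated assumptions: publish_with_pinned_schema is a hypothetical helper, and publisher is typed as Any so the block does not depend on the library being importable. Only the publish keyword arguments documented above are used.

```python
from typing import Any, Optional


def publish_with_pinned_schema(
    publisher: Any,
    dataset: Any,
    *,
    name: str,
    schema_uri: Optional[str],
) -> Any:
    """Publish a dataset without ever auto-publishing a schema.

    publisher is expected to behave like a DatasetPublisher.
    """
    try:
        return publisher.publish(
            dataset,
            name=name,
            schema_uri=schema_uri,
            auto_publish_schema=False,  # fail instead of publishing a new schema
        )
    except ValueError as exc:
        # Documented failure mode: no schema_uri and auto-publishing disabled.
        # Note that this also wraps any other ValueError raised by publish.
        raise RuntimeError(
            f"dataset {name!r} needs an existing schema_uri"
        ) from exc
```

Disabling auto-publishing this way makes a missing schema URI an explicit error rather than silently creating a new schema record at version '1.0.0'.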

publish_with_blobs

atmosphere.DatasetPublisher.publish_with_blobs(
    blobs,
    schema_uri,
    *,
    name,
    description=None,
    tags=None,
    license=None,
    metadata=None,
    mime_type='application/x-tar',
    rkey=None,
)

Publish a dataset with data stored as ATProto blobs.

This method uploads the provided data as blobs to the Personal Data Server (PDS) and creates a dataset record referencing them. It is suitable for smaller datasets whose shards fit within the blob size limit (typically 50 MB per blob, configurable).

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| blobs | list[bytes] | List of binary data (e.g., tar shards) to upload as blobs. | required |
| schema_uri | str | AT URI of the schema record. | required |
| name | str | Human-readable dataset name. | required |
| description | Optional[str] | Human-readable description. | None |
| tags | Optional[list[str]] | Searchable tags for discovery. | None |
| license | Optional[str] | SPDX license identifier. | None |
| metadata | Optional[dict] | Arbitrary metadata dictionary. | None |
| mime_type | str | MIME type for the blobs. | 'application/x-tar' |
| rkey | Optional[str] | Optional explicit record key. | None |

Returns

| Name | Type | Description |
|---|---|---|
| | AtUri | The AT URI of the created dataset record. |

Note

Blobs are only retained by the PDS when referenced in a committed record. Because this method creates the dataset record that references the uploaded blobs, no extra step is needed to keep them.
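
Since blobs is a list of in-memory bytes, it may help to check shard sizes before uploading. The following is a minimal sketch, assuming local tar shards and the "typically 50 MB" figure mentioned above; load_shards and DEFAULT_MAX_BLOB_BYTES are hypothetical names, and the actual limit depends on the PDS configuration.

```python
from pathlib import Path
from typing import Iterable, List

# Assumed limit based on the "typically 50 MB" figure; adjust for your PDS.
DEFAULT_MAX_BLOB_BYTES = 50 * 1024 * 1024


def load_shards(
    paths: Iterable[str],
    max_bytes: int = DEFAULT_MAX_BLOB_BYTES,
) -> List[bytes]:
    """Read each shard into memory, rejecting any that exceed max_bytes."""
    blobs = []
    for path in paths:
        size = Path(path).stat().st_size
        if size > max_bytes:
            raise ValueError(
                f"{path} is {size} bytes, over the {max_bytes}-byte limit"
            )
        blobs.append(Path(path).read_bytes())
    return blobs


# Usage with a publisher and schema_uri obtained as described above:
# blobs = load_shards(["shard-000000.tar", "shard-000001.tar"])
# uri = publisher.publish_with_blobs(blobs, schema_uri, name="My Training Data")
```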

publish_with_urls

atmosphere.DatasetPublisher.publish_with_urls(
    urls,
    schema_uri,
    *,
    name,
    description=None,
    tags=None,
    license=None,
    metadata=None,
    rkey=None,
)

Publish a dataset record with explicit URLs.

This method publishes a dataset record without requiring a Dataset object, which is useful for registering existing WebDataset files.

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| urls | list[str] | List of WebDataset URLs, which may use brace notation such as data-{000000..000009}.tar. | required |
| schema_uri | str | AT URI of the schema record. | required |
| name | str | Human-readable dataset name. | required |
| description | Optional[str] | Human-readable description. | None |
| tags | Optional[list[str]] | Searchable tags for discovery. | None |
| license | Optional[str] | SPDX license identifier. | None |
| metadata | Optional[dict] | Arbitrary metadata dictionary. | None |
| rkey | Optional[str] | Optional explicit record key. | None |

Returns

| Name | Type | Description |
|---|---|---|
| | AtUri | The AT URI of the created dataset record. |
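
To make the brace notation concrete, here is a minimal sketch that expands a single numeric range into individual shard URLs. The helper expand_numeric_braces is hypothetical and handles only one {start..end} range; it is not the expansion logic used by the library, and the full brace syntax accepted by WebDataset is broader.

```python
import re
from typing import List

_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_numeric_braces(url: str) -> List[str]:
    """Expand the first {start..end} numeric range in url, keeping zero padding."""
    match = _RANGE.search(url)
    if match is None:
        return [url]
    start, end = match.group(1), match.group(2)
    width = len(start)  # preserve the zero-padded width of the start value
    return [
        url[:match.start()] + str(i).zfill(width) + url[match.end():]
        for i in range(int(start), int(end) + 1)
    ]


shards = expand_numeric_braces("s3://bucket/data-{000000..000009}.tar")
print(len(shards))  # 10
print(shards[0])    # s3://bucket/data-000000.tar

# The unexpanded URL is what gets registered, for example:
# uri = publisher.publish_with_urls(
#     ["s3://bucket/data-{000000..000009}.tar"],
#     schema_uri,
#     name="My Training Data",
# )
```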