DatasetPublisher
atmosphere.DatasetPublisher(client)
Publishes dataset index records to ATProto.
This class creates dataset records that reference a schema and point to external storage (WebDataset URLs) or ATProto blobs.
Examples
>>> dataset = atdata.Dataset[MySample]("s3://bucket/data-{000000..000009}.tar")
>>>
>>> client = AtmosphereClient()
>>> client.login("handle", "password")
>>>
>>> publisher = DatasetPublisher(client)
>>> uri = publisher.publish(
...     dataset,
...     name="My Training Data",
...     description="Training data for my model",
...     tags=["computer-vision", "training"],
... )
Methods
| Name | Description |
|---|---|
| publish | Publish a dataset index record to ATProto. |
| publish_with_blobs | Publish a dataset with data stored as ATProto blobs. |
| publish_with_urls | Publish a dataset record with explicit URLs. |
publish
atmosphere.DatasetPublisher.publish(
    dataset,
    *,
    name,
    schema_uri=None,
    description=None,
    tags=None,
    license=None,
    auto_publish_schema=True,
    schema_version='1.0.0',
    rkey=None,
)
Publish a dataset index record to ATProto.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| dataset | Dataset[ST] | The Dataset to publish. | required |
| name | str | Human-readable dataset name. | required |
| schema_uri | Optional[str] | AT URI of the schema record. If not provided and auto_publish_schema is True, the schema will be published. | None |
| description | Optional[str] | Human-readable description. | None |
| tags | Optional[list[str]] | Searchable tags for discovery. | None |
| license | Optional[str] | SPDX license identifier (e.g., ‘MIT’, ‘Apache-2.0’). | None |
| auto_publish_schema | bool | If True and schema_uri not provided, automatically publish the schema first. | True |
| schema_version | str | Version for auto-published schema. | '1.0.0' |
| rkey | Optional[str] | Optional explicit record key. | None |
Returns
| Type | Description |
|---|---|
| AtUri | The AT URI of the created dataset record. |
Raises
| Type | Description |
|---|---|
| ValueError | If schema_uri is not provided and auto_publish_schema is False. |
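The interaction between `schema_uri` and `auto_publish_schema` described above can be sketched as a small decision function. This is only an illustration of the documented rules, not the library's implementation: `resolve_schema_uri` and the `publish_schema` callback are hypothetical names introduced for this example.

```python
from typing import Callable, Optional


def resolve_schema_uri(
    schema_uri: Optional[str],
    auto_publish_schema: bool,
    publish_schema: Callable[[], str],
) -> str:
    """Return the schema URI that a dataset record would reference.

    Hypothetical helper mirroring the documented rules of publish().
    """
    if schema_uri is not None:
        # An explicit schema URI is used as-is; nothing is published.
        return schema_uri
    if auto_publish_schema:
        # No URI given: publish the schema first and use its URI.
        return publish_schema()
    # Neither an explicit URI nor permission to publish one.
    raise ValueError(
        "schema_uri is required when auto_publish_schema is False"
    )
```

In practice this means a caller who sets `auto_publish_schema=False` should always pass `schema_uri`, for example one obtained from an earlier schema publication.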
publish_with_blobs
atmosphere.DatasetPublisher.publish_with_blobs(
    blobs,
    schema_uri,
    *,
    name,
    description=None,
    tags=None,
    license=None,
    metadata=None,
    mime_type='application/x-tar',
    rkey=None,
)
Publish a dataset with data stored as ATProto blobs.
This method uploads the provided data as blobs to the PDS and creates a dataset record referencing them. It is suitable for smaller datasets whose shards fit within the blob size limit (typically 50 MB per blob, configurable).
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| blobs | list[bytes] | List of binary data (e.g., tar shards) to upload as blobs. | required |
| schema_uri | str | AT URI of the schema record. | required |
| name | str | Human-readable dataset name. | required |
| description | Optional[str] | Human-readable description. | None |
| tags | Optional[list[str]] | Searchable tags for discovery. | None |
| license | Optional[str] | SPDX license identifier. | None |
| metadata | Optional[dict] | Arbitrary metadata dictionary. | None |
| mime_type | str | MIME type for the blobs (default: application/x-tar). | 'application/x-tar' |
| rkey | Optional[str] | Optional explicit record key. | None |
Returns
| Type | Description |
|---|---|
| AtUri | The AT URI of the created dataset record. |
Note
Blobs are only retained by the PDS when referenced in a committed record. This method handles that automatically.
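Because each blob must fit within the size limit, it can help to check shard sizes before calling `publish_with_blobs`. The sketch below is a standalone pre-flight check, not part of the library: `oversized_blobs` and `DEFAULT_BLOB_LIMIT` are hypothetical names, and the 50 MiB default is an assumption based on the "typically 50 MB" figure above. The actual limit depends on the PDS configuration.

```python
# Assumed limit: 50 MiB per blob. The real limit is set by the PDS.
DEFAULT_BLOB_LIMIT = 50 * 1024 * 1024


def oversized_blobs(blobs: list[bytes], limit: int = DEFAULT_BLOB_LIMIT) -> list[int]:
    """Return the indices of blobs whose size in bytes exceeds `limit`."""
    return [i for i, blob in enumerate(blobs) if len(blob) > limit]
```

If the returned list is empty, the same `blobs` list can be passed on to `publish_with_blobs`; otherwise the listed shards would need to be split or stored externally and registered by URL instead.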
publish_with_urls
atmosphere.DatasetPublisher.publish_with_urls(
    urls,
    schema_uri,
    *,
    name,
    description=None,
    tags=None,
    license=None,
    metadata=None,
    rkey=None,
)
Publish a dataset record with explicit URLs.
This method allows publishing a dataset record without having a Dataset object, useful for registering existing WebDataset files.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| urls | list[str] | List of WebDataset URLs with brace notation. | required |
| schema_uri | str | AT URI of the schema record. | required |
| name | str | Human-readable dataset name. | required |
| description | Optional[str] | Human-readable description. | None |
| tags | Optional[list[str]] | Searchable tags for discovery. | None |
| license | Optional[str] | SPDX license identifier. | None |
| metadata | Optional[dict] | Arbitrary metadata dictionary. | None |
| rkey | Optional[str] | Optional explicit record key. | None |
Returns
| Type | Description |
|---|---|
| AtUri | The AT URI of the created dataset record. |
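To make the brace notation in `urls` concrete, the following sketch expands a numeric range such as `{000000..000009}` into the individual shard URLs it stands for. It is a minimal illustration only: `expand_braces` is a hypothetical helper, it handles numeric ranges but not comma-separated alternatives, and it is not how the library itself processes URLs.

```python
import re

# Matches a numeric range such as {000000..000009}.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_braces(url: str) -> list[str]:
    """Expand every numeric {start..end} range in `url` into concrete URLs."""
    match = _RANGE.search(url)
    if match is None:
        # No range left to expand.
        return [url]
    start, end = match.group(1), match.group(2)
    width = len(start)  # preserve the zero padding of the start value
    expanded = []
    for n in range(int(start), int(end) + 1):
        candidate = url[: match.start()] + str(n).zfill(width) + url[match.end():]
        # Recurse so that any later ranges in the URL are expanded too.
        expanded.extend(expand_braces(candidate))
    return expanded
```

For instance, `expand_braces("s3://bucket/data-{000000..000009}.tar")` yields ten URLs, from `s3://bucket/data-000000.tar` to `s3://bucket/data-000009.tar`. The `urls` parameter itself takes the unexpanded brace form.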