load_dataset
load_dataset(
path,
sample_type=None,
*,
split=None,
data_files=None,
streaming=False,
index=None,
)

Load a dataset from local files, remote URLs, or an index.
This function provides a HuggingFace Datasets-style interface for loading atdata typed datasets. It handles path resolution, split detection, and returns either a single Dataset or a DatasetDict depending on the split parameter.
When no sample_type is provided, returns a Dataset[DictSample] that provides dynamic dict-like access to fields. Use .as_type(MyType) to convert to a typed schema.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to dataset. Can be: - Index lookup: `@handle/dataset-name` or `@local/dataset-name` - WebDataset brace notation: `path/to/{train,test}-{000..099}.tar` - Local directory: `./data/` (scans for .tar files) - Glob pattern: `path/to/*.tar` - Remote URL: `s3://bucket/path/data-*.tar` - Single file: `path/to/data.tar` | required |
| sample_type | Type[ST] \| None | The PackableSample subclass defining the schema. If None, returns Dataset[DictSample] with dynamic field access. Can also be resolved from an index when using @handle/dataset syntax. | None |
| split | str \| None | Which split to load. If None, returns a DatasetDict with all detected splits. If specified (e.g., "train", "test"), returns a single Dataset for that split. | None |
| data_files | str \| list[str] \| dict[str, str \| list[str]] \| None | Optional explicit mapping of data files. Can be: - str: Single file pattern - list[str]: List of file patterns (assigned to "train") - dict[str, str \| list[str]]: Explicit split -> files mapping | None |
| streaming | bool | If True, explicitly marks the dataset for streaming mode. Note: atdata Datasets are already lazy/streaming via WebDataset pipelines, so this parameter primarily signals intent. | False |
| index | Optional['AbstractIndex'] | Optional AbstractIndex for dataset lookup. Required when using @handle/dataset syntax. When provided with an indexed path, the schema can be auto-resolved from the index. | None |
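The three accepted shapes of data_files can be illustrated with a small sketch. The `normalize_data_files` helper below is hypothetical (it is not part of atdata's public API); it shows how a bare string, a list, and an explicit dict might collapse into a single split-to-patterns mapping, with strings and lists assigned to the "train" split as described above.

```python
# Hypothetical sketch (not atdata internals): normalize the three accepted
# data_files shapes into a uniform split -> list-of-patterns mapping.
def normalize_data_files(data_files):
    if isinstance(data_files, str):
        # Single file pattern -> assigned to "train"
        return {"train": [data_files]}
    if isinstance(data_files, list):
        # List of patterns -> assigned to "train"
        return {"train": list(data_files)}
    if isinstance(data_files, dict):
        # Explicit split -> files mapping; promote bare strings to lists
        return {split: [files] if isinstance(files, str) else list(files)
                for split, files in data_files.items()}
    raise TypeError(f"unsupported data_files type: {type(data_files).__name__}")
```

However atdata resolves this internally, the end result is the same kind of mapping that split then selects from.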
Returns
| Type | Description |
|---|---|
| Dataset[ST] \| DatasetDict[ST] | If split is None: DatasetDict with all detected splits. If split is specified: Dataset for that split. The sample type is ST if sample_type is provided, otherwise DictSample. |
Raises
| Type | Description |
|---|---|
| ValueError | If the specified split is not found. |
| FileNotFoundError | If no data files are found at the path. |
| KeyError | If the dataset is not found in the index. |
Examples
>>> # Load without type - get DictSample for exploration
>>> ds = load_dataset("./data/train.tar", split="train")
>>> for sample in ds.ordered():
... print(sample.keys()) # Explore fields
... print(sample["text"]) # Dict-style access
... print(sample.label) # Attribute access
>>>
>>> # Convert to typed schema
>>> typed_ds = ds.as_type(TextData)
>>>
>>> # Or load with explicit type directly
>>> train_ds = load_dataset("./data/train-*.tar", TextData, split="train")
>>>
>>> # Load from index with auto-type resolution
>>> index = LocalIndex()
>>> ds = load_dataset("@local/my-dataset", index=index, split="train")
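The WebDataset brace notation accepted by path (e.g. `{train,test}-{000..099}.tar`) can be sketched as a small expansion helper. The `expand_braces` function below is illustrative only, assuming comma alternatives and zero-padded numeric ranges; atdata's actual shard resolution may differ.

```python
import re

# Illustrative sketch, not atdata's implementation: expand WebDataset-style
# brace notation such as "{train,test}-{000..002}.tar" into concrete names.
_BRACE = re.compile(r"\{([^{}]*)\}")

def _options(body):
    # Numeric range like "000..099" -> zero-padded values; else comma list.
    m = re.fullmatch(r"(\d+)\.\.(\d+)", body)
    if m:
        lo, hi = m.group(1), m.group(2)
        return [str(i).zfill(len(lo)) for i in range(int(lo), int(hi) + 1)]
    return body.split(",")

def expand_braces(pattern):
    # Expand the leftmost brace group, then recurse on the remainder.
    m = _BRACE.search(pattern)
    if m is None:
        return [pattern]
    head, tail = pattern[:m.start()], pattern[m.end():]
    return [head + opt + rest
            for opt in _options(m.group(1))
            for rest in expand_braces(tail)]
```

For example, `expand_braces("{train,test}-{000..002}.tar")` enumerates six shard names, which is the set of files a pattern like the one in the path parameter would match.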