load_dataset

load_dataset(
    path,
    sample_type=None,
    *,
    split=None,
    data_files=None,
    streaming=False,
    index=None,
)

Load a dataset from local files, remote URLs, or an index.

This function provides a HuggingFace Datasets-style interface for loading atdata typed datasets. It handles path resolution, split detection, and returns either a single Dataset or a DatasetDict depending on the split parameter.
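As an illustration of the path resolution step, the sketch below expands WebDataset-style brace notation into individual shard names. It is a simplified stand-in written for this page, not atdata's or WebDataset's implementation: the helper names `expand_group` and `expand_braces` are hypothetical, and nested braces are not handled.

```python
import itertools
import re

# Matches one non-nested brace group such as {train,test} or {000..099}.
_BRACE = re.compile(r"\{([^{}]*)\}")


def expand_group(body: str) -> list[str]:
    """Expand the inside of one brace group into its alternatives."""
    m = re.fullmatch(r"(\d+)\.\.(\d+)", body)
    if m is None:
        # Comma list, e.g. "train,test".
        return body.split(",")
    start, end = m.group(1), m.group(2)
    # Zero-pad only when an endpoint is written with a leading zero.
    padded = any(len(s) > 1 and s[0] == "0" for s in (start, end))
    width = max(len(start), len(end)) if padded else 0
    return [str(i).zfill(width) for i in range(int(start), int(end) + 1)]


def expand_braces(pattern: str) -> list[str]:
    """Expand every brace group in `pattern` into concrete names."""
    # With one capture group, split() alternates literal text and group bodies.
    parts = _BRACE.split(pattern)
    literals = parts[0::2]
    groups = [expand_group(body) for body in parts[1::2]]
    results = []
    for combo in itertools.product(*groups):
        pieces = [literals[0]]
        for choice, literal in zip(combo, literals[1:]):
            pieces.append(choice)
            pieces.append(literal)
        results.append("".join(pieces))
    return results


shards = expand_braces("path/to/{train,test}-{000..002}.tar")
for name in shards:
    print(name)
```

Under this sketch, the pattern from the Parameters table, “path/to/{train,test}-{000..099}.tar”, expands to 200 shard names: 100 for each of the two split prefixes.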

When no sample_type is provided, the function returns a Dataset[DictSample], which provides dynamic dict-like access to fields. Use .as_type(MyType) to convert it to a typed schema.
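To make the difference between dynamic and typed access concrete, here is a minimal sketch of a sample object that supports both dict-style and attribute access, followed by a conversion to a typed record. `DictLikeSample` and `to_typed` are hypothetical names for illustration only and do not show how DictSample or .as_type are implemented; the `TextData` fields (`text`, `label`) are assumed from the usage in the Examples section.

```python
from dataclasses import dataclass, fields


class DictLikeSample:
    """Hypothetical stand-in for a sample with dynamic field access."""

    def __init__(self, data: dict):
        self._fields = dict(data)

    def keys(self):
        return self._fields.keys()

    def __getitem__(self, name):
        # Dict-style access: sample["text"]
        return self._fields[name]

    def __getattr__(self, name):
        # Attribute access: sample.label. Python calls this only after
        # normal attribute lookup has failed.
        try:
            return self.__dict__["_fields"][name]
        except KeyError:
            raise AttributeError(name) from None


@dataclass
class TextData:
    # Assumed schema; the real TextData may define different fields.
    text: str
    label: int


def to_typed(sample: DictLikeSample, sample_type):
    """Build a typed record from the fields the schema declares."""
    return sample_type(**{f.name: sample[f.name] for f in fields(sample_type)})


sample = DictLikeSample({"text": "hello", "label": 1})
typed = to_typed(sample, TextData)
print(list(sample.keys()), sample["text"], sample.label, typed)
```

The dynamic form is convenient for exploring an unfamiliar dataset; the typed form fails early if a field declared in the schema is missing.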

Parameters

Name Type Description Default
path str Path to the dataset. Can be one of: an index lookup (“@handle/dataset-name” or “@local/dataset-name”); WebDataset brace notation (“path/to/{train,test}-{000..099}.tar”); a local directory (“./data/”, scanned for .tar files); a glob pattern (“path/to/*.tar”); a remote URL (“s3://bucket/path/data-*.tar”); or a single file (“path/to/data.tar”). required
sample_type Type[ST] | None The PackableSample subclass defining the schema. If None, returns Dataset[DictSample] with dynamic field access. Can also be resolved from an index when using @handle/dataset syntax. None
split str | None Which split to load. If None, returns a DatasetDict with all detected splits. If specified (e.g., “train”, “test”), returns a single Dataset for that split. None
data_files str | list[str] | dict[str, str | list[str]] | None Optional explicit mapping of data files. Can be: a str (a single file pattern); a list[str] (a list of file patterns, assigned to “train”); or a dict[str, str | list[str]] (an explicit mapping from split name to files). None
streaming bool If True, explicitly marks the dataset for streaming mode. Note: atdata Datasets are already lazy/streaming via WebDataset pipelines, so this parameter primarily signals intent. False
index Optional['AbstractIndex'] Optional AbstractIndex for dataset lookup. Required when using @handle/dataset syntax. When provided with an indexed path, the schema can be auto-resolved from the index. None
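The three accepted shapes of data_files can be reduced to a single split-to-files mapping. The sketch below shows one way to do that; `normalize_data_files` is a hypothetical helper, not part of atdata. Assigning a bare string to “train” is an assumption made here by analogy with the documented list behaviour.

```python
def normalize_data_files(data_files):
    """Return a {split: [patterns]} mapping, or None if nothing was given."""
    if data_files is None:
        return None
    if isinstance(data_files, str):
        # Assumption: a single pattern is treated like a one-element list.
        return {"train": [data_files]}
    if isinstance(data_files, list):
        # Documented behaviour: a list of patterns is assigned to "train".
        return {"train": list(data_files)}
    if isinstance(data_files, dict):
        # Explicit mapping: wrap bare strings so every value is a list.
        return {
            split: [files] if isinstance(files, str) else list(files)
            for split, files in data_files.items()
        }
    raise TypeError(f"Unsupported data_files type: {type(data_files).__name__}")


mapping = normalize_data_files(
    {"train": ["a-000.tar", "a-001.tar"], "test": "b-000.tar"}
)
print(mapping)
```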

Returns

Type Description
Dataset[ST] | DatasetDict[ST] If split is None, a DatasetDict with all detected splits; if split is specified, a single Dataset for that split. ST is sample_type if provided, otherwise DictSample.
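The return behaviour can be sketched as a small dispatch on split. In this illustration a plain dict stands in for DatasetDict and lists of shard names stand in for Dataset objects; `select_split` is a hypothetical helper and not the library's code. The ValueError mirrors the one documented under Raises.

```python
def select_split(splits: dict, split=None):
    """Return all splits when `split` is None, else the one requested."""
    if split is None:
        # No split requested: hand back every detected split.
        return dict(splits)
    if split not in splits:
        available = ", ".join(sorted(splits))
        raise ValueError(f"Split {split!r} not found; available: {available}")
    return splits[split]


detected = {"train": ["train-000.tar"], "test": ["test-000.tar"]}
everything = select_split(detected)
train_only = select_split(detected, "train")
print(sorted(everything), train_only)
```

Callers that always pass split can therefore rely on getting a single dataset back, while callers that omit it should expect a mapping keyed by split name.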

Raises

Type Description
ValueError If the specified split is not found.
FileNotFoundError If no data files are found at the path.
KeyError If dataset not found in index.

Examples

>>> # Load without type - get DictSample for exploration
>>> ds = load_dataset("./data/train.tar", split="train")
>>> for sample in ds.ordered():
...     print(sample.keys())  # Explore fields
...     print(sample["text"]) # Dict-style access
...     print(sample.label)   # Attribute access
>>>
>>> # Convert to typed schema
>>> typed_ds = ds.as_type(TextData)
>>>
>>> # Or load with explicit type directly
>>> train_ds = load_dataset("./data/train-*.tar", TextData, split="train")
>>>
>>> # Load from index with auto-type resolution
>>> index = LocalIndex()
>>> ds = load_dataset("@local/my-dataset", index=index, split="train")