load_dataset

load_dataset(
    path,
    sample_type=None,
    *,
    split=None,
    data_files=None,
    streaming=False,
    index=None,
)

Load a dataset from local files, remote URLs, or an index.

This function provides a Hugging Face Datasets-style interface for loading typed atdata datasets. It resolves the path, detects splits, and returns either a single Dataset or a DatasetDict, depending on the split parameter.
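To make the idea of split detection concrete, here is a minimal sketch that groups shard filenames by a split name guessed from the filename. The alias table, the `guess_split` and `group_by_split` helpers, and the fallback to "train" are all assumptions made for illustration; atdata's actual detection rules may differ.

```python
import re
from collections import defaultdict

# Hypothetical alias table: not taken from atdata, chosen for illustration only.
_SPLIT_ALIASES = {
    "train": "train",
    "training": "train",
    "test": "test",
    "val": "validation",
    "valid": "validation",
    "validation": "validation",
    "dev": "validation",
}


def guess_split(filename: str, default: str = "train") -> str:
    """Guess a split name from the alphabetic tokens of a shard filename."""
    stem = filename.rsplit("/", 1)[-1].lower()
    for token in re.split(r"[^a-z]+", stem):
        if token in _SPLIT_ALIASES:
            return _SPLIT_ALIASES[token]
    # Assumption: files with no recognizable split token fall back to `default`.
    return default


def group_by_split(filenames):
    """Group filenames into a {split_name: [files]} mapping."""
    groups = defaultdict(list)
    for name in filenames:
        groups[guess_split(name)].append(name)
    return dict(groups)


shards = [
    "data/train-000.tar",
    "data/train-001.tar",
    "data/test-000.tar",
    "data/misc.tar",
]
splits = group_by_split(shards)
print(splits)
```

A mapping of this shape is what a DatasetDict-like container would be keyed on: one entry per detected split.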

When no sample_type is provided, the function returns a Dataset[DictSample], which provides dynamic dict-like access to fields. Use .as_type(MyType) to convert it to a typed schema.
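The difference between dynamic and typed access can be sketched without atdata itself. `DictLikeSample` below is a hypothetical stand-in for DictSample, and `TextRecord` is a plain dataclass standing in for a typed schema; neither is the library's implementation, and the conversion shown is only an assumption about what a typed view amounts to conceptually.

```python
from dataclasses import dataclass


class DictLikeSample:
    """Hypothetical stand-in for DictSample: dict-style and attribute access."""

    def __init__(self, fields: dict):
        self._fields = dict(fields)

    def keys(self):
        return self._fields.keys()

    def __getitem__(self, name):
        return self._fields[name]

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so real attributes
        # such as `_fields` are never shadowed by field names.
        try:
            return self.__dict__["_fields"][name]
        except KeyError:
            raise AttributeError(name) from None


@dataclass
class TextRecord:
    """Hypothetical typed schema with the same fields as the sample."""

    text: str
    label: int


sample = DictLikeSample({"text": "hello", "label": 1})
print(list(sample.keys()))  # explore the available fields
print(sample["text"], sample.label)  # both access styles read the same data

# Typed conversion: field names are checked once, when the record is built.
typed = TextRecord(**{key: sample[key] for key in sample.keys()})
```

The dynamic form suits exploration, since no schema has to be known up front; the typed form fails early if a field is missing or misnamed.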

Parameters

Name	Type	Description	Default
path	str	Path to the dataset. Can be: - Index lookup: “@handle/dataset-name” or “@local/dataset-name” - WebDataset brace notation: “path/to/{train,test}-{000..099}.tar” - Local directory: “./data/” (scans for .tar files) - Glob pattern: “path/to/*.tar” - Remote URL: “s3://bucket/path/data-*.tar” - Single file: “path/to/data.tar”	required
sample_type	Type[ST] \| None	The PackableSample subclass defining the schema. If None, returns `Dataset[DictSample]` with dynamic field access. Can also be resolved from an index when using @handle/dataset syntax.	`None`
split	str \| None	Which split to load. If None, returns a DatasetDict with all detected splits. If specified (e.g., “train”, “test”), returns a single Dataset for that split.	`None`
data_files	str \| list[str] \| dict[str, str \| list[str]] \| None	Optional explicit mapping of data files. Can be: - str: Single file pattern - list[str]: List of file patterns (assigned to “train”) - dict[str, str \| list[str]]: Explicit split -> files mapping	`None`
streaming	bool	If True, explicitly marks the dataset for streaming mode. Note: atdata Datasets are already lazy/streaming via WebDataset pipelines, so this parameter primarily signals intent.	`False`
index	Optional['AbstractIndex']	Optional AbstractIndex for dataset lookup. Required when using @handle/dataset syntax. When provided with an indexed path, the schema can be auto-resolved from the index.	`None`
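The brace notation accepted by path expands to a list of shard names. The sketch below shows how a pattern such as “path/to/{train,test}-{000..099}.tar” unfolds, using only the standard library. The `expand_braces` helper is hypothetical, handles only non-nested groups, and is not the expansion code that atdata or WebDataset actually uses.

```python
import itertools
import re

# Matches one non-nested brace group and captures its body.
_BRACE = re.compile(r"\{([^{}]*)\}")


def _expand_group(body: str) -> list[str]:
    """Expand one brace body: a numeric range (a..b) or a comma list."""
    match = re.fullmatch(r"(\d+)\.\.(\d+)", body)
    if match:
        start, stop = match.group(1), match.group(2)
        # Preserve zero padding when the lower bound is written with it.
        width = len(start) if start.startswith("0") else 0
        return [str(i).zfill(width) for i in range(int(start), int(stop) + 1)]
    return body.split(",")


def expand_braces(pattern: str) -> list[str]:
    """Expand every brace group in `pattern` into concrete names."""
    # With one capture group, re.split alternates literal text (even indices)
    # and brace bodies (odd indices).
    parts = _BRACE.split(pattern)
    choices = [
        _expand_group(part) if i % 2 else [part]
        for i, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


names = expand_braces("path/to/{train,test}-{000..002}.tar")
for name in names:
    print(name)
```

Each brace group multiplies the number of names, so the pattern from the table above, with two splits and one hundred shard indices, yields 200 shard paths.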

Returns

Name	Type	Description
	Dataset[ST] \| DatasetDict[ST]	If split is None: DatasetDict with all detected splits.
	Dataset[ST] \| DatasetDict[ST]	If split is specified: Dataset for that split.
	Dataset[ST] \| DatasetDict[ST]	Type is `ST` if sample_type provided, otherwise `DictSample`.

Raises

Name	Type	Description
	ValueError	If the specified split is not found.
	FileNotFoundError	If no data files are found at the path.
	KeyError	If dataset not found in index.
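The relationship between the split parameter, the return value, and two of the errors above can be sketched with plain dictionaries. `select_split` is a hypothetical helper and the error messages are invented for illustration; only the exception types and the "all splits vs. one split" behavior come from the tables above.

```python
def select_split(splits: dict, split=None):
    """Return every split when `split` is None, otherwise just the named one."""
    if not splits:
        # Mirrors the documented FileNotFoundError for paths with no data files.
        raise FileNotFoundError("no data files found at the given path")
    if split is None:
        # Stand-in for a DatasetDict: a mapping with all detected splits.
        return dict(splits)
    if split not in splits:
        available = ", ".join(sorted(splits))
        raise ValueError(f"split {split!r} not found; available: {available}")
    # Stand-in for a single Dataset.
    return splits[split]


detected = {"train": ["train-000.tar"], "test": ["test-000.tar"]}

everything = select_split(detected)
train_only = select_split(detected, split="train")

try:
    select_split(detected, split="validation")
except ValueError as exc:
    print(f"caught: {exc}")
```

Callers that are unsure which splits exist can pass split=None first and inspect the keys of the result before requesting a specific split.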

Examples

>>> # Load without type - get DictSample for exploration
>>> ds = load_dataset("./data/train.tar", split="train")
>>> for sample in ds.ordered():
...     print(sample.keys())  # Explore fields
...     print(sample["text"])  # Dict-style access
...     print(sample.label)   # Attribute access
>>>
>>> # Convert to typed schema (TextData is a user-defined PackableSample subclass)
>>> typed_ds = ds.as_type(TextData)
>>>
>>> # Or load with explicit type directly
>>> train_ds = load_dataset("./data/train-*.tar", TextData, split="train")
>>>
>>> # Load from index with auto-type resolution
>>> index = LocalIndex()
>>> ds = load_dataset("@local/my-dataset", index=index, split="train")