DatasetDict

DatasetDict(splits=None, sample_type=None, streaming=False)

A dictionary mapping split names to Dataset instances.

Similar to Hugging Face's DatasetDict, this class is a container for multiple dataset splits (train, test, validation, etc.) with convenience methods that operate across all splits.

Parameters

Name    Type    Description                                       Default
ST      -       The sample type for all datasets in this dict.    required

Examples

>>> ds_dict = load_dataset("path/to/data", MyData)
>>> train = ds_dict["train"]
>>> test = ds_dict["test"]
>>>
>>> # Iterate over all splits
>>> for split_name, dataset in ds_dict.items():
...     print(f"{split_name}: {len(dataset.shard_list)} shards")

Attributes

Name Description
num_shards Number of shards in each split.
sample_type The sample type for datasets in this dict.
streaming Whether this DatasetDict was loaded in streaming mode.
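
To try the iteration pattern and the attributes above without real data, here is a minimal sketch built on hypothetical stand-ins. `ToyDataset` and `ToyDatasetDict` are illustrative names, not part of the library, and the shard file names are made up. The sketch assumes that `num_shards` maps each split name to its shard count, which the description above suggests but does not confirm.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    """Hypothetical stand-in for Dataset: it only tracks its shard paths."""
    shard_list: list[str] = field(default_factory=list)


class ToyDatasetDict(dict):
    """Hypothetical stand-in for DatasetDict; not the library's implementation."""

    def __init__(self, splits=None, sample_type=None, streaming=False):
        # Store the split-name -> dataset mapping in the underlying dict.
        super().__init__(splits or {})
        self.sample_type = sample_type
        self.streaming = streaming

    @property
    def num_shards(self):
        # Assumption: one shard count per split name.
        return {name: len(ds.shard_list) for name, ds in self.items()}


class MyData:
    """Placeholder sample type, standing in for a real sample class."""


ds_dict = ToyDatasetDict(
    splits={
        "train": ToyDataset(["train-000.tar", "train-001.tar"]),
        "test": ToyDataset(["test-000.tar"]),
    },
    sample_type=MyData,
)

# Same iteration pattern as in the Examples section.
for split_name, dataset in ds_dict.items():
    print(f"{split_name}: {len(dataset.shard_list)} shards")

print(ds_dict.num_shards)
```

Because the stand-in subclasses `dict`, indexing such as `ds_dict["train"]` and iteration with `.items()` behave as in the Examples section. The real class may expose these through a different mechanism.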