Dataset

Dataset(source=None, metadata_url=None, *, url=None)

A typed dataset built on WebDataset with lens transformations.

This class wraps WebDataset tar archives and provides type-safe iteration over samples of a specific PackableSample type. Samples are stored as msgpack-serialized data within WebDataset shards.

The dataset supports:

- Ordered and shuffled iteration
- Automatic batching with SampleBatch
- Type transformations via the lens system (as_type())
- Export to parquet format

Parameters

Name Type Description Default
ST The sample type for this dataset; must derive from PackableSample. required

Attributes

Name Type Description
url WebDataset brace-notation URL for the tar file(s).

Examples

>>> ds = Dataset[MyData]("path/to/data-{000000..000009}.tar")
>>> for sample in ds.ordered(batch_size=32):
...     # sample is SampleBatch[MyData] with batch_size samples
...     embeddings = sample.embeddings  # shape: (32, ...)
...
>>> # Transform to a different view
>>> ds_view = ds.as_type(MyDataView)

Note

This class uses Python’s __orig_class__ mechanism to extract the type parameter at runtime. Instances must be created using the subscripted syntax Dataset[MyType](url) rather than calling the constructor directly with an unsubscripted class.
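The mechanism can be illustrated with a minimal sketch using only the standard library; `Box` and `Item` are hypothetical stand-ins, not part of this library:

```python
# Illustrative sketch of the __orig_class__ mechanism; Box and Item are
# hypothetical stand-ins, not part of this library.
from typing import Generic, TypeVar, get_args

T = TypeVar("T")


class Item:
    pass


class Box(Generic[T]):
    def sample_type(self) -> type:
        # __orig_class__ is only set on instances created via the
        # subscripted syntax Box[Item](), and only after __init__ returns,
        # which is why the unsubscripted constructor cannot be used.
        return get_args(self.__orig_class__)[0]


box = Box[Item]()
print(box.sample_type())  # <class '__main__.Item'>
```

Calling `Box().sample_type()` on an unsubscripted instance would raise AttributeError, mirroring the constraint described above.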

Methods

Name Description
as_type View this dataset through a different sample type using a registered lens.
list_shards Get list of individual dataset shards.
ordered Iterate over the dataset in order.
shuffled Iterate over the dataset in random order.
to_parquet Export dataset contents to parquet format.
wrap Wrap a raw msgpack sample into the appropriate dataset-specific type.
wrap_batch Wrap a batch of raw msgpack samples into a typed SampleBatch.

as_type

Dataset.as_type(other)

View this dataset through a different sample type using a registered lens.

Parameters

Name Type Description Default
other Type[RT] The target sample type to transform into. Must be a type derived from PackableSample. required

Returns

Name Type Description
Dataset[RT] A new Dataset instance that yields samples of type other by applying the appropriate lens transformation from the global LensNetwork registry.

Raises

Name Type Description
ValueError If no registered lens exists between the current sample type and the target type.
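The lookup-or-raise behavior can be sketched with a toy registry; the real LensNetwork API is not shown here, so `register_lens` and `find_lens` are hypothetical names:

```python
# Toy sketch of lens lookup; register_lens/find_lens are hypothetical
# names, not the real LensNetwork API.
from dataclasses import dataclass


@dataclass
class A:
    x: int


@dataclass
class B:
    doubled: int


_LENSES = {}  # (source_type, target_type) -> transform function


def register_lens(src, dst, fn):
    _LENSES[(src, dst)] = fn


def find_lens(src, dst):
    try:
        return _LENSES[(src, dst)]
    except KeyError:
        # Mirrors the ValueError described above when no lens is registered.
        raise ValueError(f"no registered lens from {src.__name__} to {dst.__name__}")


register_lens(A, B, lambda a: B(doubled=2 * a.x))
print(find_lens(A, B)(A(x=3)))  # B(doubled=6)
```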

list_shards

Dataset.list_shards()

Get list of individual dataset shards.

Returns

Name Type Description
list[str] A full (non-lazy) list of the individual tar files within the source WebDataset.
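Conceptually, this expands the brace-notation URL into concrete shard paths. A minimal sketch, handling only the simple numeric range form shown in the class examples (the real implementation presumably relies on full WebDataset brace expansion):

```python
# Sketch of brace-notation expansion for the "{000000..000009}" form only;
# not the library's actual implementation.
import re


def expand_shards(url: str) -> list[str]:
    m = re.search(r"\{(\d+)\.\.(\d+)\}", url)
    if m is None:
        return [url]  # no brace range: a single shard
    lo, hi = m.group(1), m.group(2)
    width = len(lo)  # preserve zero-padding of the range endpoints
    return [
        url[: m.start()] + str(i).zfill(width) + url[m.end():]
        for i in range(int(lo), int(hi) + 1)
    ]


print(expand_shards("data-{000000..000002}.tar"))
# ['data-000000.tar', 'data-000001.tar', 'data-000002.tar']
```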

ordered

Dataset.ordered(batch_size=None)

Iterate over the dataset in order.

Parameters

Name Type Description Default
batch_size int | None The size of iterated batches. If None, iterates over one sample at a time with no batch dimension. None

Returns

Name Type Description
Iterable[ST] | Iterable[SampleBatch[ST]] A data pipeline that iterates over the dataset in its original sample order. When batch_size is None, yields individual samples of type ST. When batch_size is an integer, yields SampleBatch[ST] instances containing that many samples.

Examples

>>> for sample in ds.ordered():
...     process(sample)  # sample is ST
>>> for batch in ds.ordered(batch_size=32):
...     process(batch)  # batch is SampleBatch[ST]

shuffled

Dataset.shuffled(buffer_shards=100, buffer_samples=10000, batch_size=None)

Iterate over the dataset in random order.

Parameters

Name Type Description Default
buffer_shards int Number of shards to buffer for shuffling at the shard level. Larger values increase randomness but use more memory. 100
buffer_samples int Number of samples to buffer for shuffling within shards. Larger values increase randomness but use more memory. 10000
batch_size int | None The size of iterated batches. If None, iterates over one sample at a time with no batch dimension. None

Returns

Name Type Description
Iterable[ST] | Iterable[SampleBatch[ST]] A data pipeline that iterates over the dataset in randomized order. When batch_size is None, yields individual samples of type ST. When batch_size is an integer, yields SampleBatch[ST] instances containing that many samples.

Examples

>>> for sample in ds.shuffled():
...     process(sample)  # sample is ST
>>> for batch in ds.shuffled(batch_size=32):
...     process(batch)  # batch is SampleBatch[ST]

to_parquet

Dataset.to_parquet(path, sample_map=None, maxcount=None, **kwargs)

Export dataset contents to parquet format.

Converts all samples to a pandas DataFrame and saves to parquet file(s). Useful for interoperability with data analysis tools.

Parameters

Name Type Description Default
path Pathlike Output path for the parquet file. If maxcount is specified, files are named {stem}-{segment:06d}.parquet. required
sample_map Optional[SampleExportMap] Optional function to convert samples to dictionaries. Defaults to dataclasses.asdict. None
maxcount Optional[int] If specified, split output into multiple files with at most this many samples each. Recommended for large datasets. None
**kwargs Additional arguments passed to pandas.DataFrame.to_parquet(). Common options include compression, index, engine. {}

Warning

Memory Usage: When maxcount=None (default), this method loads the entire dataset into memory as a pandas DataFrame before writing. For large datasets, this can cause memory exhaustion.

For datasets larger than available RAM, always specify maxcount:

# Safe for large datasets - processes in chunks
ds.to_parquet("output.parquet", maxcount=10000)

This creates multiple parquet files: output-000000.parquet, output-000001.parquet, etc.

Examples

>>> ds = Dataset[MySample]("data.tar")
>>> # Small dataset - load all at once
>>> ds.to_parquet("output.parquet")
>>>
>>> # Large dataset - process in chunks
>>> ds.to_parquet("output.parquet", maxcount=50000)
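A custom sample_map can trim each sample before export. A minimal sketch, assuming a hypothetical MySample dataclass with a bulky debug field (the default behavior, per the table above, is dataclasses.asdict):

```python
# Sketch of a custom sample_map; MySample and its fields are hypothetical.
import dataclasses


@dataclasses.dataclass
class MySample:
    key: str
    embedding: list[float]
    debug_info: str


def slim_map(sample: MySample) -> dict:
    # Start from the default conversion, then drop columns not worth
    # exporting to parquet.
    row = dataclasses.asdict(sample)
    row.pop("debug_info")
    return row


print(slim_map(MySample("a", [0.1, 0.2], "trace...")))
# {'key': 'a', 'embedding': [0.1, 0.2]}
```

This would be passed as `ds.to_parquet("output.parquet", sample_map=slim_map, maxcount=50000)`.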

wrap

Dataset.wrap(sample)

Wrap a raw msgpack sample into the appropriate dataset-specific type.

Parameters

Name Type Description Default
sample WDSRawSample A dictionary containing at minimum a 'msgpack' key with serialized sample bytes. required

Returns

Name Type Description
ST A deserialized sample of type ST, optionally transformed through a lens if as_type() was called.

wrap_batch

Dataset.wrap_batch(batch)

Wrap a batch of raw msgpack samples into a typed SampleBatch.

Parameters

Name Type Description Default
batch WDSRawBatch A dictionary containing a 'msgpack' key with a list of serialized sample bytes. required

Returns

Name Type Description
SampleBatch[ST] A SampleBatch[ST] containing deserialized samples, optionally transformed through a lens if as_type() was called.

Note

This implementation deserializes samples one at a time, then aggregates them into a batch.
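The one-at-a-time deserialization noted above can be sketched as follows, using json as a stand-in for msgpack so the example is self-contained, and a plain list in place of the library's typed SampleBatch container:

```python
# Sketch of per-sample deserialization followed by aggregation; json stands
# in for msgpack, and the returned list stands in for SampleBatch[ST].
import json


def wrap_one(raw: bytes) -> dict:
    return json.loads(raw)


def wrap_batch(batch: dict) -> list[dict]:
    # batch mirrors WDSRawBatch: a 'msgpack' key holding a list of
    # serialized sample payloads.
    return [wrap_one(raw) for raw in batch["msgpack"]]


raw = {"msgpack": [json.dumps({"x": i}).encode() for i in range(3)]}
print(wrap_batch(raw))  # [{'x': 0}, {'x': 1}, {'x': 2}]
```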