Dataset
Dataset(source=None, metadata_url=None, *, url=None)

A typed dataset built on WebDataset with lens transformations.
This class wraps WebDataset tar archives and provides type-safe iteration over samples of a specific PackableSample type. Samples are stored as msgpack-serialized data within WebDataset shards.
The dataset supports:

- Ordered and shuffled iteration
- Automatic batching with SampleBatch
- Type transformations via the lens system (as_type())
- Export to parquet format
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| ST | | The sample type for this dataset; must derive from PackableSample. | required |
Attributes
| Name | Type | Description |
|---|---|---|
| url | | WebDataset brace-notation URL for the tar file(s). |
Examples
>>> ds = Dataset[MyData]("path/to/data-{000000..000009}.tar")
>>> for sample in ds.ordered(batch_size=32):
... # sample is SampleBatch[MyData] with batch_size samples
... embeddings = sample.embeddings # shape: (32, ...)
...
>>> # Transform to a different view
>>> ds_view = ds.as_type(MyDataView)

Note
This class uses Python’s __orig_class__ mechanism to extract the type parameter at runtime. Instances must be created using the subscripted syntax Dataset[MyType](url) rather than calling the constructor directly with an unsubscripted class.
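For readers unfamiliar with `__orig_class__`: Python's typing machinery attaches this attribute to an instance when it is created through a subscripted generic alias, which is why `Dataset[MyType](url)` works while a bare `Dataset(url)` cannot recover the type parameter. A minimal standalone sketch of the mechanism, using a hypothetical `Box` class unrelated to this library:

```python
from typing import Generic, TypeVar, get_args

T = TypeVar("T")

class Box(Generic[T]):
    def sample_type(self) -> type:
        # __orig_class__ is attached by the typing machinery only when the
        # instance is created via the subscripted alias, e.g. Box[int]().
        # It is not available inside __init__, only afterwards.
        return get_args(self.__orig_class__)[0]

b = Box[int]()
b.sample_type()  # → <class 'int'>
```

Creating the instance as `Box()` instead would leave `__orig_class__` unset and raise AttributeError, mirroring the constraint described above.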
Methods
| Name | Description |
|---|---|
| as_type | View this dataset through a different sample type using a registered lens. |
| list_shards | Get list of individual dataset shards. |
| ordered | Iterate over the dataset in order. |
| shuffled | Iterate over the dataset in random order. |
| to_parquet | Export dataset contents to parquet format. |
| wrap | Wrap a raw msgpack sample into the appropriate dataset-specific type. |
| wrap_batch | Wrap a batch of raw msgpack samples into a typed SampleBatch. |
as_type
Dataset.as_type(other)

View this dataset through a different sample type using a registered lens.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| other | Type[RT] | The target sample type to transform into. Must be a type derived from PackableSample. |
required |
Returns
| Name | Type | Description |
|---|---|---|
| | Dataset[RT] | A new Dataset instance that yields samples of type other by applying the appropriate lens transformation from the global LensNetwork registry. |
Raises
| Name | Type | Description |
|---|---|---|
| ValueError | If no registered lens exists between the current sample type and the target type. |
list_shards
Dataset.list_shards()

Get list of individual dataset shards.
Returns
| Name | Type | Description |
|---|---|---|
| | list[str] | A full (non-lazy) list of the individual tar files within the source WebDataset. |
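Shard expansion follows WebDataset's brace notation, where `data-{000000..000009}.tar` names ten zero-padded tar files. As an illustration only, a pure-Python sketch of that expansion (the `expand_braces` helper is hypothetical; the real library resolves the URL internally):

```python
import re

def expand_braces(url: str) -> list[str]:
    # Hypothetical re-implementation of WebDataset-style {start..end}
    # expansion, preserving the zero-padding width of the start value.
    m = re.search(r"\{(\d+)\.\.(\d+)\}", url)
    if m is None:
        return [url]
    start, end = m.group(1), m.group(2)
    width = len(start)
    return [
        url[:m.start()] + str(i).zfill(width) + url[m.end():]
        for i in range(int(start), int(end) + 1)
    ]

expand_braces("data-{000000..000002}.tar")
# → ['data-000000.tar', 'data-000001.tar', 'data-000002.tar']
```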
ordered
Dataset.ordered(batch_size=None)

Iterate over the dataset in order.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| batch_size | int \| None | The size of iterated batches. If None, iterates over one sample at a time with no batch dimension. | None |
Returns
| Name | Type | Description |
|---|---|---|
| | Iterable[ST] \| Iterable[SampleBatch[ST]] | A data pipeline that iterates over the dataset in its original sample order. When batch_size is None, yields individual samples of type ST. When batch_size is an integer, yields SampleBatch[ST] instances containing that many samples. |
Examples
>>> for sample in ds.ordered():
... process(sample) # sample is ST
>>> for batch in ds.ordered(batch_size=32):
... process(batch) # batch is SampleBatch[ST]

shuffled
Dataset.shuffled(buffer_shards=100, buffer_samples=10000, batch_size=None)

Iterate over the dataset in random order.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| buffer_shards | int | Number of shards to buffer for shuffling at the shard level. Larger values increase randomness but use more memory. Default: 100. | 100 |
| buffer_samples | int | Number of samples to buffer for shuffling within shards. Larger values increase randomness but use more memory. Default: 10,000. | 10000 |
| batch_size | int \| None | The size of iterated batches. If None, iterates over one sample at a time with no batch dimension. | None |
Returns
| Name | Type | Description |
|---|---|---|
| | Iterable[ST] \| Iterable[SampleBatch[ST]] | A data pipeline that iterates over the dataset in randomized order. When batch_size is None, yields individual samples of type ST. When batch_size is an integer, yields SampleBatch[ST] instances containing that many samples. |
Examples
>>> for sample in ds.shuffled():
... process(sample) # sample is ST
>>> for batch in ds.shuffled(batch_size=32):
... process(batch) # batch is SampleBatch[ST]

to_parquet
Dataset.to_parquet(path, sample_map=None, maxcount=None, **kwargs)

Export dataset contents to parquet format.
Converts all samples to a pandas DataFrame and saves to parquet file(s). Useful for interoperability with data analysis tools.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| path | Pathlike | Output path for the parquet file. If maxcount is specified, files are named {stem}-{segment:06d}.parquet. | required |
| sample_map | Optional[SampleExportMap] | Optional function to convert samples to dictionaries. Defaults to dataclasses.asdict. | None |
| maxcount | Optional[int] | If specified, split output into multiple files with at most this many samples each. Recommended for large datasets. | None |
| **kwargs | | Additional arguments passed to pandas.DataFrame.to_parquet(). Common options include compression, index, engine. | {} |
Warning
Memory Usage: When maxcount=None (default), this method loads the entire dataset into memory as a pandas DataFrame before writing. For large datasets, this can cause memory exhaustion.
For datasets larger than available RAM, always specify maxcount:

    # Safe for large datasets - processes in chunks
    ds.to_parquet("output.parquet", maxcount=10000)
This creates multiple parquet files: output-000000.parquet, output-000001.parquet, etc.
Examples
>>> ds = Dataset[MySample]("data.tar")
>>> # Small dataset - load all at once
>>> ds.to_parquet("output.parquet")
>>>
>>> # Large dataset - process in chunks
>>> ds.to_parquet("output.parquet", maxcount=50000)

wrap
Dataset.wrap(sample)

Wrap a raw msgpack sample into the appropriate dataset-specific type.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| sample | WDSRawSample | A dictionary containing at minimum a 'msgpack' key with serialized sample bytes. |
required |
Returns
| Name | Type | Description |
|---|---|---|
| | ST | A deserialized sample of type ST, optionally transformed through a lens if as_type() was called. |
wrap_batch
Dataset.wrap_batch(batch)

Wrap a batch of raw msgpack samples into a typed SampleBatch.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| batch | WDSRawBatch | A dictionary containing a 'msgpack' key with a list of serialized sample bytes. |
required |
Returns
| Name | Type | Description |
|---|---|---|
| | SampleBatch[ST] | A SampleBatch[ST] containing deserialized samples, optionally transformed through a lens if as_type() was called. |
Note
This implementation deserializes samples one at a time, then aggregates them into a batch.
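That per-sample-then-aggregate shape can be sketched in isolation. Here `json` stands in for msgpack so the snippet stays dependency-free, and `wrap_one` plays the role of the per-sample wrap() step; both substitutions are illustrative, not the library's actual code:

```python
import json

def wrap_batch_sketch(batch: dict, wrap_one) -> list:
    # Mirrors the documented behavior: deserialize each raw sample
    # individually via wrap_one, then aggregate the results. The real
    # class decodes msgpack bytes into the ST type and returns a
    # SampleBatch rather than a plain list.
    return [wrap_one(raw) for raw in batch["msgpack"]]

raw_batch = {"msgpack": [json.dumps({"x": i}).encode() for i in range(3)]}
wrap_batch_sketch(raw_batch, lambda b: json.loads(b))
# → [{'x': 0}, {'x': 1}, {'x': 2}]
```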