Dataset
Dataset(source=None, metadata_url=None, *, url=None)

A typed dataset built on WebDataset with lens transformations.
This class wraps WebDataset tar archives and provides type-safe iteration over samples of a specific PackableSample type. Samples are stored as msgpack-serialized data within WebDataset shards.
The dataset supports:

- Ordered and shuffled iteration
- Automatic batching with SampleBatch
- Type transformations via the lens system (as_type())
- Export to parquet format
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| ST | | The sample type for this dataset; must derive from PackableSample. | required |
Attributes
| Name | Type | Description |
|---|---|---|
| url | | WebDataset brace-notation URL for the tar file(s). |
Examples
>>> ds = Dataset[MyData]("path/to/data-{000000..000009}.tar")
>>> for sample in ds.ordered(batch_size=32):
... # sample is SampleBatch[MyData] with batch_size samples
... embeddings = sample.embeddings # shape: (32, ...)
...
>>> # Transform to a different view
>>> ds_view = ds.as_type(MyDataView)

Note
This class uses Python’s __orig_class__ mechanism to extract the type parameter at runtime. Instances must be created using the subscripted syntax Dataset[MyType](url) rather than calling the constructor directly with an unsubscripted class.
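For readers unfamiliar with `__orig_class__`: Python's typing machinery attaches this attribute to an instance when it is created through a subscripted generic alias, which is why `Dataset[MyType](url)` works while a bare `Dataset(url)` cannot recover the type parameter. A minimal standalone sketch of the mechanism, using a hypothetical `Box` class unrelated to this library:

```python
from typing import Generic, TypeVar, get_args

T = TypeVar("T")

class Box(Generic[T]):
    def sample_type(self) -> type:
        # __orig_class__ is attached by the typing machinery only when the
        # instance is created via the subscripted alias, e.g. Box[int]().
        # It is not available inside __init__, only afterwards.
        return get_args(self.__orig_class__)[0]

b = Box[int]()
b.sample_type()  # → <class 'int'>
```

Creating the instance as `Box()` instead would leave `__orig_class__` unset and raise AttributeError, mirroring the constraint described above.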
Methods
| Name | Description |
|---|---|
| as_type | View this dataset through a different sample type using a registered lens. |
| list_shards | Get list of individual dataset shards. |
| ordered | Iterate over the dataset in order. |
| shuffled | Iterate over the dataset in random order. |
| to_parquet | Export dataset contents to parquet format. |
| wrap | Wrap a raw msgpack sample into the appropriate dataset-specific type. |
| wrap_batch | Wrap a batch of raw msgpack samples into a typed SampleBatch. |
as_type
Dataset.as_type(other)

View this dataset through a different sample type using a registered lens.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| other | Type[RT] | The target sample type to transform into. Must be a type derived from PackableSample. |
required |
Returns
| Name | Type | Description |
|---|---|---|
| | Dataset[RT] | A new Dataset instance that yields samples of type other by applying the appropriate lens transformation from the global LensNetwork registry. |
Raises
| Name | Type | Description |
|---|---|---|
| ValueError | If no registered lens exists between the current sample type and the target type. |
list_shards
Dataset.list_shards()

Get list of individual dataset shards.
Returns
| Name | Type | Description |
|---|---|---|
| | list[str] | A full (non-lazy) list of the individual tar files within the source WebDataset. |
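Shard expansion follows WebDataset's brace notation, where `data-{000000..000009}.tar` names ten zero-padded tar files. As an illustration only, a pure-Python sketch of that expansion (the `expand_braces` helper is hypothetical; the real library resolves the URL internally):

```python
import re

def expand_braces(url: str) -> list[str]:
    # Hypothetical re-implementation of WebDataset-style {start..end}
    # expansion, preserving the zero-padding width of the start value.
    m = re.search(r"\{(\d+)\.\.(\d+)\}", url)
    if m is None:
        return [url]
    start, end = m.group(1), m.group(2)
    width = len(start)
    return [
        url[:m.start()] + str(i).zfill(width) + url[m.end():]
        for i in range(int(start), int(end) + 1)
    ]

expand_braces("data-{000000..000002}.tar")
# → ['data-000000.tar', 'data-000001.tar', 'data-000002.tar']
```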
ordered
Dataset.ordered(batch_size=None)

Iterate over the dataset in order.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| batch_size | int \| None | The size of iterated batches. If None, iterates over one sample at a time with no batch dimension. | None |
Returns
| Name | Type | Description |
|---|---|---|
| | Iterable[ST] \| Iterable[SampleBatch[ST]] | A data pipeline that iterates over the dataset in its original sample order. When batch_size is None, yields individual samples of type ST. When batch_size is an integer, yields SampleBatch[ST] instances containing that many samples. |
Examples
>>> for sample in ds.ordered():
... process(sample) # sample is ST
>>> for batch in ds.ordered(batch_size=32):
... process(batch) # batch is SampleBatch[ST]

shuffled
Dataset.shuffled(buffer_shards=100, buffer_samples=10000, batch_size=None)

Iterate over the dataset in random order.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| buffer_shards | int | Number of shards to buffer for shuffling at the shard level. Larger values increase randomness but use more memory. Default: 100. | 100 |
| buffer_samples | int | Number of samples to buffer for shuffling within shards. Larger values increase randomness but use more memory. Default: 10,000. | 10000 |
| batch_size | int \| None | The size of iterated batches. If None, iterates over one sample at a time with no batch dimension. | None |
Returns
| Name | Type | Description |
|---|---|---|
| | Iterable[ST] \| Iterable[SampleBatch[ST]] | A data pipeline that iterates over the dataset in randomized order. When batch_size is None, yields individual samples of type ST. When batch_size is an integer, yields SampleBatch[ST] instances containing that many samples. |
Examples
>>> for sample in ds.shuffled():
... process(sample) # sample is ST
>>> for batch in ds.shuffled(batch_size=32):
... process(batch) # batch is SampleBatch[ST]

to_parquet
Dataset.to_parquet(path, sample_map=None, maxcount=None, **kwargs)

Export dataset contents to parquet format.
Converts all samples to a pandas DataFrame and saves to parquet file(s). Useful for interoperability with data analysis tools.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| path | Pathlike | Output path for the parquet file. If maxcount is specified, files are named {stem}-{segment:06d}.parquet. | required |
| sample_map | Optional[SampleExportMap] | Optional function to convert samples to dictionaries. Defaults to dataclasses.asdict. | None |
| maxcount | Optional[int] | If specified, split output into multiple files with at most this many samples each. Recommended for large datasets. | None |
| **kwargs | | Additional arguments passed to pandas.DataFrame.to_parquet(). Common options include compression, index, engine. | {} |
Warning
Memory Usage: When maxcount=None (default), this method loads the entire dataset into memory as a pandas DataFrame before writing. For large datasets, this can cause memory exhaustion.
For datasets larger than available RAM, always specify maxcount:

    # Safe for large datasets - processes in chunks
    ds.to_parquet("output.parquet", maxcount=10000)
This creates multiple parquet files: output-000000.parquet, output-000001.parquet, etc.
Examples
>>> ds = Dataset[MySample]("data.tar")
>>> # Small dataset - load all at once
>>> ds.to_parquet("output.parquet")
>>>
>>> # Large dataset - process in chunks
>>> ds.to_parquet("output.parquet", maxcount=50000)

wrap
Dataset.wrap(sample)

Wrap a raw msgpack sample into the appropriate dataset-specific type.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| sample | WDSRawSample | A dictionary containing at minimum a 'msgpack' key with serialized sample bytes. |
required |
Returns
| Name | Type | Description |
|---|---|---|
| | ST | A deserialized sample of type ST, optionally transformed through a lens if as_type() was called. |
wrap_batch
Dataset.wrap_batch(batch)

Wrap a batch of raw msgpack samples into a typed SampleBatch.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| batch | WDSRawBatch | A dictionary containing a 'msgpack' key with a list of serialized sample bytes. |
required |
Returns
| Name | Type | Description |
|---|---|---|
| | SampleBatch[ST] | A SampleBatch[ST] containing deserialized samples, optionally transformed through a lens if as_type() was called. |
Note
This implementation deserializes samples one at a time, then aggregates them into a batch.
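That per-sample-then-aggregate shape can be sketched in isolation. Here `json` stands in for msgpack so the snippet stays dependency-free, and `wrap_one` plays the role of the per-sample wrap() step; both substitutions are illustrative, not the library's actual code:

```python
import json

def wrap_batch_sketch(batch: dict, wrap_one) -> list:
    # Mirrors the documented behavior: deserialize each raw sample
    # individually via wrap_one, then aggregate the results. The real
    # class decodes msgpack bytes into the ST type and returns a
    # SampleBatch rather than a plain list.
    return [wrap_one(raw) for raw in batch["msgpack"]]

raw_batch = {"msgpack": [json.dumps({"x": i}).encode() for i in range(3)]}
wrap_batch_sketch(raw_batch, lambda b: json.loads(b))
# → [{'x': 0}, {'x': 1}, {'x': 2}]
```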