URLSource

URLSource(url)

Data source for WebDataset-compatible URLs.

Wraps WebDataset’s gopen to open URLs through its built-in handlers for schemes such as http, https, pipe, gs, hf, and sftp. Supports brace expansion for shard patterns such as “data-{000..099}.tar”.

This is the default source type when a string URL is passed to Dataset.

Attributes

Name Type Description
url str URL or brace pattern for the shards.

Examples

>>> source = URLSource("https://example.com/train-{000..009}.tar")
>>> for shard_id, stream in source.shards:
...     print(f"Streaming {shard_id}")

Methods

Name Description
list_shards Expand brace pattern and return list of shard URLs.
open_shard Open a single shard by URL.

list_shards

URLSource.list_shards()

Expand brace pattern and return list of shard URLs.
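
To make the expansion step concrete, here is a minimal standard-library sketch of numeric brace expansion. It is not URLSource's actual implementation: the helper name expand_numeric_braces is hypothetical, it handles only zero-padded {start..end} ranges (no comma lists such as {a,b}), and it assumes the padding width is taken from the first bound.

```python
import itertools
import re

# Matches a numeric range such as {000..099}.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_numeric_braces(pattern: str) -> list[str]:
    """Expand every {start..end} range in `pattern` (illustrative helper only)."""
    # re.split with two capture groups yields:
    # [literal, start, end, literal, start, end, ..., literal]
    parts = _RANGE.split(pattern)
    literals = parts[0::3]

    # Build the list of zero-padded values for each range.
    choices = []
    for start, end in zip(parts[1::3], parts[2::3]):
        width = len(start)  # assumption: pad to the width of the first bound
        choices.append([f"{n:0{width}d}" for n in range(int(start), int(end) + 1)])

    # Interleave the literal text with every combination of range values.
    urls = []
    for combo in itertools.product(*choices):
        pieces = [literals[0]]
        for value, literal in zip(combo, literals[1:]):
            pieces.append(value)
            pieces.append(literal)
        urls.append("".join(pieces))
    return urls


shards = expand_numeric_braces("https://example.com/train-{000..009}.tar")
print(len(shards), shards[0], shards[-1])
```

A pattern without braces comes back as a single-element list, which matches the idea that a plain URL names exactly one shard.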

open_shard

URLSource.open_shard(shard_id)

Open a single shard by URL.

Parameters

Name Type Description Default
shard_id str URL of the shard to open. required

Returns

Type Description
IO[bytes] File-like stream from gopen.

Raises

Type Description
KeyError If shard_id is not in list_shards().
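
The list_shards/open_shard contract described above can be sketched with a stand-in class. Everything here is an assumption for illustration: InMemorySource and read_shard are hypothetical names, and io.BytesIO stands in for the stream that gopen would return, so the snippet runs without network access or the library itself.

```python
import io
from typing import IO, Optional


class InMemorySource:
    """Hypothetical stand-in that mirrors the list_shards/open_shard contract."""

    def __init__(self, shards: dict[str, bytes]) -> None:
        self._shards = shards

    def list_shards(self) -> list[str]:
        # Analogous to returning the expanded shard URLs.
        return list(self._shards)

    def open_shard(self, shard_id: str) -> IO[bytes]:
        # Reject any identifier that list_shards() does not report.
        if shard_id not in self.list_shards():
            raise KeyError(shard_id)
        return io.BytesIO(self._shards[shard_id])


def read_shard(source: InMemorySource, shard_id: str) -> Optional[bytes]:
    """Return the shard's bytes, or None when the shard is unknown."""
    try:
        with source.open_shard(shard_id) as stream:
            return stream.read()
    except KeyError:
        return None


source = InMemorySource({"train-000.tar": b"fake tar bytes"})
print(read_shard(source, "train-000.tar"))
print(read_shard(source, "train-999.tar"))
```

Callers that get shard identifiers from list_shards() should never hit the KeyError path; the handler matters mainly when identifiers come from user input or a stale cache.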