S3Source
S3Source(
    bucket,
    keys,
    endpoint=None,
    access_key=None,
    secret_key=None,
    region=None,
    _client=None,
)
Data source for S3-compatible storage with explicit credentials.
Uses boto3 to stream directly from S3, supporting:
- Standard AWS S3
- S3-compatible endpoints (Cloudflare R2, MinIO, etc.)
- Private buckets with credentials
- IAM role authentication (when access_key and secret_key are not provided)
Unlike URL-based approaches, this doesn’t require URL transformation or global gopen_schemes registration. Credentials are scoped to the source instance.
Attributes
bucket
str
S3 bucket name.
keys
list[str]
List of object keys (paths within bucket).
endpoint
str | None
Optional custom endpoint URL for S3-compatible services.
access_key
str | None
Optional AWS access key ID.
secret_key
str | None
Optional AWS secret access key.
region
str | None
Optional AWS region (defaults to us-east-1).
Examples
>>> source = S3Source(
...     bucket="my-datasets",
...     keys=["train/shard-000.tar", "train/shard-001.tar"],
...     endpoint="https://abc123.r2.cloudflarestorage.com",
...     access_key="AKIAIOSFODNN7EXAMPLE",
...     secret_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
... )
>>> for shard_id, stream in source.shards:
...     process(stream)
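The attribute descriptions above suggest how the fields could translate into a boto3 client configuration. The following is a minimal sketch under stated assumptions, not the library's actual implementation: `build_client_kwargs` is a hypothetical helper, and the rule "pass credentials only when both are given" is an assumption. The keyword names (`endpoint_url`, `aws_access_key_id`, `aws_secret_access_key`, `region_name`) are the ones accepted by `boto3.client`.

```python
from typing import Optional


def build_client_kwargs(
    endpoint: Optional[str] = None,
    access_key: Optional[str] = None,
    secret_key: Optional[str] = None,
    region: Optional[str] = None,
) -> dict:
    """Hypothetical helper: map S3Source-style fields to boto3.client keyword arguments."""
    # The documented default region is us-east-1.
    kwargs = {"region_name": region or "us-east-1"}
    if endpoint is not None:
        # Custom endpoint for S3-compatible services (R2, MinIO, ...).
        kwargs["endpoint_url"] = endpoint
    # Assumption: credentials are passed only when both are present. Without
    # them, boto3 falls back to its default credential chain (environment
    # variables, config files, IAM role).
    if access_key is not None and secret_key is not None:
        kwargs["aws_access_key_id"] = access_key
        kwargs["aws_secret_access_key"] = secret_key
    return kwargs


r2_kwargs = build_client_kwargs(
    endpoint="https://abc123.r2.cloudflarestorage.com",
    access_key="EXAMPLE_ACCESS_KEY",
    secret_key="EXAMPLE_SECRET_KEY",
)
iam_kwargs = build_client_kwargs()
# With boto3 installed, a client could then be created as:
#     client = boto3.client("s3", **r2_kwargs)
```

Because the keyword arguments are built per call rather than read from global state, two sources with different credentials can coexist in one process, which matches the "scoped to the source instance" behavior described above.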
Methods
from_credentials
S3Source.from_credentials(credentials, bucket, keys)
Create S3Source from a credentials dictionary.
Accepts the same credential format used by S3DataStore.
Parameters
credentials
dict[str, str]
Dict with AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_ENDPOINT.
required
bucket
str
S3 bucket name.
required
keys
list[str]
List of object keys.
required
Returns
'S3Source'
Configured S3Source.
Examples
>>> creds = {
...     "AWS_ACCESS_KEY_ID": "...",
...     "AWS_SECRET_ACCESS_KEY": "...",
...     "AWS_ENDPOINT": "https://r2.example.com",
... }
>>> source = S3Source.from_credentials(creds, "my-bucket", ["data.tar"])
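The mapping from the credentials dictionary to the constructor arguments might look roughly like the sketch below. `credentials_to_kwargs` is a hypothetical helper introduced for illustration. Treating the two key fields as strictly required and `AWS_ENDPOINT` as optional follows the parameter description, but how the library actually handles missing keys is an assumption here.

```python
from typing import Dict, Optional


def credentials_to_kwargs(credentials: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Hypothetical helper: translate a credentials dict into S3Source keyword arguments."""
    return {
        # Assumption: a missing required key surfaces as a KeyError.
        "access_key": credentials["AWS_ACCESS_KEY_ID"],
        "secret_key": credentials["AWS_SECRET_ACCESS_KEY"],
        # AWS_ENDPOINT is optional; None means "use the standard AWS endpoint".
        "endpoint": credentials.get("AWS_ENDPOINT"),
    }


kwargs = credentials_to_kwargs(
    {
        "AWS_ACCESS_KEY_ID": "EXAMPLE_ACCESS_KEY",
        "AWS_SECRET_ACCESS_KEY": "EXAMPLE_SECRET_KEY",
    }
)
# The result could then be combined with bucket and keys, for example:
#     S3Source(bucket="my-bucket", keys=["data.tar"], **kwargs)
```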
from_urls
S3Source.from_urls(
    urls,
    *,
    endpoint=None,
    access_key=None,
    secret_key=None,
    region=None,
)
Create S3Source from s3:// URLs.
Parses s3://bucket/key URLs and extracts bucket and keys. All URLs must be in the same bucket.
Parameters
urls
list[str]
List of s3:// URLs.
required
endpoint
str | None
Optional custom endpoint.
None
access_key
str | None
Optional access key.
None
secret_key
str | None
Optional secret key.
None
region
str | None
Optional region.
None
Returns
'S3Source'
S3Source configured for the given URLs.
Raises
ValueError
If URLs are not valid s3:// URLs or span multiple buckets.
Examples
>>> source = S3Source.from_urls(
...     ["s3://my-bucket/train-000.tar", "s3://my-bucket/train-001.tar"],
...     endpoint="https://r2.example.com",
... )
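The URL handling described for from_urls (extract bucket and key, reject anything that is not an s3:// URL, require a single bucket) can be sketched with the standard library. `split_s3_urls` is a hypothetical name, and rejecting an empty list is an assumption; this illustrates the documented behavior rather than reproducing the library's code.

```python
from typing import List, Tuple
from urllib.parse import urlparse


def split_s3_urls(urls: List[str]) -> Tuple[str, List[str]]:
    """Hypothetical helper: split s3://bucket/key URLs into (bucket, keys)."""
    if not urls:
        # Assumption: an empty list is treated as an error.
        raise ValueError("at least one s3:// URL is required")
    buckets = set()
    keys = []
    for url in urls:
        parsed = urlparse(url)
        key = parsed.path.lstrip("/")
        # A valid URL needs the s3 scheme, a bucket name, and a non-empty key.
        if parsed.scheme != "s3" or not parsed.netloc or not key:
            raise ValueError(f"not a valid s3:// URL: {url!r}")
        buckets.add(parsed.netloc)
        keys.append(key)
    if len(buckets) != 1:
        raise ValueError(f"URLs span multiple buckets: {sorted(buckets)}")
    return buckets.pop(), keys


bucket, keys = split_s3_urls(
    ["s3://my-bucket/train-000.tar", "s3://my-bucket/train-001.tar"]
)
```

One caveat of this approach: `urlparse` splits off `?` and `#` as query and fragment, so object keys containing those characters would need extra handling.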
list_shards
Return list of S3 URIs for the shards.
open_shard
S3Source.open_shard(shard_id)
Open a single shard by S3 URI.
Parameters
shard_id
str
S3 URI of the shard (s3://bucket/key).
required
Returns
IO[bytes]
StreamingBody for reading the object.
Raises
KeyError
If shard_id is not in list_shards().
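To make the list_shards / open_shard contract concrete, here is a small in-memory stand-in that follows the same rules: shard IDs are s3://bucket/key URIs, and opening an unknown ID raises KeyError. `InMemorySource` is purely hypothetical and uses `io.BytesIO` where the real class returns a StreamingBody; it may be useful as a test double, but it says nothing about how S3Source is implemented.

```python
import io
from typing import IO, Dict, List


class InMemorySource:
    """Hypothetical stand-in that mimics the documented shard interface."""

    def __init__(self, bucket: str, objects: Dict[str, bytes]) -> None:
        # Index the objects by their full S3 URI, as list_shards() reports them.
        self._objects = {f"s3://{bucket}/{key}": data for key, data in objects.items()}

    def list_shards(self) -> List[str]:
        return list(self._objects)

    def open_shard(self, shard_id: str) -> IO[bytes]:
        if shard_id not in self._objects:
            raise KeyError(shard_id)
        return io.BytesIO(self._objects[shard_id])


source = InMemorySource("my-bucket", {"train-000.tar": b"abc", "train-001.tar": b"de"})
sizes = {}
for shard_id in source.list_shards():
    # Each shard is opened by the same URI that list_shards() returned.
    with source.open_shard(shard_id) as stream:
        sizes[shard_id] = len(stream.read())
```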