S3FileLister
- class torchdata.datapipes.iter.S3FileLister(source_datapipe: IterDataPipe[str], length: int = -1, request_timeout_ms=-1, region='')
Iterable DataPipe that lists Amazon S3 file URLs with the given prefixes (functional name:
list_files_by_s3). Acceptable prefixes include s3://bucket-name, s3://bucket-name/, and s3://bucket-name/folder.
Note
- source_datapipe must contain a list of valid S3 URLs.
- length is -1 by default, and any call to __len__() is invalid, because the length is unknown until all files are iterated.
- request_timeout_ms and region will overwrite settings in the configuration file or environment variables.
- Parameters:
source_datapipe – a DataPipe that contains URLs/URL prefixes to s3 files
length – Nominal length of the datapipe
request_timeout_ms – timeout setting for each request (3,000ms by default)
region – region for accessing files (inferred from credentials by default)
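The nominal-length rule can be illustrated without touching S3. The sketch below is not torchdata's implementation: `NominalLengthPipe` and `safe_len` are hypothetical names, and the class only mimics the documented behaviour, assuming an invalid `__len__()` call surfaces as a `TypeError`.

```python
from typing import Iterable, Iterator, Optional


class NominalLengthPipe:
    """Hypothetical stand-in (not part of torchdata) that mimics the length rule."""

    def __init__(self, items: Iterable[str], length: int = -1) -> None:
        self.items = items
        self.length = length  # -1 means "unknown until fully iterated"

    def __iter__(self) -> Iterator[str]:
        yield from self.items

    def __len__(self) -> int:
        # Assumption: an unknown length is reported by raising TypeError.
        if self.length == -1:
            raise TypeError(f"{type(self).__name__} instance doesn't have valid length")
        return self.length


def safe_len(pipe) -> Optional[int]:
    """Return len(pipe), or None when the pipe has no valid length."""
    try:
        return len(pipe)
    except TypeError:
        return None


unknown = NominalLengthPipe(["s3://bucket-name/a", "s3://bucket-name/b"])
known = NominalLengthPipe(["s3://bucket-name/a", "s3://bucket-name/b"], length=2)
```

Here `safe_len(unknown)` yields `None` while `len(known)` yields the nominal value passed in. The nominal value is whatever the caller supplied and is not checked against the number of items actually listed.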
Example
>>> from torchdata.datapipes.iter import IterableWrapper, S3FileLister
>>> s3_prefixes = IterableWrapper(['s3://bucket-name/folder/', ...])
>>> dp_s3_urls = S3FileLister(s3_prefixes)
>>> for d in dp_s3_urls:
...     pass
>>> # Functional API
>>> dp_s3_urls = s3_prefixes.list_files_by_s3(request_timeout_ms=100)
>>> for d in dp_s3_urls:
...     pass
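Since the source DataPipe must contain valid S3 URLs, it can help to check prefixes before handing them to the lister. The following is a minimal sketch using only the standard library; `split_s3_prefix` is a hypothetical helper, not part of torchdata, and it only checks the URL shape (it does not verify that the bucket exists).

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_s3_prefix(url: str) -> Tuple[str, str]:
    """Split an S3 URL prefix into (bucket, key_prefix).

    Hypothetical helper for illustration; raises ValueError on malformed input.
    """
    parts = urlsplit(url)
    if parts.scheme != "s3":
        raise ValueError(f"expected an s3:// URL, got {url!r}")
    if not parts.netloc:
        raise ValueError(f"missing bucket name in {url!r}")
    # Drop the leading slash so a bare bucket and "bucket/" both give an empty prefix.
    return parts.netloc, parts.path.lstrip("/")


# The three prefix forms listed as acceptable in the description above.
examples = ["s3://bucket-name", "s3://bucket-name/", "s3://bucket-name/folder"]
parsed = [split_s3_prefix(u) for u in examples]
```

With these inputs, the first two forms resolve to the same bucket with an empty key prefix, and the third resolves to the key prefix `folder`.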