be4f3d8 -> HEAD @ 2024-05-13 14:57:35 -0400
- FAQ Generation: Introduced an automatic FAQ generation script that summarizes issues into a `FAQ.md` file using AI. This includes tasks for creating and summarizing FAQ entries.
- FluidInterface Class: Fixed argument passing in the `listed` method to correctly pass `batchsize` and `collation_fn` to the `filters.batched` function.
- Decoder Class: Added an assertion to ensure the `handlers` parameter is a list during initialization.
- FileCache Class: Ensured the file stream is yielded correctly after handling exceptions, and each cached file is returned only once.
- WebDataset: Added a warning in `compat.py` to notify users when `shardshuffle` is set to `None`, advising them to set it explicitly to `False` or a number.
- Tariterators: Added an EOF signal at the end of each tarfile to terminate the accumulation of samples, preventing mixing of samples from different shards.
- Shuffle Function: Corrected the random seed initialization in the `_shuffle` function to use the provided seed.
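The seed fix in the last entry can be illustrated with a minimal buffered-shuffle sketch. This is not the library's `_shuffle` code; the name `buffered_shuffle` and its parameters are hypothetical. It only shows the general idea that a generator seeded from the caller's value produces a reproducible order.

```python
import random


def buffered_shuffle(data, bufsize=4, seed=0):
    """Yield items from `data` in pseudo-random order using a bounded buffer.

    Hypothetical helper: the RNG is built from the caller's `seed`, so the
    same seed always gives the same output order.
    """
    rng = random.Random(seed)  # use the provided seed, not a global default
    buf = []
    for item in data:
        buf.append(item)
        if len(buf) >= bufsize:
            # Move a randomly chosen element to the end and emit it.
            k = rng.randrange(len(buf))
            buf[k], buf[-1] = buf[-1], buf[k]
            yield buf.pop()
    # Drain the remaining items once the input is exhausted.
    rng.shuffle(buf)
    yield from buf


first = list(buffered_shuffle(range(10), seed=42))
second = list(buffered_shuffle(range(10), seed=42))
```

Running the generator twice with the same seed gives identical sequences, which is the property the fix is meant to restore.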
f11fd66 -> be4f3d8 @ 2024-03-13 14:39:27 -0700
- Introduced an `empty_check` option to `WebDataset` and `ResampledShards` to handle cases where no shards are available, raising a `ValueError` if no samples are found.
- Modified `webdataset/shardlists.py` to fix the order of arguments in `ResampledShardlist`.
- Added a new test `test_check_empty_throws_ValueError` to ensure the `empty_check` functionality works as expected.
- Updated `test_fluid.py` to skip a test due to inaccessible remote data.
- Enhanced `DataPipeline` to stop looping if no samples are generated.
- Added multiple untracked files and renamed `COMMITS.md` to `VERSIONS`.
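The behaviour described for `empty_check` might look roughly like the following sketch. The wrapper name `checked_samples` is hypothetical and the actual option is wired into the library differently; the sketch only assumes "raise `ValueError` when iteration produced nothing".

```python
def checked_samples(source, empty_check=True):
    """Yield every sample from `source`; complain if there were none.

    Hypothetical wrapper, not the library implementation.
    """
    count = 0
    for sample in source:
        count += 1
        yield sample
    if empty_check and count == 0:
        raise ValueError("no samples found in dataset")


def raises_value_error(iterable):
    """Small helper for the usage example: True if iteration raises ValueError."""
    try:
        list(iterable)
    except ValueError:
        return True
    return False


empty_fails = raises_value_error(checked_samples([]))
unchecked = list(checked_samples([], empty_check=False))
```

Because the check runs after the source is exhausted, a non-empty stream passes through unchanged and only a completely empty one raises.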
457c4ce -> f11fd66 @ 2023-12-06 21:21:25 -0800
- Introduced new `train-resnet50-wids.py` example script for training ResNet50 with WIDS.
- Enhanced `tasks.py` with additional tasks for testing, notebook processing, and Docker builds.
- Added comprehensive test coverage for various modules including `wids_dl`, `wids_mmtar`, `wids_cleanup`, and more.
- Implemented `LRUCleanup` class for efficient cache management.
- Improved `FileCache` and `StreamingOpen` classes for better file handling and caching.
- Added `RandomShardDownloader` class for downloading shards randomly from a source directory.
- Enhanced `WebDataset` class with better URL handling and cache management.
- Introduced `ChunkedSampler` and `DistributedChunkedSampler` for efficient data sampling in distributed environments.
- Added `keep_most_recent_files` function and `ExclusiveLock` class for file management and locking.
- Improved `download_file` and `download_and_open` functions for better file downloading and handling.
- Enhanced `MMIndexedTar` class with better memory-mapped file handling and cleanup callbacks.
- Updated `ShardListDataset` and `LRUShards` classes for better shard management and caching.
- Added new utility functions for handling file patterns and deprecation warnings.
404b538 -> 457c4ce @ 2023-12-06 00:09:39 -0800
- Introduced a new `cleanup` task in `tasks.py` that runs `autoflake`, `isort`, and `black` for code formatting and cleanup.
- Added type annotations to various test functions in `tests/test_wids.py` to improve code clarity and type checking.
- Enhanced `ShardListDataset` and `ShardedSampler` classes in `wids/wids.py` with type annotations and additional docstrings for better code documentation and readability.
- Improved the `compute_file_md5sum` function in `wids/wids.py` to handle both filenames and file objects, with added examples in the docstring.
- Refactored several functions to use `yield from` for more concise and efficient code.
- Added new test cases for `ShardedSampler` in `tests/test_wids.py` to ensure proper functionality and uniqueness of indexes.
- Updated `resolve_dsdesc` and `load_dsdesc_and_resolve` functions in `wids/wids_specs.py` to handle `None` options more gracefully.
- Improved error handling and logging in various modules, including `webdataset/filters.py` and `webdataset/gopen.py`.
- Enhanced the `TarWriter` class in `webdataset/writer.py` with better handling of file objects and compression options.
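A checksum helper that accepts either a filename or a file object, as described for `compute_file_md5sum`, could be sketched as follows. The function name `file_md5sum` and the `chunksize` parameter are assumptions for illustration, not the library's signature.

```python
import hashlib
import io


def file_md5sum(source, chunksize=1 << 20):
    """Return the hex MD5 digest of a file.

    `source` may be a path (str) or a binary file object. Hypothetical
    helper written for this example.
    """
    if isinstance(source, str):
        with open(source, "rb") as stream:
            return file_md5sum(stream, chunksize)
    md5 = hashlib.md5()
    while True:
        chunk = source.read(chunksize)
        if not chunk:
            break
        md5.update(chunk)
    return md5.hexdigest()


digest = file_md5sum(io.BytesIO(b"hello"))
```

Reading in fixed-size chunks keeps memory use bounded even for large shards.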
065208d -> 404b538 @ 2023-12-05 16:28:26 -0800
- Bug Fixes and Improvements:
  - Fixed bugs in local name defaulting and other areas.
  - Enhanced `ShardListDataset` to handle cache directories and local names more flexibly.
  - Improved `hash_localname` and `default_localname` functions to better manage shard URLs.
  - Added `cache_localname` function for better cache management.
  - Updated tests in `test_wids.py` to reflect changes in shard handling and caching.
  - Ensured `rebase_shardlist` and `resolve_dsdesc` functions handle shard lists correctly.
  - Improved error handling and assertions in `wids_specs.py`.
81bdb5c -> 065208d @ 2023-12-05 02:28:06 -0800
- Refactored: Improved URL handling and caching mechanisms in `wids.py` by introducing `hash_localname` and enhancing `default_localname` to use URL-safe quoting.
- Enhanced: Added support for dataset name hashing and base64 encoding for cache file names.
- Updated: `ShardListDataset` class to handle dataset descriptors with nested datasets and improved shard URL resolution.
- Improved: `wids_index.py` to provide more detailed dataset information, including nested datasets.
- Refined: `wids_specs.py` to better handle remote dataset descriptions, rebase shard URLs, and resolve nested dataset references.
- Fixed: Various issues related to shard indexing and cache management.
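The two naming strategies mentioned above (URL-safe quoting, and hashing plus base64 encoding) can be sketched with the standard library. The function names, the choice of SHA-256, and the 16-character truncation are assumptions made for this example; the actual `hash_localname` and `default_localname` may differ.

```python
import base64
import hashlib
import urllib.parse


def quoted_localname(url):
    """Hypothetical: map a URL to a single path-safe, reversible file name."""
    return urllib.parse.quote(url, safe="")


def hashed_localname(url, nchars=16):
    """Hypothetical: map a URL to a short, fixed-length cache file name."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")[:nchars]


url = "http://example.com/data/shard 0001.tar"
quoted = quoted_localname(url)
hashed = hashed_localname(url)
```

Quoting keeps the name human-readable and reversible, while hashing gives short names of bounded length at the cost of readability.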
efb4a1e -> 81bdb5c @ 2023-11-22 08:43:16 -0800
- Refactored Directory Structure:
  - Moved `webdataset/tests` to `tests`.
  - Moved `webdataset/wids` to `wids`.
- New Features:
  - Added `widsindex` command line program.
  - Introduced `pipx` task in `tasks.py` for installing packages with `pipx`.
  - Added Docker support for local testing with `dockerlocal` task.
- Enhancements:
  - Updated `setup.py` to reflect new package structure.
  - Improved `ShardListDataset` to support cache directories and additional options.
  - Enhanced `wids_dl` to support file copying and improved URL handling.
  - Added `AtomicJsonUpdate` for safe JSON file updates.
  - Improved shard list extraction and URL merging in `wids_specs`.
- Bug Fixes:
  - Fixed MD5 sum mismatch assertion in `IndexedTarSamples`.
  - Corrected cache miss rate warning in `ShardListDataset`.
  - Ensured proper handling of transformations in `ShardListDataset`.
- Miscellaneous:
  - Updated Docker base image to `ubuntu:22.04`.
  - Improved handling of remote and local dataset descriptions.
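One common way to make JSON file updates safe is to write to a temporary file and rename it over the target. The sketch below shows that pattern; it is not the `AtomicJsonUpdate` class itself, and the helper name `atomic_json_update` is hypothetical. It does not lock the file, so concurrent read-modify-write cycles can still lose updates.

```python
import json
import os
import tempfile


def atomic_json_update(path, update):
    """Load JSON from `path`, apply `update(data)` in place, write atomically.

    Hypothetical helper: readers never observe a half-written file because
    the new content replaces the old one in a single rename.
    """
    try:
        with open(path) as stream:
            data = json.load(stream)
    except FileNotFoundError:
        data = {}
    update(data)
    dirname = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename stays
    # on one filesystem.
    fd, tmpname = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    with os.fdopen(fd, "w") as stream:
        json.dump(data, stream)
    os.replace(tmpname, path)
    return data


with tempfile.TemporaryDirectory() as tmpdir:
    target = os.path.join(tmpdir, "index.json")
    atomic_json_update(target, lambda d: d.update(count=1))
    result = atomic_json_update(target, lambda d: d.update(count=d["count"] + 1))
```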
8e52bca -> efb4a1e @ 2023-11-14 14:13:38 -0800
- Made debugging output in `wids` optional by adding checks for the `WIDS_VERBOSE` environment variable.
- Updated `group_by_key` function in `wids.py` to print a warning message for ignored files.
- Enhanced `LRUShards` and `ShardListDataset` classes to conditionally print debugging information based on the `WIDS_VERBOSE` environment variable.
- Minor adjustments in `setup.py` and `__init__.py` to reflect the latest changes.
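Gating debug output on an environment variable can be sketched as below. Only the variable name `WIDS_VERBOSE` comes from the entries above; the helper names and the rule "any non-zero integer enables output" are assumptions for illustration.

```python
import os
import sys


def verbose_enabled(environ=None):
    """Return True when WIDS_VERBOSE is set to a non-zero integer.

    The parsing rule is an assumption made for this sketch.
    """
    environ = os.environ if environ is None else environ
    return int(environ.get("WIDS_VERBOSE", 0)) != 0


def debug(*args, environ=None):
    """Print a debugging message to stderr only when verbosity is enabled."""
    if verbose_enabled(environ):
        print(*args, file=sys.stderr)


debug("opening shard", "shard-0000.tar", environ={"WIDS_VERBOSE": "1"})
```

Passing the environment mapping explicitly keeps the helper easy to test without modifying the real process environment.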
09db99d -> 8e52bca @ 2023-11-14 14:12:56 -0800
- Added `testdata/testgz.tar` binary file.
- Modified `setup.py` and `webdataset/__init__.py` to reflect new version.
- Adjusted file paths and imports in `webdataset/__init__.py`.
3959abb -> 09db99d @ 2023-11-09 12:31:00 -0800
- Added decompression support for `.gz` files in `webdataset/wids/wids.py`.
- Introduced a new test class `TestGz` in `webdataset/tests/test_wids.py` to verify the decompression functionality.
- Updated the `default_decoder` function to handle `.gz` file extensions, decompressing them and processing the underlying content.
- Modified `webdataset/__init__.py` to reflect the new changes.
d46f93a -> 3959abb @ 2023-11-08 10:26:20 -0800
- `TarWriter` now automatically compresses files ending in `.gz`.
- Added a new test `test_writer_gz` in `webdataset/tests/test_writer.py` to verify the automatic compression feature.
- Updated `encode_based_on_extension1` in `webdataset/writer.py` to handle `.gz` file compression using the `gzip` module.
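Extension-driven compression with the `gzip` module can be sketched as follows. The function `encode_by_extension` is a hypothetical stand-in and does not reproduce the writer's real encoding logic; it only shows the "compress when the name ends in `.gz`" rule.

```python
import gzip


def encode_by_extension(name, data):
    """Return `data` gzip-compressed if `name` ends in .gz, else unchanged.

    Hypothetical helper for illustration only.
    """
    if name.endswith(".gz"):
        return gzip.compress(data)
    return data


payload = b"hello world " * 10
compressed = encode_by_extension("sample.txt.gz", payload)
plain = encode_by_extension("sample.txt", payload)
```

A reader can recover the original bytes with `gzip.decompress(compressed)`.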
039f70f -> d46f93a @ 2023-11-01 12:18:20 -0700
- Introduced a context manager to the `WebDataset` class, allowing it to be used with `with` statements for automatic resource management.
- Added a `close` method to the `DataPipeline` class to ensure proper cleanup of pipeline stages.
- Updated tests to include a context manager usage example for `WebDataset`.
- Enhanced `webdataset/compat.py` and `webdataset/pipeline.py` to support the new context manager functionality.
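The combination of a `close` method and context-manager support follows a standard Python pattern, sketched here with a hypothetical `ManagedPipeline` class. It is not the library's `DataPipeline`; it assumes that closing a pipeline means closing every stage that has a `close` method.

```python
import io


class ManagedPipeline:
    """Hypothetical stand-in for a pipeline whose stages may hold resources."""

    def __init__(self, *stages):
        self.stages = list(stages)
        self.closed = False

    def close(self):
        # Close every stage that exposes a close() method.
        for stage in self.stages:
            closer = getattr(stage, "close", None)
            if callable(closer):
                closer()
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # never swallow exceptions from the with-body


stream = io.StringIO("data")
with ManagedPipeline(stream) as pipeline:
    content = stream.read()
```

After the `with` block both the pipeline and its underlying stream are closed, even if the body raised an exception.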
e7507c9 -> 039f70f @ 2023-10-31 11:02:27 -0700
- Test Cases: Added new test cases in `webdataset/tests/test_wids.py` to cover specifications parsing and validation.
- Functionality: Moved shard list loading and validation functions to a new file `webdataset/wids/wids_specs.py` for better modularity.
- Code Refactoring: Refactored `ShardListDataset` class in `webdataset/wids/wids.py` to use the new `load_remote_shardlist` function from `wids_specs`.
- Bug Fixes: Fixed issues related to shard list extraction and validation, ensuring proper handling of nested datasets and remote sources.
3c40a3e -> e7507c9 @ 2023-10-30 10:55:47 -0700
- Fixed the indexing script to handle file input from stdin and expanded brace expressions in filenames.
- Improved `ShardListDataset` class to correctly load and print shard lists, and return the dataset object when adding transformations.
- Enhanced `load_remote_shardlist` function to handle string inputs for file paths.
- Added JSON import to `wids.py` for better handling of dataset descriptions.
- Updated `wids_index.py` to include `wids_version` in the result dictionary and conditionally add the dataset name.
a02f440 -> 3c40a3e @ 2023-10-30 10:15:54 -0700
- Improved handling of default decoders in `webdataset/wids/wids.py`:
  - Introduced `functools.partial` to streamline decoder format selection.
  - Added error handling for unknown formats in the `default_decoder` function.
  - Updated `ShardListDataset` to support `PIL` and `numpy` transformations directly.
- Modified `setup.py` and `webdataset/__init__.py` to reflect the new changes.
- Enhanced image decoding logic to raise errors for unknown formats and ensure proper format handling.
4bb1a6b -> a02f440 @ 2023-10-28 20:09:23 -0700
- Removed a debugging statement from `webdataset/wids/wids_mmtar.py`.
- Updated `setup.py` and `webdataset/__init__.py` to reflect the latest changes.
- Minor adjustments in `VERSION` and `setup.py` files.
- Overall, the changes include 3 insertions and 4 deletions across 4 files.
5dc6332 -> 4bb1a6b @ 2023-10-25 13:04:31 -0700
- Introduced `wids_mmtar.py` and `wids_tar.py` to factor out and enhance tar file handling with memory-mapped tar file support.
- Updated `tasks.py` to remove unnecessary git status check in `newversion` function.
- Modified `webdataset/wids/wids.py` to integrate new tar file handling classes and improve tar file indexing and reading.
- Changed `webdataset/autodecode.py` to ensure compatibility with `torch` tensor conversion.
- Updated `webdataset/tests/test_decode.py` to use `imageio.v3.imread` for image reading.
- Enhanced `ShardListDataset` and `ShardedSampler` classes in `webdataset/wids/wids.py` for better shard handling and sampling.
c0e388d -> 5dc6332 @ 2023-10-20 13:25:46 -0700
- Introduced a `default_decoder` function to handle common file extensions in `webdataset`.
- Added index file caching in `TarFileReader` to improve performance.
- Enhanced `IndexedTarSamples` and `ShardListDataset` to support index files and transformations.
- Fixed bugs in `wids` and improved the handling of tar file indexing.
- Updated `tasks.py` to automate version tagging and pushing to GitHub.
- Added a new notebook `wids_mnist.ipynb` for demonstration purposes.
977ee91 -> c0e388d @ 2023-10-18 11:54:21 -0700
- Enhancements:
  - Improved the docstring and argument list of all curried functions in `webdataset/filters.py`.
  - Added `functools.update_wrapper` to the `pipelinefilter` decorator for better function wrapping.
- Bug Fixes:
  - Fixed issues in `setup.py` and `webdataset/__init__.py` to ensure proper package inclusion and functionality.
fe15d64 -> 977ee91 @ 2023-10-17 22:48:40 -0700
- Updated `setup.py` to reflect changes in the package.
- Modified `webdataset/__init__.py` to ensure consistency with the new version.
b7be4da -> fe15d64 @ 2023-10-17 21:26:27 -0700
- Updated the `publish` action in `.github/workflows/pypi.yml`.
- Enhanced `webdataset/wids/wids.py` with additional docstrings for better code documentation and understanding.
- Introduced new methods in `ShardListDataset` and `ShardedSampler` classes to improve functionality and provide detailed descriptions of their purposes.
- Added a new function `check_shards` to validate the structure of shard lists.
- Improved the `ShardListDataset` class with methods for cache management, shard retrieval, and sample access, ensuring efficient data handling and locality preservation.
b092eb6 -> b7be4da @ 2023-09-21 16:17:57 -0700
- Merged the `main` branch from the remote repository.
- Refactored the `wids` module, including moving several files to a new `wids` directory.
- Added functionality for reading JSON `wids` in `webdataset`.
- Improved the `tasks.py` script with better formatting and additional functionality.
- Enhanced the `FluidInterface` class in `webdataset/compat.py` with better handling of batch processing and decoding.
- Updated various test files to reflect changes in the `wids` module structure.
- Introduced new functions for loading remote shard lists and extracting shard lists from dataset descriptions in `webdataset/wids/wids.py`.
- Added a new `__init__.py` file in the `wids` directory to facilitate module imports.
- Improved the `wids_bench.py` script for better handling of dataset descriptions and command-line arguments.
- Updated the `wids_index.py` script to enhance shard indexing functionality.
abc1a5d -> b092eb6 @ 2023-09-21 15:54:07 -0700
- ShardWriter: Added a `verbose` parameter to the `ShardWriter` class initializer, allowing users to set the verbosity level. The default value is set to `1`. This change provides more control over the logging output during the shard writing process.
e54effd -> abc1a5d @ 2023-09-12 16:23:51 -0700
- Improved `pipe_cleaner` function in `webdataset/cache.py` to handle `hdfs` URLs.
- Made `DataPipeline.compose` method in `webdataset/pipeline.py` non-destructive by copying the pipeline stages.
- Updated `ShardWriter` in `webdataset/writer.py` to use `TarWriter` directly with the filename, ensuring proper handling of tar files.
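A non-destructive `compose` can be sketched as below: the method builds a new object from a copy of the stage list instead of appending to the existing one. The class `StagePipeline` is hypothetical and much simpler than the real `DataPipeline`.

```python
class StagePipeline:
    """Hypothetical pipeline: a list of callables applied in order."""

    def __init__(self, *stages):
        self.stages = list(stages)

    def compose(self, *more):
        # Build a new pipeline from a copy of the stages, leaving `self`
        # untouched so it can be reused or composed again.
        return StagePipeline(*self.stages, *more)

    def __call__(self, value):
        for stage in self.stages:
            value = stage(value)
        return value


base = StagePipeline(lambda x: x + 1)
extended = base.compose(lambda x: x * 10)
```

Here `base(3)` still returns 4 after composing, while `extended(3)` returns 40.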
faa774e -> e54effd @ 2023-06-11 21:08:54 -0700
- Introduced new functionality for handling indexed web datasets, including classes like `IndexedTarSamples`, `LRUShards`, and `ShardListDataset`.
- Added comprehensive unit tests for the new classes and functionalities, ensuring robust testing coverage.
- Implemented a `ConcurrentDownloader` class to manage concurrent downloads across multiple processes, ensuring only a single download per file.
- Added utility functions for computing MD5 checksums and sample counts in tar files.
- Introduced a `wids_index.py` script for creating shard indices for datasets.
- Enhanced exception handling in `tariterators.py` to provide more informative error messages.
- Added new test files for `wids`, `wids_dl`, and `wids_lru` to validate the new functionalities.
- Included a benchmarking script `wids_bench.py` for performance testing of the new dataset handling mechanisms.
e4c30ef -> faa774e @ 2023-06-11 09:47:53 -0700
- Added a missing `rename_files` argument to the `cached_tarfile_samples` function in `webdataset/cache.py`.
- Updated the `tar_file_expander` call within `cached_tarfile_samples` to include the new `rename_files` argument.
- Minor adjustments to `setup.py` and `webdataset/__init__.py` to reflect these changes.
039d743 -> e4c30ef @ 2023-03-20 22:08:02 -0700
- Introduced a new `rename_files` argument to the `WebDataset` class, allowing for file renaming during dataset processing.
- Simplified the `FluidInterface` methods by consolidating multi-line function definitions into single lines.
- Enhanced the `WebDataset` initialization to handle YAML file inputs more efficiently.
- Improved the `tarfile_to_samples` function to support the new `rename_files` argument, providing more flexibility in dataset handling.
352089f -> 039d743 @ 2023-03-18 16:40:58 -0700
- Enhancements to Tests:
  - Added detailed docstrings to test functions in `webdataset/tests/test_fluid.py` for better understanding and documentation.
  - Improved error handling and assertions in various test cases.
  - Introduced new test cases to cover additional scenarios and edge cases.
  - Updated `test_loaders.py` to include decoding steps in data pipelines and loaders.
  - Enhanced `test_webloader_repeat` and `test_webloader_unbatched` to include decoding steps.
- Bug Fixes:
  - Fixed issues in `test_fluid.py` related to dataset length and sample counting.
  - Corrected the `test_webloader` function to ensure proper sample counting and batching.
- Code Cleanup:
  - Removed obsolete and untested code sections, marked with `@pytest.mark.skip`.
  - Refactored import statements in `test_loaders.py` for better readability and maintainability.
d05d8ff -> 352089f @ 2023-03-14 16:46:58 -0700
- Library and Code Cleanup: Refactored and cleaned up various libraries and code files, including `webdataset`, `autodecode`, `cache`, `cborsiterators`, `compat`, `extradatasets`, `filters`, `gopen`, `mix`, `pipeline`, `pytorch`, `shardlists`, `tariterators`, `tests`, `utils`, and `writer`.
- New Features:
  - Added new tasks in `tasks.py` for `black` and `autoflake` to format and clean up code.
  - Introduced new tests for cache, decode, handlers, loaders, mix, pipeline, shuffles, and writer functionalities.
  - Implemented file selection and renaming capabilities in `tar_file_iterator` and `tar_file_expander`.
  - Enhanced `torch_loads` function with type annotations and detailed docstrings.
- Bug Fixes and Improvements:
  - Fixed various issues related to imports, unused variables, and exception handling.
  - Improved the handling of tar file samples and grouping by keys.
  - Enhanced the `test_pipeline.py` and `test_fluid.py` with additional test cases and better structure.
  - Updated `webdataset` to support new decoding and caching mechanisms.
- Testing Enhancements:
  - Added comprehensive tests for new features and existing functionalities to ensure robustness and reliability.
  - Improved test coverage for various modules, including `cache`, `decode`, `handlers`, `loaders`, `mix`, `pipeline`, `shuffles`, and `writer`.
- Documentation:
  - Updated docstrings and comments across multiple files to provide better clarity and understanding of the codebase.
dfa3895 -> d05d8ff @ 2023-03-14 12:29:47 -0700
- Setup.py: Minor adjustments to dependencies and configurations.
- Webdataset Cache: Added `time` module import and improved the `get_filetype` function to suppress output from the `file` command.
- Webdataset Init: Updated imports and module references for consistency and functionality.
f2b64d8 -> dfa3895 @ 2023-03-09 12:01:27 -0800
- Environment Variable Substitution: Added functionality to substitute environment variables in URLs using the `WDS_` prefix in `webdataset/shardlists.py`.
- Cache Directory Validation: Implemented validation for cache directory existence in `webdataset/compat.py`.
- Tests: Updated tests to reflect changes in environment variable handling in `webdataset/tests/test_pipeline.py`.
fa40da2 -> f2b64d8 @ 2023-03-03 17:21:19 -0800
- Implemented `expandvars` for URLs in `SimpleShardList` to support environment variable expansion.
- Fixed a redundant assignment in `ImageHandler` class in `autodecode.py`.
- Added a test case for `expandvars` in `test_pipeline.py`.
- Refactored imports in `shardlists.py` for better readability.
- Improved error handling and validation in `writer.py` for image encoding and sample writing.
- Enhanced `make_handlers` and `encode_based_on_extension` functions for better encoding of data samples.
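Environment variable expansion in a URL can be demonstrated with the standard library's `os.path.expandvars`. The variable name `SHARD_ROOT` and the path below are made up for this example, and the sketch does not show how `SimpleShardList` applies the expansion internally.

```python
import os

# Hypothetical variable, set here only for the demonstration.
os.environ["SHARD_ROOT"] = "/data/shards"

template = "${SHARD_ROOT}/train-000000.tar"
url = os.path.expandvars(template)
```

This lets the same shard specification be reused across machines that store data in different locations.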
cb1aa32 -> fa40da2 @ 2023-02-01 10:16:51 -0800
- Merged multiple branches and pull requests to address various issues and improvements.
- Added `-f` flag to `curl` commands in `gopen.py` to handle failing file opens more gracefully.
- Enhanced `lru_cleanup` function in `cache.py` to handle `OSError` and `FileNotFoundError` exceptions, ensuring robust file deletion.
- Fixed `round_robin_longest` function in `mix.py` and added corresponding tests to ensure correct functionality.
- Improved error handling in `gopen.py` by raising `IOError` instead of a generic `Exception`.
- Added tests for handling missing files in `test_pipeline.py` to ensure proper exception raising.
9bc1eb5 -> cb1aa32 @ 2022-11-29 14:31:29 -0800
- Cleaned up documentation and added `with_length` documentation.
- Fixed filename guessing and forward pipe-to-file user functions.
- Added a license file to the install process.
- Added a check for the `file` command.
- Added `invoke` to requirements and checked `readme.ipynb` for runnability.
- Improved version handling and added an option to make the `mtime` fixed for reproducibility.
- Merged changes from `ShardList` to `SimpleShardList` in `README.md`.
- Deleted `wordtrain.py` and other unsupported notebooks.
- Added `url_to_name` parameter in `WebDataset` class.
- Enhanced `group_by_keys` function to handle exceptions and added `mtime` parameter to `TarWriter` for reproducible tar files.
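Why a fixed `mtime` makes tar files reproducible can be shown with the standard `tarfile` module: when every header field is deterministic, writing the same samples twice yields byte-identical archives. The helper `write_tar` is a hypothetical illustration and not the `TarWriter` API.

```python
import io
import tarfile


def write_tar(samples, mtime=0):
    """Write a {name: bytes} mapping to an in-memory tar and return its bytes.

    Hypothetical helper: every member gets the same fixed `mtime`, so the
    output depends only on the names and contents.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in sorted(samples.items()):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = mtime  # fixed timestamp instead of the current time
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


samples = {"000.txt": b"hello", "000.cls": b"1"}
archive_a = write_tar(samples)
archive_b = write_tar(samples)
```

If the timestamp came from the wall clock instead, two runs a second apart would produce different bytes and different checksums.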
8458543 -> 9bc1eb5 @ 2022-11-04 12:49:07 -0700
- Enhanced `TarWriter`: Added a new format option to `TarWriter` and changed the default format to `USTAR_FORMAT`.
- Image Handling: Expanded the list of supported image extensions in `autodecode.py` to include a comprehensive set of formats supported by Pillow.
- Error Handling: Improved error handling in `Decoder` to print a message when UTF-8 decoding fails.
- Function Updates: Updated `tar_file_iterator` to pass the handler parameter, and modified `imageencoder` to support TIFF format.
- Bug Fixes: Fixed test errors in `test_gopen` and addressed minor issues in the README.
6864382 -> 8458543 @ 2022-09-16 21:20:12 -0700
- Fixed Import Errors: Added missing `import os` statements in `webdataset/multi.py` and `webdataset/gopen.py` to resolve import errors.
- Enhanced `gopen` Functionality: Improved `gopen_file` to handle `file:` URLs correctly and updated `gopen_curl` to use the correct `curl` command for PUT requests.
- Updated Tests: Modified tests in `webdataset/tests/test_gopen.py` to use the updated `gopen` function.
- Improved Version Handling: Enhanced version handling in `tasks.py` to include running tests before committing changes.
- Minor Fixes: Made small changes to `gopen` and other files to improve functionality and reliability.
9f9b0e3 -> 6864382 @ 2022-09-14 15:56:36 -0700
- Improved image handling in `webdataset/autodecode.py` by adding support for different image modes (`L`, `RGB`, `RGBA`) and ensuring proper conversion between modes.
- Enhanced numpy and torch array handling for images, including proper dtype conversion and shape assertions.
- Added comprehensive tests in `webdataset/tests/test_pipeline.py` to validate the new image decoding functionality, ensuring compatibility with various image specifications and formats.
- Updated dependencies in `setup.py` to ensure compatibility with the new features and improvements.
bcbb408 -> 9f9b0e3 @ 2022-09-14 15:30:12 -0700
- Fixed a typo in `webdataset/__init__.py` by correcting `__vesion__` to `__version__`.
- Corrected an environment variable typo in `webdataset/cache.py` from `os.envrion` to `os.environ`.
- Added functionality in `tasks.py` to write the updated version to `webdataset/__init__.py` during the `newversion` task.
- Implemented a bug fix in the codebase.
89a905f -> bcbb408 @ 2022-08-30 20:56:51 -0700
- Introduced support for `GOPEN_VERBOSE` environment variable to control verbosity in `webdataset/cache.py`.
- Enhanced cache functionality to respect the `GOPEN_VERBOSE` setting.
- Minor adjustments in `setup.py` and `webdataset/cache.py` to improve functionality and maintain consistency.
85c6524 -> 89a905f @ 2022-08-21 21:33:52 -0700
- Improved cache handling by adding `maybe_cached_tarfile_to_samples` function.
- Switched from using `hub` to `gh` for release creation in `tasks.py`.
- Enhanced `FluidInterface` class with better handling for `batched` and `decode` methods.
- Updated `WebDataset` class to handle cache size and directory more effectively.
- Refined various functions in `filters.py` for better error handling and code clarity.
- Enhanced `LMDBCached` class to ensure proper handling of cached samples.
- Improved `MultiShardSample` and `ResampledShards` classes for better shard handling and error reporting.
- Added new tests in `test_pipeline.py` to validate `LMDBCached` functionality.
a562182 -> 85c6524 @ 2022-03-25 10:42:57 -0700
- Added new caching mechanisms with `Cached` and `LMDBCached` classes in `webdataset/filters.py` and integrated them into the pipeline.
- Introduced new functions `extract_keys`, `rename_keys`, and `xdecode` in `webdataset/filters.py` for enhanced data manipulation and decoding capabilities.
- Implemented `gopen_ais` function in `webdataset/gopen.py` to support AIS URL scheme and added environment variable handling for URL rewriting.
- Enhanced `ResampledShards` class in `webdataset/shardlists.py` for better deterministic and non-deterministic shard sampling.
- Updated `tasks.py` to improve version incrementing logic and added better error handling.
- Added new tests in `webdataset/tests/test_pipeline.py` to cover the new caching mechanisms, key extraction, and renaming functionalities.
- Improved `PipelineStage` class in `webdataset/utils.py` with a new `make_seed` function for better seed generation.
0df0460 -> a562182 @ 2022-03-16 17:19:51 -0700
- Updated `setup.py`:
  - Changed the URL to `http://github.com/webdataset/webdataset`.
- Modified `tasks.py`:
  - Commented out the installation of specific versions of `torch` and `torchvision`.
  - Added a conditional check for a clean working tree before creating a GitHub release.
  - Removed `mkdocs.yml` from the list of required files.
- Enhanced `webdataset/cache.py`:
  - Increased the default cache size to `1e18`.
  - Replaced the walrus operator with a traditional `while` loop for compatibility.
- Updated `webdataset/shardlists.py`:
  - Changed the random seed source from `time.time_ns()` to `time.time()`.
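The walrus-operator replacement mentioned above is a generic compatibility rewrite, since `:=` requires Python 3.8 or later. The sketch below shows the equivalent `while` loop on a chunked read; the function `read_chunks` is hypothetical, and the loop that was actually changed in `cache.py` is not shown in this log.

```python
import io


def read_chunks(stream, chunksize=4):
    """Yield successive chunks from a binary stream until it is exhausted.

    With Python 3.8+ the loop could be written as:
        while (chunk := stream.read(chunksize)):
            yield chunk
    The form below behaves the same and also runs on older interpreters.
    """
    while True:
        chunk = stream.read(chunksize)
        if not chunk:
            break
        yield chunk


chunks = list(read_chunks(io.BytesIO(b"abcdefghij")))
```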
2eaa96e -> 0df0460 @ 2021-11-04 13:21:47 -0700
- Introduced a new `DataPipeline` class to streamline the creation and management of data processing pipelines.
- Added support for deterministic shuffling with the `detshuffle` function.
- Enhanced the `WebDataset` class with additional methods for handling epochs and repetitions.
- Introduced `cached_tarfile_to_samples` and `cached_url_opener` for efficient caching of tar files.
- Added new handlers for decoding various data formats, including `tenbin`, `msgpack`, `npy`, and `cbor`.
- Improved the `MultiShardSample` class to support more flexible shard specifications using YAML.
- Added new classes `RoundRobin` and `RandomMix` for mixing samples from multiple sources.
- Enhanced the `filters` module with new functions like `pipelinefilter`, `getfirst`, and `transform_with`.
- Improved error handling and logging capabilities across various modules.
- Added extensive test coverage for new features and functionalities.
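Round-robin mixing of several sources can be sketched as a small generator. This is an illustration of the general technique only: the function `round_robin` is hypothetical, and the policy of continuing until every source is exhausted is an assumption, not a description of the `RoundRobin` class.

```python
def round_robin(*sources):
    """Yield one item from each source in turn until all are exhausted.

    Hypothetical helper; exhausted sources are dropped and the rest continue.
    """
    iterators = [iter(source) for source in sources]
    while iterators:
        # Iterate over a copy so exhausted iterators can be removed safely.
        for iterator in list(iterators):
            try:
                yield next(iterator)
            except StopIteration:
                iterators.remove(iterator)


mixed = list(round_robin("ab", "xyz"))
```

With the inputs above, items alternate between the two sources and the longer one supplies the tail.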
36ebfbd -> 2eaa96e @ 2021-10-24 04:24:12 -0700
- Enhancements:
  - Added `slice` function to `filters.py` using `itertools.islice`.
  - Improved `decode` function in `iterators.py` to handle string-based decoding.
  - Added debug output to `resampled_` function in `shardlists.py`.
  - Added new tests for `slice` and `resampled` functions in `test_pipeline.py`.
- Bug Fixes:
  - Fixed import issue in `pipeline.py` by changing `torch.utils.data.IterableDataset` to local import.
  - Renamed nested functions in `writer.py` to avoid type errors.
  - Corrected function name from `tarfile_sampler` to `tarfile_to_samples` in `tariterators.py`.
- New Features:
  - Introduced `tarfile_to_samples` function in `tariterators.py`.
  - Added `tarfile_to_samples` import in `__init__.py`.
- Miscellaneous:
  - Added missing export in `dbcache.py` and `extradatasets.py`.
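A slicing filter built on `itertools.islice`, as mentioned under Enhancements, can be sketched like this. The wrapper name `slice_stream` is hypothetical and the real filter's signature may differ; `islice` itself accepts `stop` or `start, stop[, step]`.

```python
import itertools


def slice_stream(data, *args):
    """Lazily select items from an iterator.

    Hypothetical wrapper: `args` are passed straight to itertools.islice.
    """
    return itertools.islice(data, *args)


head = list(slice_stream(iter(range(100)), 3))
window = list(slice_stream(iter(range(100)), 10, 20, 5))
```

Because `islice` is lazy, taking the first few samples from a very large or infinite stream does not consume the rest of it.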
008bfd6 -> 36ebfbd @ 2021-10-15 02:41:28 -0700
- Introduced a new `DataPipeline` class in `webdataset/pipeline.py` to streamline the creation and management of data processing pipelines.
- Added new functions and classes such as `stage`, `split_by_node`, `split_by_worker`, `resampled`, and `non_empty` to enhance data handling capabilities.
- Updated `SimpleShardList` to support shuffling with a seed for reproducibility.
- Enhanced the `shuffle` function to improve data shuffling logic.
- Added comprehensive tests in `webdataset/tests/test_pipeline.py` to ensure the functionality of the new pipeline and data handling features.
- Improved the `WebDataset` function to include a `shardshuffle` parameter for better control over shard shuffling.
- Added `tarfile_samples` function to simplify the process of reading samples from tar files.
0fc4e54 -> 008bfd6 @ 2021-10-15 02:35:15 -0700
- Updated `setup.py` to reflect the latest changes in the project description and metadata.
dd6f67b -> 0fc4e54 @ 2021-10-15 02:32:16 -0700
- Removed extra output for waiting in `webdataset/gopen.py` by commenting out a debug print statement.
- Ensured that the `status` variable is checked and handled correctly without unnecessary verbose output.
cea6299 -> dd6f67b @ 2021-10-15 01:56:38 -0700
- Improved output messages in `webdataset/gopen.py` to include process status and IDs for better debugging.
- Fixed an issue in `webdataset/gopen.py` related to the status check and verbose output.
- Minor fixes in `VERSION` and `setup.py` files.
- Updated `webdataset/gopen.py` to enhance the handling of pipe exit status and verbose logging.
6f94aa5 -> cea6299 @ 2021-09-05 22:26:24 -0700
- Added `RoundRobin` class to `webdataset.dsspecs` with methods for adding datasets and string representation.
- Enhanced `Pipe` class in `webdataset.gopen` to handle subprocess status more robustly.
- Introduced `shuffle_rng` in `webdataset.iterators` for better random seed management.
- Updated `ResampledShards` class in `webdataset.shardlists` to include environment and random seed initialization.
- Modified `tar_file_iterator` in `webdataset.tariterators` to reset `stream.members` after yielding results.
- Added missing dependency to `setup.py`.
- Improved diagnostics for closing in `GOPEN` output.
- Added `__str__` method and comment to `RoundRobin` class.
- Fixed `StopIteration` issue and deprecated `Dataset` in `webdataset`.
- Added support for `npz` writing and decoding.
bba3fbe -> 6f94aa5 @ 2021-09-05 22:15:56 -0700
- Introduced a new test `test_dataset_resampled` in `webdataset/tests/test_dataset.py` to verify the functionality of resampled datasets.
- Modified the `WebDataset` function in `webdataset/dataset.py` to correctly handle `resampled` URLs by assigning `ResampledShards(urls)` to `result`.
17e429b -> bba3fbe @ 2021-09-05 21:12:55 -0700
- Introduced a `resampled` option to the `WebDataset` function in `webdataset/dataset.py`.
- Added support for shard resampling by importing `ResampledShards` from `shardlists`.
- Modified the `WebDataset` function to handle the `resampled` parameter, allowing for shard resampling when set to `True`.
- Updated the `urls` parameter handling to incorporate the new `ResampledShards` class when `resampled` is enabled.
f8430bc -> 17e429b @ 2021-08-19 12:55:48 -0700
- Fixed setup: Updated `setup.py` to include `pyyaml` in the `install_requires` list, ensuring that the necessary dependencies are installed.
9d823a1 -> f8430bc @ 2021-05-12 20:45:29 -0700
- Introduced new classes and functions for handling datasets, including `Composable`, `Shorthands`, `Processor`, `MockDataset`, `Repeatedly`, `DatasetTest`, `ChoppedDataset`, and `FakeLength`.
- Added support for YAML-based dataset specifications with `construct_dataset` and `MultiShardSample`.
- Enhanced shard handling with `PytorchShardList`, `SimpleShardList`, and `ResampledShards`.
- Improved error handling and logging with new handlers in `handlers.py`.
- Added new encoding functions for `numpy` arrays and `torch` tensors in `writer.py`.
- Implemented caching mechanisms for shards in `shardcache.py`.
- Updated `tariterators.py` to include additional metadata handling and improved error reporting.
- Enhanced the `WebDataset` and `WebLoader` classes with additional methods for dataset manipulation and transformation.
- Added comprehensive unit tests for new functionalities and dataset handling methods.
f69d879 -> 9d823a1 @ 2021-05-01 12:08:03 -0700
- New Features:
  - Introduced `MockDataset` for generating mock data.
  - Added `node_equalize` method for equalizing dataset length across nodes.
  - Implemented `.test` method for easy mock data and sample verification.
  - Added `DatasetTest` class for performing final checks on datasets and supporting mock tests.
  - Introduced `split_by_node` and `split_by_worker` functions for shard selection based on node and worker information.
- Enhancements:
  - Refactored `MultiLoader` to use datasets directly.
  - Improved length handling in batched datasets.
  - Enhanced error handling and warnings for shard and worker distribution.
- Bug Fixes:
  - Fixed issues with `WorkerEnvironment` fallback and group handling.
  - Corrected length calculations in batched datasets.
- Testing:
  - Updated test cases to remove fluid interface and use new dataset methods.
  - Added tests for `MockDataset`, `node_equalize`, and `.test` method.
16622f3 -> f69d879 @ 2021-04-29 21:28:11 -0700
- Enhancements:
  - Made `torch` optional by adding a mock implementation for `IterableDataset` and `DataLoader` in `webdataset/mock.py`.
  - Introduced `ZMQ`-based multi-loader in `webdataset/multi.py` for parallel data loading.
  - Updated `tasks.py` to streamline virtual environment setup and testing process.
- Bug Fixes:
  - Fixed missing module imports in `webdataset/dataset.py`, `webdataset/dbcache.py`, and `webdataset/fluid.py` by adding conditional imports for `torch`.
- Code Refactoring:
  - Refactored `tasks.py` to use the `venv` function within the `virtualenv` and `test` tasks for consistency.
9d85dbd -> 16622f3 @ 2021-04-21 00:32:50 -0700
- Removed `torch` from `setup.py` and `requirements.txt` to streamline dependencies.
- Updated `tasks.py` to remove the installation of Jupyter Lab extensions.
- Adjusted `requirements.dev.txt` to reflect the removal of `torch`.
6d2f2da -> 9d85dbd @ 2021-04-12 17:48:48 -0700
- Enhanced `ImageHandler`: Made image extensions configurable.
- Improved `SplitByNode`: Better handling of default group in distributed settings.
- Updated `Shorthands`: Added `collation_fn` argument to `batched` method and exposed `only` argument in `decode` shorthand.
- Refined `DBCache`: Added `source_` method and improved logging for database operations.
398cf67 -> 6d2f2da @ 2021-03-16 16:47:17 -0700
- Enhancements:
  - Introduced `SplitByNode` class for node-based URL splitting in distributed environments.
  - Added `only` parameter to `Decoder` class to filter specific keys during decoding.
  - Enhanced `ShardList` and `WebDataset` classes to support node splitting by default.
  - Improved verbose output in `gopen` to include additional node information.
  - Added missing `__len__` method to `Repeatedly` class for better compatibility.
- Bug Fixes:
  - Fixed issues in `nodesplitter` and `split_by_worker` functions to handle worker and node information correctly.
  - Corrected attribute handling in `Dataset` class to ensure proper delegation to the underlying dataset.
- Refactoring:
  - Consolidated imports and improved modularity in `webdataset/fluid.py`.
  - Streamlined `batched` function parameters for better readability and maintainability.
773c98d -> 398cf67 @ 2021-03-16 09:45:06 -0700
- Removed debug print statement from `webdataset/utils.py`.
- Added `WebLoader` export to `webdataset/__init__.py`.
- Fixed `setup.py` classifier to correctly list supported Python versions.
- Incremented version in `setup.py` to reflect changes.
291f016 -> 773c98d @ 2021-02-17 21:48:28 -0800
- Added a `WebLoader` wrapper for `DataLoader` to facilitate repeated loading of datasets.
- Introduced new test cases for `WebLoader` and `repeat` functionality.
- Enhanced the `repeatedly` function to support repeating based on epochs, batches, or samples.
- Fixed an issue with `torchvision.io.read_video` by adding a missing argument.
- Included `pillow` in `requirements.txt`.
- Added new utility functions and test cases in `webdataset/utils.py` and `webdataset/tests/test_utils.py`.
- Improved dataset handling with new methods in `webdataset/dataset.py`, including `source_`, `repeat`, and `WebLoader`.
- Minor bug fixes and enhancements in various modules.
10ab6df -> 291f016 @ 2021-02-12 00:13:35 -0800
- Improved Docker build process by adding tags and ensuring base container is built before tests.
- Updated `tasks.py` to handle Jupyter labextension installation more robustly.
- Changed Docker base image from `ubuntu:19.10` to `ubuntu:20.04`.
- Modified `ShardList` class in `webdataset/dataset.py` to use `shuffle=False` by default and updated the `shuffle` method to handle size less than 1.
- Enhanced Docker test scripts to clone the repository and copy test data for `pypi_test`.
- Refined `docker_build` function to accept a tag parameter and apply it during the build process.
d7321fc -> 10ab6df @ 2020-12-19 00:19:58 -0800
- Enhanced `tasks.py` to ensure virtual environment activation for Jupyter lab extensions and added a print statement for completion.
- Introduced a `slice` method in `Shorthands` class within `dataset.py` for slicing datasets.
- Modified `ShardList` class to accept a callable for shuffling URLs.
- Added a new test `test_slice` in `test_dataset.py` to verify dataset slicing functionality.
- Updated `utils.py` to include `itertools` and added a `repeatedly` function for iterating over DataLoader batches.
- Enhanced `ShardWriter` class in `writer.py` to support starting from a specified shard number.
759da05 -> d7321fc @ 2020-09-17 00:47:30 -0700
- Introduced a new `fluid` interface for constructing datasets, replacing the older `Dataset` class.
- Added support for database-based caching with the `DBCache` class.
- Enhanced the `autodecode` module with new handlers and improved the `Decoder` class.
- Refactored the `filters` module to use functions from the new `iterators` module.
- Introduced `ShardList`, `Processor`, and `WebDataset` classes for better dataset handling and processing.
- Added `shardcache` module for caching shards locally.
- Improved error handling and logging across various modules.
- Updated tests to reflect changes in dataset handling and processing.
14b2315 -> 759da05 @ 2020-09-08 22:35:40 -0700
- Enhancements:
  - Added support for decompression of individually compressed files using `gzfilter`.
  - Introduced `Continue` class for handling continued decoding.
  - Improved `decode` method in `Pipeline` to include pre and post handlers.
  - Added `imagehandler`, `torch_video`, and `torch_audio` functions for better handling of image, video, and audio data.
  - Introduced `MultiDataset` class as an experimental alternative to `DataLoader`.
- New Features:
  - Added a benchmarking script `bench.py` for performance testing.
  - Added new tests for compressed files and writer functionalities.
- Bug Fixes:
  - Fixed issues with `tenbin` format in `ShardWriter`.
  - Corrected handling of non-byte data in `TarWriter`.
- Miscellaneous:
  - Added small test/benchmarking script.
  - Updated `ytsamples-split` example and comments.
c30a2d6 -> 14b2315 @ 2020-08-18 08:31:11 -0700
- Introduced a new `Decoder` class to handle sample decoding using a list of handlers.
- Added `basichandlers` function for handling basic data types like text, JSON, and integers.
- Implemented `ImageHandler` class for decoding image data based on specified image specifications.
- Added support for decoding Torch video and audio files using `torchvideo` and `torchaudio` functions.
- Updated the `Pipeline.decode` method to accept multiple handlers and ensure backward compatibility with image decoding.
- Enhanced the `autodecode` module by removing the `default_handlers` dictionary and replacing it with more flexible handler functions.
- Improved error handling and decoding logic in the `decode_sample_based_on_extensions` function.
- Updated test cases to reflect changes in the decoding mechanism and ensure proper functionality.
9c58006 -> c30a2d6 @ 2020-06-13 22:26:15 -0700
- Added support for `.pth` files and various video and audio formats in `webdataset/autodecode.py` with new handlers for `torch` and `torchvision`.
- Introduced `TorchVideoLoader` and `TorchAudioLoader` classes for handling video and audio data.
- Enhanced `webdataset/writer.py` to include a `torch_save_object` function for saving `torch` objects.
- Fixed a bug in `webdataset/filters.py` related to combining `numpy` arrays.
- Added new tests in `webdataset/tests/test_writer.py` to verify the functionality of writing and reading `.pth` files and other data types.
2216314 -> 9c58006 @ 2020-06-11 22:41:54 -0700
- Fixed: Small fix in export functionality.
- Modified: `__all__` in `webdataset/gopen.py` to include `gopen_schemes`.
- Updated: Various documentation files (`README.ipynb`, `README.md`, `docs/index.md`, `docs/pydoc.md`) with significant changes.
- Added: New dependency in `requirements.txt`.
8d0d9fc -> 2216314 @ 2020-05-20 23:18:16 -0700
- New Features:
  - Introduced `MultiDataset` and `MultiDatasetIterator` classes for parallel data loading using multiple workers.
  - Added `SampleIterator` class for iterating over samples with a given processing pipeline.
  - Implemented `Pipeline` class for building fluid data processing pipelines.
  - Added `Curried` and `Curried2` helper classes for currying pipeline stages.
  - Introduced `unbatched` function to reverse the batching process.
- Enhancements:
  - Improved shard selection with `worker_urls` and `all_urls` functions.
  - Enhanced `gopen` function to support additional options and verbose output.
  - Refactored `Dataset` class to use new `Pipeline` and `SampleIterator` classes.
  - Updated `filters` module with curried versions of functions like `map_stream`, `info`, `shuffle`, `select`, `decode`, `map`, `rename`, `associate`, `map_dict`, `to_tuple`, `map_tuple`, `batched`, and `unbatched`.
- Bug Fixes:
  - Fixed issues with tensor handling in `autodecode.py` by ensuring proper array conversion and type casting.
  - Addressed potential memory issues by adding garbage collection triggers in `tardata`.
- Testing:
  - Added new test cases in `test_dataset.py` and `test_multi.py` to cover new functionalities and ensure robustness.
b31f90a -> 8d0d9fc @ 2020-05-20 23:06:51 -0700
- Added: `ResizedDataset` to the `webdataset/__init__.py` file.
- Fixed: Missing export in the `webdataset` module.
f8460e9 -> b31f90a @ 2020-05-19 18:21:10 -0700
- Introduced batching functionality in the `Dataset` class with a new `batched` method.
- Added `batch_tensors` and `samples_to_batch` functions in `webdataset/filters.py` to handle tensor and scalar batching.
- Implemented a `batched` function in `webdataset/filters.py` to create batches of a specified size.
- Added a new test `test_batched` in `webdataset/tests/test_dataset.py` to verify the batching functionality.
- Updated `webdataset/dataset.py` to include the new batching method in the data processing pipeline.
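
As a rough illustration of what "batches of a specified size" means, the generator below groups a stream of samples into lists. It is a minimal sketch, not the code in `webdataset/filters.py`: the name `batched_sketch` and the `partial` flag are assumptions, and no tensor collation is attempted.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched_sketch(
    samples: Iterable[T], batchsize: int, partial: bool = True
) -> Iterator[List[T]]:
    """Group consecutive samples into lists of length `batchsize`."""
    batch: List[T] = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batchsize:
            yield batch
            batch = []
    # Assumed behavior: emit the final short batch only when `partial` is True.
    if batch and partial:
        yield batch


batches = list(batched_sketch(range(7), batchsize=3))
```

With seven samples and a batch size of three, the last batch holds a single sample unless `partial=False` is passed.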
1ba422f -> f8460e9 @ 2020-05-11 21:56:56 -0700
- Renamed the class `ChoppedDataset` to `ResizedDataset` in `webdataset/dataset.py`.
- Updated the class docstring to reflect the new name.
- Modified the `__init__` method and other relevant parts of the class to use the new name.
- Added an alias `ChoppedDataset = ResizedDataset` for backward compatibility.
4aa6230 -> 1ba422f @ 2020-05-07 10:46:48 -0700
- Removed the deprecated `WebDataset` class and its associated test file `test_webdataset.py`.
- Updated references from `WebDataset` to `Dataset` in `webdataset/dataset.py` and `webdataset/tests/test_writer.py`.
- Simplified the `__all__` list in `webdataset/dataset.py` by removing `WebDataset`.
- Deleted the `webdataset/webdataset.py` file, which contained the deprecated `WebDataset` class.
- Reduced the overall codebase by 642 lines, focusing on removing outdated and redundant code.
6a3a17d -> 4aa6230 @ 2020-05-06 15:12:20 -0700
- Dataset Enhancements:
  - Introduced a random number generator (`rng`) to the `Dataset` class for improved shuffling.
  - Updated `shuffle` function in `filters.py` to accept a custom `rng` parameter.
  - Modified dataset tests to use OpenImages dataset instead of ImageNet.
  - Adjusted dataset tests to reflect changes in data source and structure.
- Test Suite Adjustments:
  - Updated test cases to align with the new dataset structure and data sources.
  - Commented out or removed redundant tests related to the old dataset format.
  - Ensured compatibility with the new dataset by modifying test parameters and expected outputs.
322bef4 -> 6a3a17d @ 2020-05-05 21:46:01 -0700
- Introduced the `GOPEN_BUFFER` environment variable to control the buffer size for file operations in `gopen`.
- Modified `gopen` function in `webdataset/gopen.py` to use the `GOPEN_BUFFER` environment variable for setting the buffer size when opening files.
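
A buffer size taken from an environment variable could be wired up roughly as follows. This is only a sketch of the pattern: the helper names `buffer_size_from_env` and `open_with_buffer` are hypothetical, and the fallback of 8192 bytes is an assumption, since the changelog does not state the default used by `gopen`.

```python
import os
from typing import IO, Mapping

ASSUMED_DEFAULT_BUFFER = 8192  # assumption; the real default is not given here


def buffer_size_from_env(env: Mapping[str, str] = os.environ) -> int:
    """Read GOPEN_BUFFER from the environment, falling back to the assumed default."""
    return int(env.get("GOPEN_BUFFER", ASSUMED_DEFAULT_BUFFER))


def open_with_buffer(path: str, mode: str = "rb") -> IO[bytes]:
    """Open a local file with the configured buffer size (hypothetical helper)."""
    return open(path, mode, buffering=buffer_size_from_env())
```

Setting `GOPEN_BUFFER=65536` in the shell before starting the process would then make `open_with_buffer` use a 64 KiB buffer.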
e8d4ee7 -> 322bef4 @ 2020-03-20 01:30:41 -0700
- Introduced `ChoppedDataset` class to handle datasets with custom length and nominal length, improving flexibility in dataset iteration and epoch boundaries.
- Enhanced `Dataset` class by moving length logic to the new `ChoppedDataset` class.
- Improved length handling for multi-worker datasets.
- Added tests for `ChoppedDataset` to ensure correct functionality with various dataset sizes and configurations.
- Minor fixes and improvements in dataset export functionality.
52d8e3e -> e8d4ee7 @ 2020-03-19 20:32:23 -0700
- Fixed a bug in `webdataset/filters.py` by modifying the `select` function to remove the `invert` parameter and correctly apply the `predicate` to each `sample`.
- Updated `setup.py` to reflect the changes.
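
The behavior described for `select`, applying a predicate to each sample and keeping those that pass, amounts to a filtering generator. The version below is a minimal sketch under that description, named `select_sketch` to make clear it is not copied from `webdataset/filters.py`. The sample dictionaries are made up for illustration.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def select_sketch(samples: Iterable[T], predicate: Callable[[T], bool]) -> Iterator[T]:
    """Yield only the samples for which `predicate(sample)` is true."""
    for sample in samples:
        if predicate(sample):
            yield sample


# Hypothetical samples: keep only those that carry a "cls" field.
samples = [{"__key__": "a", "cls": 1}, {"__key__": "b"}, {"__key__": "c", "cls": 0}]
kept = list(select_sketch(samples, lambda s: "cls" in s))
```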
147b2d9 -> 52d8e3e @ 2020-03-18 21:40:25 -0700
- Added a `select` method to the `Dataset` class for filtering samples based on a predicate.
- Enhanced the `shuffle` method in `Dataset` to accept additional keyword arguments.
- Modified `getfirst` function to raise an error by default if a key is missing, controlled by `missing_is_error` parameter.
- Introduced a `select` filter in `filters.py` to yield samples based on a predicate.
- Added tests to ensure exceptions are raised for missing fields in `to_tuple` and `rename` methods in `Dataset`.
- Updated `map_tuple` and `to_tuple` functions to handle missing fields more robustly.
e70675e -> 147b2d9 @ 2020-03-18 20:43:18 -0700
- Enhancements:
  - Updated `webdataset/gopen.py` to ignore additional curl status codes (23 and 26) for improved error handling during read and write operations.
bd82a85 -> e70675e @ 2020-03-17 16:30:55 -0700
- Refactored IO Module: Renamed `io` module to `gopen` to avoid name conflicts and updated all references accordingly.
- Documentation Enhancements: Improved documentation generation in `tasks.py` by adding conversion of IPython Notebooks to Markdown and generating help text for each command.
- New Functionality: Added `info` function in `filters.py` for logging sample information.
- Enhanced `gopen` Functionality: Introduced multiple handlers for different URL schemes (`pipe`, `http`, `https`, `sftp`, `ftps`, `scp`) in `gopen.py`.
- Test Updates: Renamed and updated tests to reflect changes in the `gopen` module.
- Deprecation Notice: Added a deprecation notice in `webdataset.py` indicating that the code will be removed soon and suggesting the use of `webdataset.Dataset` instead.
70f01ca -> bd82a85 @ 2020-03-16 23:50:05 -0700
- Renamed the `add_stage` method to `pipe` in the `Dataset` class to better reflect its functionality.
- Added a new script `run-jupyterlab` with 46 lines of code.
- Removed the entire `Dataset` class from `webdataset/webdataset.py`, resulting in a significant reduction of 221 lines.
- Minor adjustments in `setup.py` and `webdataset/dataset.py` to reflect the renaming of the method.
71556b1 -> 70f01ca @ 2020-03-13 00:56:50 -0700
- Introduced a PyPI publishing workflow in `.github/workflows/pypi.yml`.
- Renamed exception handling functions in `webdataset`:
  - `ignore_exception` to `ignore_and_continue`
  - `warn_exception` to `warn_and_continue`
  - `ignore_and_finish` to `ignore_and_stop`
- Updated exception handling references in `webdataset/__init__.py`, `webdataset/dataset.py`, `webdataset/tests/test_dataset.py`, and `webdataset/webdataset.py` to reflect the new function names.
77025f3 -> 71556b1 @ 2020-03-09 00:46:52 -0700
- Refactoring and Modularization:
  - Refactored the `WebDataset` class into a more modular `Dataset` class, allowing for more flexible pipeline stages.
  - Introduced new functions and classes in `filters.py` to handle common data transformations and decoding tasks.
  - Moved decoding logic to a separate `autodecode.py` file for better separation of concerns.
- Error Handling Enhancements:
  - Added various error handling strategies (`reraise_exception`, `ignore_exception`, `warn_exception`, `ignore_and_stop`, `warn_and_stop`) to improve robustness during data processing.
- Pipeline Enhancements:
  - Added methods to the `Dataset` class for adding pipeline stages (`add_stage`, `shuffle`, `decode`, `map`, `rename`, `map_dict`, `to_tuple`, `map_tuple`).
  - Improved the `tariterator` function to handle errors more gracefully and to support custom decoding.
- Testing Improvements:
  - Updated and expanded test cases to cover new functionalities and ensure robustness.
  - Added new test files `test_webdataset.py` and updated `test_dataset.py` to reflect changes in the dataset handling and pipeline processing.
- Dependency Updates:
  - Updated dependencies in `setup.py` to reflect changes in the codebase, ensuring compatibility and stability.