Synchronize between uploading file and generating index thread #4443
base: master
Conversation
Can you please remove the .pyc files? Also let's add them to .gitignore
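Acting on this review comment typically takes two steps: untrack the compiled files that are already in the index, then ignore them going forward. A minimal sketch (the exact paths are up to the PR author; this assumes the `.pyc` files are already committed):

```shell
# Remove already-tracked .pyc files from the index only (they stay on disk)
git rm -r --cached '*.pyc'

# Ignore compiled Python artifacts from now on
echo '*.pyc' >> .gitignore
echo '__pycache__/' >> .gitignore
```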
@@ -286,7 +286,7 @@ def _get_azure_sas_url(self, path, **kwargs):
    account_name=AZURE_BLOB_ACCOUNT_NAME,
    container_name=AZURE_BLOB_CONTAINER_NAME,
    account_key=AZURE_BLOB_ACCOUNT_KEY,
    expiry=datetime.datetime.now() + datetime.timedelta(hours=1),
Should we refactor this into a constant? Also, is 10 hours enough or might we need more for even larger files?
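The refactor the reviewer suggests could look like the sketch below. The constant name and the 10-hour value are illustrative, not from the PR; whether 10 hours suffices for very large uploads is exactly the open question in this comment.

```python
import datetime

# Hypothetical module-level constant for the SAS URL lifetime.
# Name and value are illustrative; the PR currently inlines the timedelta.
AZURE_SAS_URL_EXPIRY = datetime.timedelta(hours=10)

def sas_expiry(now=None):
    """Return the expiry timestamp for a newly generated SAS URL."""
    now = now or datetime.datetime.now()
    return now + AZURE_SAS_URL_EXPIRY
```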
self._current_max_buf_length = len(self._bufs[0])
for i in range(1, self.NUM_READERS):
    self._current_max_buf_length = max(self._current_max_buf_length, len(self._bufs[i]))
# print(f"find largest buffer: {self._current_max_buf_length} in thread: {threading.current_thread().name}")
remove comments
@@ -10,11 +10,16 @@ class MultiReaderFileStream(BytesIO):
    """
    NUM_READERS = 2

    # MAX memory usage <= MAX_BUF_SIZE + max(num_bytes called in read)
What does this comment mean? Can you add a description of what MAX_BUF_SIZE is used for?
@@ -10,11 +10,16 @@ class MultiReaderFileStream(BytesIO):
    """
    NUM_READERS = 2

    # MAX memory usage <= MAX_BUF_SIZE + max(num_bytes called in read)
    MAX_BUF_SIZE = 1024 * 1024 * 1024  # 10 MiB for test
What value should it be for non-test?
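Worth noting while answering this question: the diff's inline comment says "10 MiB for test", but `1024 * 1024 * 1024` is 1 GiB, not 10 MiB. A named pair of constants would make the intent unambiguous. The names below are hypothetical:

```python
# Illustrative constants; the PR's literal 1024 * 1024 * 1024 is 1 GiB,
# even though its comment says "10 MiB for test".
MiB = 1024 * 1024
MAX_BUF_SIZE_TEST = 10 * MiB     # a small cap, matching the comment's intent
MAX_BUF_SIZE_PROD = 1024 * MiB   # 1 GiB, matching the current literal
```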
    self.find_largest_buffer()

def find_largest_buffer(self):
    self._current_max_buf_length = len(self._bufs[0])
Add a docstring comment
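A docstring along these lines would address the comment; this is a sketch based only on the lines visible in the diff (the full class is not shown), and it also folds the manual loop into a single `max()` call:

```python
def find_largest_buffer(self):
    """Recompute self._current_max_buf_length as the length of the
    largest per-reader buffer.

    The writer can use this value to decide whether appending more
    data would push memory use past MAX_BUF_SIZE.
    """
    self._current_max_buf_length = max(len(buf) for buf in self._bufs)
```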
@@ -71,6 +71,7 @@ def test_not_found(self):

def check_file_target_contents(self, target):
    """Checks to make sure that the specified file has the contents 'hello world'."""
    # This can not be checked, Since
Update comment?
…ksheets into parallel-upload-sync
Reasons for making this change
#4370 does not synchronize the two threads (the file-upload thread and the index-generation thread), so it does not work for large files such as ImageNet (158 GB).
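A minimal sketch of the kind of synchronization this PR adds, assuming a bounded multi-reader buffer: the writer blocks while the slowest reader is more than `MAX_BUF_SIZE` bytes behind, using `threading.Condition`. Class and method names are illustrative, not the PR's exact code.

```python
import threading

class BoundedRelay:
    """One writer feeds the same byte stream to several readers;
    memory stays bounded because the writer waits for slow readers."""

    MAX_BUF_SIZE = 10 * 1024 * 1024  # 10 MiB cap on buffered-but-unread data

    def __init__(self, num_readers=2):
        self._bufs = [bytearray() for _ in range(num_readers)]
        self._cond = threading.Condition()

    def write(self, chunk: bytes):
        with self._cond:
            # Block until every reader's backlog can absorb this chunk.
            self._cond.wait_for(
                lambda: max(len(b) for b in self._bufs) + len(chunk)
                        <= self.MAX_BUF_SIZE
            )
            for buf in self._bufs:
                buf += chunk
            self._cond.notify_all()

    def read(self, reader: int, n: int) -> bytes:
        with self._cond:
            # Block until this reader has n bytes available.
            self._cond.wait_for(lambda: len(self._bufs[reader]) >= n)
            data = bytes(self._bufs[reader][:n])
            del self._bufs[reader][:n]
            self._cond.notify_all()  # wake the writer if it was waiting
            return data
```

With this shape, the upload thread calls `write()` and each index-generation reader calls `read()`; neither can run arbitrarily far ahead of the other.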
Related issues
Screenshots
Checklist