Synchronize between uploading file and generating index thread #4443
base: master
Conversation
Can you please remove the .pyc files? Also let's add them to .gitignore
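Acting on this review comment typically takes two steps: untrack the compiled files that are already in the index, then ignore them going forward. A minimal sketch (the exact paths are up to the PR author; this assumes the `.pyc` files are already committed):

```shell
# Remove already-tracked .pyc files from the index only (they stay on disk)
git rm -r --cached '*.pyc'

# Ignore compiled Python artifacts from now on
echo '*.pyc' >> .gitignore
echo '__pycache__/' >> .gitignore
```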
@@ -286,7 +286,7 @@ def _get_azure_sas_url(self, path, **kwargs):
    account_name=AZURE_BLOB_ACCOUNT_NAME,
    container_name=AZURE_BLOB_CONTAINER_NAME,
    account_key=AZURE_BLOB_ACCOUNT_KEY,
    expiry=datetime.datetime.now() + datetime.timedelta(hours=1),
Should we refactor this into a constant? Also, is 10 hours enough or might we need more for even larger files?
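The refactor the reviewer suggests could look like the sketch below. The constant name and the 10-hour value are illustrative, not from the PR; whether 10 hours suffices for very large uploads is exactly the open question in this comment.

```python
import datetime

# Hypothetical module-level constant for the SAS URL lifetime.
# Name and value are illustrative; the PR currently inlines the timedelta.
AZURE_SAS_URL_EXPIRY = datetime.timedelta(hours=10)

def sas_expiry(now=None):
    """Return the expiry timestamp for a newly generated SAS URL."""
    now = now or datetime.datetime.now()
    return now + AZURE_SAS_URL_EXPIRY
```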
self._current_max_buf_length = len(self._bufs[0])
for i in range(1, self.NUM_READERS):
    self._current_max_buf_length = max(self._current_max_buf_length, len(self._bufs[i]))
# print(f"find largest buffer: {self._current_max_buf_length} in thread: {threading.current_thread().name}")
remove comments
@@ -10,11 +10,16 @@ class MultiReaderFileStream(BytesIO):
    """
    NUM_READERS = 2

    # MAX memory usage <= MAX_BUF_SIZE + max(num_bytes called in read)
What does this comment mean? Can you add a description of what MAX_BUF_SIZE is used for?
@@ -10,11 +10,16 @@ class MultiReaderFileStream(BytesIO):
    """
    NUM_READERS = 2

    # MAX memory usage <= MAX_BUF_SIZE + max(num_bytes called in read)
    MAX_BUF_SIZE = 1024 * 1024 * 1024  # 10 MiB for test
What value should it be for non-test?
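Worth noting while answering this question: the diff's inline comment says "10 MiB for test", but `1024 * 1024 * 1024` is 1 GiB, not 10 MiB. A named pair of constants would make the intent unambiguous. The names below are hypothetical:

```python
# Illustrative constants; the PR's literal 1024 * 1024 * 1024 is 1 GiB,
# even though its comment says "10 MiB for test".
MiB = 1024 * 1024
MAX_BUF_SIZE_TEST = 10 * MiB     # a small cap, matching the comment's intent
MAX_BUF_SIZE_PROD = 1024 * MiB   # 1 GiB, matching the current literal
```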
    self.find_largest_buffer()

def find_largest_buffer(self):
    self._current_max_buf_length = len(self._bufs[0])
Add a docstring comment
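A docstring along these lines would address the comment; this is a sketch based only on the lines visible in the diff (the full class is not shown), and it also folds the manual loop into a single `max()` call:

```python
def find_largest_buffer(self):
    """Recompute self._current_max_buf_length as the length of the
    largest per-reader buffer.

    The writer can use this value to decide whether appending more
    data would push memory use past MAX_BUF_SIZE.
    """
    self._current_max_buf_length = max(len(buf) for buf in self._bufs)
```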
@@ -71,6 +71,7 @@ def test_not_found(self):

def check_file_target_contents(self, target):
    """Checks to make sure that the specified file has the contents 'hello world'."""
    # This can not be checked, Since
Update comment?
…ksheets into parallel-upload-sync
Reasons for making this change
#4370 does not synchronize the two threads (the file-upload thread and the index-generation thread), so it does not work for large files such as ImageNet (158 GB).
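A minimal sketch of the kind of synchronization this PR adds, assuming a bounded multi-reader buffer: the writer blocks while the slowest reader is more than `MAX_BUF_SIZE` bytes behind, using `threading.Condition`. Class and method names are illustrative, not the PR's exact code.

```python
import threading

class BoundedRelay:
    """One writer feeds the same byte stream to several readers;
    memory stays bounded because the writer waits for slow readers."""

    MAX_BUF_SIZE = 10 * 1024 * 1024  # 10 MiB cap on buffered-but-unread data

    def __init__(self, num_readers=2):
        self._bufs = [bytearray() for _ in range(num_readers)]
        self._cond = threading.Condition()

    def write(self, chunk: bytes):
        with self._cond:
            # Block until every reader's backlog can absorb this chunk.
            self._cond.wait_for(
                lambda: max(len(b) for b in self._bufs) + len(chunk)
                        <= self.MAX_BUF_SIZE
            )
            for buf in self._bufs:
                buf += chunk
            self._cond.notify_all()

    def read(self, reader: int, n: int) -> bytes:
        with self._cond:
            # Block until this reader has n bytes available.
            self._cond.wait_for(lambda: len(self._bufs[reader]) >= n)
            data = bytes(self._bufs[reader][:n])
            del self._bufs[reader][:n]
            self._cond.notify_all()  # wake the writer if it was waiting
            return data
```

With this shape, the upload thread calls `write()` and each index-generation reader calls `read()`; neither can run arbitrarily far ahead of the other.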
Related issues
Screenshots
Checklist