Support optimized fetching and caching #137

Open · kriswuollett opened this issue May 29, 2024 · 3 comments

@kriswuollett

I just started trying out protofetch on a recommendation. However, the first thing I noticed is how slow it can be depending on the source (currently only git is supported). The example I encountered was setting up the following dependency:

[grpc_health_v1]
url = "github.com/grpc/grpc"
revision = "b8a04acbbf18fd1c805e5d53d62ed9fa4721a4d1" # v1.64.0
protocol = "https"
allow_policies = ["src/proto/grpc/health/v1/*"]

The grpc/health/v1/health.proto file is just 2416 B, yet protofetch appears to mirror the entire repo into ~/.cache/protofetch/github.com/grpc/grpc, taking up 416 MB and about a minute before it is ready. Performance is machine- and network-dependent of course; I'm on an M2 Mac. For comparison, here is the output of a shallow git clone I did myself, which also shows network performance:

% git clone --depth=1 https://github.com/grpc/grpc
Cloning into 'grpc'...
remote: Enumerating objects: 13476, done.
remote: Counting objects: 100% (13476/13476), done.
remote: Compressing objects: 100% (8198/8198), done.
remote: Total 13476 (delta 4629), reused 10048 (delta 3865), pack-reused 0
Receiving objects: 100% (13476/13476), 19.37 MiB | 10.66 MiB/s, done.
Resolving deltas: 100% (4629/4629), done.
Updating files: 100% (12308/12308), done.

The shallow clone takes up less space, just 178 MB.

So my thought was: even if a repo mirror lets you serve multiple versions of different deps from the same source, would it really beat, in practice, the efficiency of a shallow fetch that strips out everything but the proto files? The result could perhaps even be wrapped in a .tar.gz that is streamed and decoded in memory when needed. I'd think actual git mirrors or clones would only be necessary if fetching git submodules were supported.
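A rough sketch of that idea in plain shell, reusing the gRPC example above (illustrative only, not what protofetch does today):

% git clone --depth=1 https://github.com/grpc/grpc
% find grpc -type f ! -name '*.proto' -delete    # strip everything except proto files
% tar -czf grpc-protos.tar.gz -C grpc src/proto  # repackage just the protos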

It shouldn't be the user's problem that the proto repo is large.

I also noticed that revision can be a tag or a hash. IMO both should be supported together, using the hash to confirm the tag when both are provided. Git tags are not immutable, and being able to specify both would serve as functional documentation rather than the manual code comment I'm writing now, as in the example above.
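As an illustration, a hypothetical config shape (the tag key is my invention and does not exist in protofetch today):

[grpc_health_v1]
url = "github.com/grpc/grpc"
tag = "v1.64.0" # hypothetical key
revision = "b8a04acbbf18fd1c805e5d53d62ed9fa4721a4d1" # hash confirms the tag
protocol = "https"
allow_policies = ["src/proto/grpc/health/v1/*"]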

In any case, if there is any appetite for a potentially breaking config change in the future, I'd think it would be great to support different fetch types, such as plain HTTP (tarball) with optional sha256 checks, even though the hash of a source archive (e.g., git source archives on some platforms) may not be guaranteed to stay stable long term.
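Again purely as a sketch (neither a type nor a sha256 key exists in protofetch; the checksum below is a placeholder):

[grpc_health_v1]
type = "http" # hypothetical fetch type
url = "https://github.com/grpc/grpc/archive/b8a04acbbf18fd1c805e5d53d62ed9fa4721a4d1.tar.gz"
sha256 = "<placeholder>" # optional integrity check
allow_policies = ["src/proto/grpc/health/v1/*"]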

@kriswuollett (Author)

The performance comparisons to note:

protofetch fetch (416 MB cached in my home dir, 2 minutes 6 seconds):

% time protofetch fetch
INFO Resolving github.com/grpc/grpc
INFO Fetching dependencies source files...
INFO Copying proto files from appbiotic descriptor...
INFO Creating new worktree for grpc_health_v1 at /Users/kris/.cache/protofetch/dependencies/grpc_health_v1/b8a04acbbf18fd1c805e5d53d62ed9fa4721a4d1.
protofetch fetch  58.81s user 6.38s system 51% cpu 2:05.90 total

Shallow git clone (178 MB, 5 seconds):

% time git clone --depth=1 https://github.com/grpc/grpc
Cloning into 'grpc'...
remote: Enumerating objects: 13476, done.
remote: Counting objects: 100% (13476/13476), done.
remote: Compressing objects: 100% (8196/8196), done.
remote: Total 13476 (delta 4632), reused 10045 (delta 3867), pack-reused 0
Receiving objects: 100% (13476/13476), 19.37 MiB | 9.00 MiB/s, done.
Resolving deltas: 100% (4632/4632), done.
Updating files: 100% (12308/12308), done.
git clone --depth=1 https://github.com/grpc/grpc  1.13s user 1.07s system 41% cpu 5.265 total

Command-line curl and tar (984 KB, 4 seconds):

% time curl -sSL https://github.com/grpc/grpc/archive/b8a04acbbf18fd1c805e5d53d62ed9fa4721a4d1.tar.gz | tar -C protos --strip-components 3 -zxf - '**/*.proto' 
curl -sSL   0.23s user 0.07s system 8% cpu 3.720 total
tar -C protos --strip-components 3 -zxf - '**/*.proto'  0.24s user 0.05s system 7% cpu 3.720 total

Repackaging the proto files into a tar.gz archive took 71 ms and would take up only 105 KB on my disk.
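For reference, the repackaging step is itself a one-liner over the protos directory extracted above:

% tar -C protos -czf grpc_health_v1.tar.gz .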

@rtimush (Collaborator) commented May 29, 2024

Hi @kriswuollett, thank you for opening the issue. There is definitely some room for optimization here; we just haven't had enough time to work on it. I will try to take a look.

I also noticed that revision can be a tag or a hash. IMO both should be supported together, using the hash to confirm the tag when both are provided. Git tags are not immutable, and being able to specify both would serve as functional documentation rather than the manual code comment I'm writing now, as in the example above.

The "expected" workflow is that you specify a tag or a branch in your protofetch.toml, and the exact commit hash is written to the protofetch.lock file. This way you can have both reproducible fetches (with protofetch fetch --locked) and ergonomic updates if you are following a branch (with protofetch update).
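Concretely, the two halves of that workflow:

% protofetch update         # re-resolve tags/branches and refresh protofetch.lock
% protofetch fetch --locked # reproducible fetch, pinned to the hashes in protofetch.lock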

@kriswuollett (Author)

The "expected" workflow is that you specify a tag or a branch in your protofetch.toml, and the exact commit hash is written to the protofetch.lock file. This way you can have both reproducible fetches (with protofetch fetch --locked) and ergonomic updates if you are following a branch (with protofetch update).

Ah, yes, the lockfile helps. My previous experience with something similar was showing: sha256 checksums with Bazel's http_archive for external dependencies, which aren't necessarily git-based.

rtimush added a commit that referenced this issue Jun 28, 2024
- When we know the commit hash, only fetch this commit (and its ancestors).
- When we only have a revision/branch, only fetch the relevant refs (and their ancestors).

This makes fetches significantly faster. For example, for googleapis/googleapis, it decreases the time from 1m20s to about 30s.

An even bigger improvement would be to:
1. Shallow fetch. This is supported by libgit2, but I couldn't make it work.
2. Sparse checkout. This is not even supported by libgit2.

#137
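For context, the combination of both ideas with the git CLI (not libgit2) looks roughly like this; a sketch of the technique, not something protofetch currently does:

% git clone --depth=1 --filter=blob:none --sparse https://github.com/grpc/grpc
% cd grpc
% git sparse-checkout set src/proto/grpc/health/v1   # materialize only the proto directory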