
Rework delta handling code in order to support large repositories #81

Open
schrepfler opened this issue Aug 6, 2024 · 1 comment
Labels: enhancement (New feature or request)

@schrepfler

Description

Currently, libraries that use JGit, such as EGit, fail with TooLargeObjectInPackException when the repository being fetched contains large files. Arguably this is not how source control should be used, but these things do happen.

    Caused by: org.eclipse.jgit.errors.TooLargeObjectInPackException: Object too large (2,887,318,710 bytes), rejecting the pack. Max object size limit is 2,147,483,639 bytes.

I believe the default limit should mimic whatever limit C git has; if git's limit is higher, JGit's should be raised to match it; and ultimately it should be possible to disable the check altogether.
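
For context: the limit in the message above, 2,147,483,639 bytes, is exactly Integer.MAX_VALUE - 8, roughly the largest byte[] most JVMs will allocate, so today it is a hard ceiling of the implementation rather than a tunable default. JGit does expose a per-object limit on the receiving side; here is a minimal sketch, assuming the standard ReceivePack API, of setting it as high as it can currently go:

```java
import org.eclipse.jgit.lib.Repository;
import org.eclipse.jgit.transport.ReceivePack;

class LargeObjectLimit {
    // Sketch: raise JGit's configurable object size limit to its ceiling.
    static ReceivePack configure(Repository repo) {
        ReceivePack rp = new ReceivePack(repo);
        // 2_147_483_639 == Integer.MAX_VALUE - 8: the de-facto maximum
        // byte[] size on most JVMs, and the value in the exception above.
        // Anything larger cannot work while delta resolution requires the
        // whole object in a single array.
        rp.setMaxObjectSizeLimit(Integer.MAX_VALUE - 8);
        return rp;
    }
}
```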

As mentioned here, the delta handling code requires the target to be a single Java byte array; perhaps an alternative implementation or code path could be found in order to support bigger repositories.

Motivation

Repositories with large files are unfortunately a fact of life; since hosted Git LFS solutions come at a premium, many people opt to host large files directly in git.

Alternatives considered

No response

Additional context

No response

@tomaswolf (Contributor) commented Aug 19, 2024

This is not trivial. The basic problem is that a delta is composed of COPY and INSERT instructions, and a COPY instruction may copy data from the base out of order. See e.g. the comment at

    * An {@link InputStream} that applies a binary delta to a base on the fly.

So one needs efficient random access to the whole base. A COPY instruction has the format "COPY offset length" and says "copy length bytes from the base, starting at offset, to the output". The offset is a uint32, so limited to 4GB, while the length is in the range [1 .. 2^24-1].
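
To make the format concrete, here is a minimal, self-contained sketch of a delta applier (the names and structure are illustrative, not JGit's actual code). The COPY branch shows why the base needs random access at arbitrary uint32 offsets:

```java
import java.io.IOException;
import java.io.OutputStream;

final class DeltaSketch {
    // Applies a git-style binary delta to a fully materialized base.
    static void apply(byte[] base, byte[] delta, OutputStream out)
            throws IOException {
        int p = skipVarInt(delta, 0); // base size (varint); unused here
        p = skipVarInt(delta, p);     // result size (varint); unused here
        while (p < delta.length) {
            int op = delta[p++] & 0xFF;
            if ((op & 0x80) != 0) { // COPY: low bits select offset/length bytes
                long offset = 0;
                int shift = 0;
                for (int bit = 0x01; bit <= 0x08; bit <<= 1, shift += 8) {
                    if ((op & bit) != 0)
                        offset |= (long) (delta[p++] & 0xFF) << shift;
                }
                int length = 0;
                shift = 0;
                for (int bit = 0x10; bit <= 0x40; bit <<= 1, shift += 8) {
                    if ((op & bit) != 0)
                        length |= (delta[p++] & 0xFF) << shift;
                }
                if (length == 0)
                    length = 0x10000;
                // offset fits in 4 bytes (uint32), length in 3 bytes, so a
                // COPY can reach anywhere in the first 4GB of the base...
                out.write(base, (int) offset, length); // ...but byte[] caps at 2GB
            } else if (op != 0) { // INSERT: 'op' literal bytes from the delta
                out.write(delta, p, op);
                p += op;
            } else {
                throw new IOException("Unsupported delta opcode 0");
            }
        }
    }

    private static int skipVarInt(byte[] buf, int p) {
        while ((buf[p++] & 0x80) != 0) {
            // continuation bit set; keep skipping
        }
        return p;
    }
}
```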

There was an attempt to stream the base, but it turned out to be too slow. See commit 62697c8 and the mail referenced in that commit message.

Also see the comments on Gerrit change 190382.

For applying binary patches, C git has a limit of 1024 * 1024 * 1023 bytes, a little less than 1GB. See https://github.com/git/git/blob/b9849e4f7631d80f146d159bf7b60263b3205632/apply.c#L414 .

For delta compression in pack files, I see no such limit on the length. There is, however, a limit of just 64kB on the copy length: https://github.com/git/git/blob/b9849e4f7631d80f146d159bf7b60263b3205632/diff-delta.c#L432 (for pack v2).

Given that the offset in a COPY instruction is limited to 4GB, one actually "only" needs fast random access to the first 4GB of a base. Perhaps just using multiple arrays (as mentioned in Gerrit change 190382) to cover these first 4GB might be a way. Of course, it might need 4GB (plus some more) of JVM heap...
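
As a purely illustrative sketch of that multi-array idea (none of these names exist in JGit; the chunk size is an arbitrary assumption), the base could be split into fixed-size chunks addressed by a long offset:

```java
// Backs a delta base with multiple byte[] chunks so that the COPY
// instructions' uint32 offsets (up to 4GB) stay addressable.
final class ChunkedBase {
    private static final int CHUNK_SHIFT = 28;       // 256 MiB per chunk
    private static final int CHUNK_SIZE = 1 << CHUNK_SHIFT;

    private final byte[][] chunks;

    ChunkedBase(long size) {
        int n = (int) ((size + CHUNK_SIZE - 1) >>> CHUNK_SHIFT);
        chunks = new byte[n][];
        for (int i = 0; i < n; i++) {
            long remaining = size - ((long) i << CHUNK_SHIFT);
            chunks[i] = new byte[(int) Math.min(CHUNK_SIZE, remaining)];
        }
    }

    // Copies 'length' bytes starting at 'offset' into 'dst', crossing
    // chunk boundaries as needed; 'offset' may exceed Integer.MAX_VALUE.
    void copyTo(long offset, byte[] dst, int dstPos, int length) {
        while (length > 0) {
            int chunk = (int) (offset >>> CHUNK_SHIFT);
            int within = (int) (offset & (CHUNK_SIZE - 1));
            int n = Math.min(length, CHUNK_SIZE - within);
            System.arraycopy(chunks[chunk], within, dst, dstPos, n);
            offset += n;
            dstPos += n;
            length -= n;
        }
    }
}
```

Note that this still keeps the whole base resident on the JVM heap; it only removes the single-array 2GB addressing limit.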

Another idea from that Gerrit change was to apply the 2GB limit only to deltas. But that might give strange effects. (A blob could be handled initially if not delta-compressed, but not after repacking, when it might have become delta-compressed?)

@msohn added the enhancement (New feature or request) label on Aug 29, 2024