improve tar handler to support sparse archives. #684

Merged
merged 1 commit into from
Dec 16, 2023

Conversation

qkaiser
Contributor

@qkaiser qkaiser commented Dec 1, 2023

Resolve #683

TODO:

  • add sparse samples to integration tests

@qkaiser qkaiser added bug Something isn't working format:archive labels Dec 1, 2023
@qkaiser qkaiser self-assigned this Dec 1, 2023
@qkaiser qkaiser marked this pull request as draft December 1, 2023 13:08
@qkaiser qkaiser marked this pull request as ready for review December 14, 2023 12:58
@qkaiser qkaiser requested a review from e3krisztian December 14, 2023 12:58
@martonilles
Contributor

Just wondering, but isn't this a bug in Python's tar implementation? Is there an upstream bug report for it?

I guess the difference comes from where tar takes the size:

    @classmethod
    def frombuf(cls, buf, encoding, errors):
        """Construct a TarInfo object from a 512 byte bytes object.
        """
        if len(buf) == 0:
            raise EmptyHeaderError("empty header")
        if len(buf) != BLOCKSIZE:
            raise TruncatedHeaderError("truncated header")
        if buf.count(NUL) == BLOCKSIZE:
            raise EOFHeaderError("end of file header")

        chksum = nti(buf[148:156])
        if chksum not in calc_chksums(buf):
            raise InvalidHeaderError("bad checksum")

        obj = cls()
        obj.name = nts(buf[0:100], encoding, errors)
        obj.mode = nti(buf[100:108])
        obj.uid = nti(buf[108:116])
        obj.gid = nti(buf[116:124])
        obj.size = nti(buf[124:136])
        obj.mtime = nti(buf[136:148])
        obj.chksum = chksum
        obj.type = buf[156:157]
        obj.linkname = nts(buf[157:257], encoding, errors)
        obj.uname = nts(buf[265:297], encoding, errors)
        obj.gname = nts(buf[297:329], encoding, errors)
        obj.devmajor = nti(buf[329:337])
        obj.devminor = nti(buf[337:345])
        prefix = nts(buf[345:500], encoding, errors)

        # Old V7 tar format represents a directory as a regular
        # file with a trailing slash.
        if obj.type == AREGTYPE and obj.name.endswith("/"):
            obj.type = DIRTYPE

        # The old GNU sparse format occupies some of the unused
        # space in the buffer for up to 4 sparse structures.
        # Save them for later processing in _proc_sparse().
        if obj.type == GNUTYPE_SPARSE:
            pos = 386
            structs = []
            for i in range(4):
                try:
                    offset = nti(buf[pos:pos + 12])
                    numbytes = nti(buf[pos + 12:pos + 24])
                except ValueError:
                    break
                structs.append((offset, numbytes))
                pos += 24
            isextended = bool(buf[482])
            origsize = nti(buf[483:495])
            obj._sparse_structs = (structs, isextended, origsize)

        # Remove redundant slashes from directories.
        if obj.isdir():
            obj.name = obj.name.rstrip("/")

        # Reconstruct a ustar longname.
        if prefix and obj.type not in GNU_TYPES:
            obj.name = prefix + "/" + obj.name
        return obj

@qkaiser
Contributor Author

qkaiser commented Dec 14, 2023

@martonilles to me it's not an upstream bug. The developers chose to have a TarInfo's size attribute represent the original size of the file that was put in the archive, not the size of the sparse version that's actually stored in the archive. Probably so they can display nice file listings à la 7z, even for sparse archives.

They do this in _proc_sparse, which is called from _proc_member. In the function, once the sparse structs are parsed they simply do:

self.size = origsize

That's why unblob fails on sparse tar archives: the handler computes an end offset past the end of the archive, because the original file size is bigger than the size actually stored.
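To illustrate the mismatch, here is a minimal sketch that hand-crafts a one-member archive in the old GNU sparse format and compares what tarfile reports with what the header's size field says. The helper names (`octal_field`, `build_sparse_tar`) and the sample values are made up for the illustration, and the "stored end" computation is just one way to stay inside the archive, not necessarily what this PR ends up doing.

```python
import io
import tarfile

BLOCKSIZE = 512


def octal_field(value: int, width: int) -> bytes:
    """Encode an integer as a NUL-terminated octal tar header field."""
    return f"{value:0{width - 1}o}".encode("ascii") + b"\0"


def build_sparse_tar(stored: bytes, data_offset: int, origsize: int) -> bytes:
    """Hypothetical helper: build a tar with a single old-GNU-sparse member."""
    info = tarfile.TarInfo("sparse.bin")
    info.size = len(stored)  # header size field = bytes actually stored
    header = bytearray(info.tobuf(format=tarfile.GNU_FORMAT))
    header[156:157] = tarfile.GNUTYPE_SPARSE         # type flag b"S"
    header[386:398] = octal_field(data_offset, 12)   # first sparse struct: offset
    header[398:410] = octal_field(len(stored), 12)   # first sparse struct: numbytes
    header[482] = 0                                  # isextended: no extra blocks
    header[483:495] = octal_field(origsize, 12)      # original (expanded) size
    # Recompute the checksum with the checksum field filled with spaces.
    header[148:156] = b" " * 8
    header[148:156] = f"{sum(header):06o}".encode("ascii") + b"\0 "
    padding = b"\0" * (-len(stored) % BLOCKSIZE)
    # Header, padded data, then two zero blocks marking the end of the archive.
    return bytes(header) + stored + padding + b"\0" * (2 * BLOCKSIZE)


archive = build_sparse_tar(b"hello", data_offset=1024, origsize=4096)

with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
    member = tar.getmembers()[0]

# Size as written in the header (bytes 124..135) vs. size reported by tarfile.
stored_size = int(archive[124:135].decode("ascii"), 8)

# Using TarInfo.size lands past the end of the archive...
naive_end = member.offset_data + member.size
# ...while the stored size, rounded up to a full block, stays inside it.
stored_end = member.offset_data + stored_size + (-stored_size % BLOCKSIZE)

print(f"TarInfo.size={member.size} header size={stored_size}")
print(f"naive end={naive_end} stored end={stored_end} archive length={len(archive)}")
```

With these sample values, the end offset derived from `TarInfo.size` is larger than the archive itself, which is the failure described above.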

@qkaiser
Contributor Author

qkaiser commented Dec 14, 2023

Found a better solution with @e3krisztian

@qkaiser qkaiser enabled auto-merge December 16, 2023 13:07
@qkaiser qkaiser merged commit ad76949 into main Dec 16, 2023
13 checks passed
@qkaiser qkaiser deleted the 683-fix-sparse-tar branch December 16, 2023 13:10
Labels
bug Something isn't working format:archive
Development

Successfully merging this pull request may close these issues.

Missing support for sparse tar archives
3 participants