improve tar handler to support sparse archives. #684

qkaiser · 2023-12-01T13:08:42Z

Resolve #683

TODO:

add sparse samples to integration tests

martonilles · 2023-12-14T16:51:27Z

Just wondering, but isn't this a bug in Python's tarfile implementation? Is there an upstream bug report for it?

I guess the difference comes from where tar takes the size:

    @classmethod
    def frombuf(cls, buf, encoding, errors):
        """Construct a TarInfo object from a 512 byte bytes object.
        """
        if len(buf) == 0:
            raise EmptyHeaderError("empty header")
        if len(buf) != BLOCKSIZE:
            raise TruncatedHeaderError("truncated header")
        if buf.count(NUL) == BLOCKSIZE:
            raise EOFHeaderError("end of file header")

        chksum = nti(buf[148:156])
        if chksum not in calc_chksums(buf):
            raise InvalidHeaderError("bad checksum")

        obj = cls()
        obj.name = nts(buf[0:100], encoding, errors)
        obj.mode = nti(buf[100:108])
        obj.uid = nti(buf[108:116])
        obj.gid = nti(buf[116:124])
        obj.size = nti(buf[124:136])
        obj.mtime = nti(buf[136:148])
        obj.chksum = chksum
        obj.type = buf[156:157]
        obj.linkname = nts(buf[157:257], encoding, errors)
        obj.uname = nts(buf[265:297], encoding, errors)
        obj.gname = nts(buf[297:329], encoding, errors)
        obj.devmajor = nti(buf[329:337])
        obj.devminor = nti(buf[337:345])
        prefix = nts(buf[345:500], encoding, errors)

        # Old V7 tar format represents a directory as a regular
        # file with a trailing slash.
        if obj.type == AREGTYPE and obj.name.endswith("/"):
            obj.type = DIRTYPE

        # The old GNU sparse format occupies some of the unused
        # space in the buffer for up to 4 sparse structures.
        # Save them for later processing in _proc_sparse().
        if obj.type == GNUTYPE_SPARSE:
            pos = 386
            structs = []
            for i in range(4):
                try:
                    offset = nti(buf[pos:pos + 12])
                    numbytes = nti(buf[pos + 12:pos + 24])
                except ValueError:
                    break
                structs.append((offset, numbytes))
                pos += 24
            isextended = bool(buf[482])
            origsize = nti(buf[483:495])
            obj._sparse_structs = (structs, isextended, origsize)

        # Remove redundant slashes from directories.
        if obj.isdir():
            obj.name = obj.name.rstrip("/")

        # Reconstruct a ustar longname.
        if prefix and obj.type not in GNU_TYPES:
            obj.name = prefix + "/" + obj.name
        return obj
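
To make the two size fields concrete, here is a minimal sketch that reads them straight from a raw header block, using the same byte ranges as the `frombuf` listing above. The helper names (`parse_octal`, `sparse_sizes`) and the hand-built header are hypothetical and only for illustration; base-256 encoded fields and extended sparse headers are not handled.

```python
BLOCKSIZE = 512


def parse_octal(field: bytes) -> int:
    """Parse a NUL- or space-terminated octal tar header field."""
    text = field.split(b"\0", 1)[0].strip().decode("ascii")
    return int(text or "0", 8)


def sparse_sizes(header: bytes) -> tuple[int, int]:
    """Return (stored_size, real_size) for an old GNU sparse header block."""
    if len(header) != BLOCKSIZE:
        raise ValueError("a tar header is exactly 512 bytes")
    if header[156:157] != b"S":
        raise ValueError("not an old GNU sparse header")
    stored_size = parse_octal(header[124:136])  # bytes present in the archive
    real_size = parse_octal(header[483:495])    # size of the original file
    return stored_size, real_size


# Hand-built header (illustrative only): 512 bytes stored, 1 MiB original.
header = bytearray(BLOCKSIZE)
header[124:136] = b"00000001000\0"
header[156:157] = b"S"
header[483:495] = b"00004000000\0"

stored_size, real_size = sparse_sizes(bytes(header))
print(stored_size, real_size)
```

The field at 124:136 is what determines how many data blocks follow the header in the archive, while the field at 483:495 only describes the file as it existed before archiving.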

qkaiser · 2023-12-14T17:33:28Z

@martonilles to me it's not an upstream bug. The developers chose to have a TarInfo's size attribute represent the original size of the file that was added to the archive, not the size of the sparse version that's actually stored in it. Probably so they can display a nice listing of the contents à la 7z, even for sparse archives.

They do this in _proc_sparse, which is called from _proc_member. In that function, once the sparse structs are parsed, they simply do:

    self.size = origsize

That's why unblob fails on sparse tar archives: the original file size is larger than the sparse size stored in the archive, so the handler computes an end offset past the end of the archive.
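
The effect can be reproduced with a small in-memory archive. This is a sketch under stated assumptions, not the code from this PR: `octal_field` and `old_gnu_sparse_header` are hypothetical helpers that hand-build a single old GNU sparse header, and `naive_end` stands in for an end-offset calculation that trusts `TarInfo.size`.

```python
import io
import tarfile

BLOCKSIZE = 512


def octal_field(value: int, width: int) -> bytes:
    """Encode value as zero-padded octal digits followed by a NUL."""
    digits = width - 1
    return f"{value:0{digits}o}".encode("ascii") + b"\0"


def old_gnu_sparse_header(name: bytes, stored_size: int, real_size: int) -> bytes:
    """Build one old GNU sparse header (hypothetical helper, illustration only)."""
    header = bytearray(BLOCKSIZE)
    header[0:len(name)] = name
    header[100:108] = octal_field(0o644, 8)          # mode
    header[124:136] = octal_field(stored_size, 12)   # bytes stored in the archive
    header[136:148] = octal_field(0, 12)             # mtime
    header[156:157] = b"S"                           # GNUTYPE_SPARSE
    header[257:265] = b"ustar  \0"                   # GNU magic
    header[386:398] = octal_field(0, 12)             # sparse struct: offset
    header[398:410] = octal_field(stored_size, 12)   # sparse struct: numbytes
    header[483:495] = octal_field(real_size, 12)     # original file size
    header[148:156] = b" " * 8                       # checksum field counts as spaces
    checksum = sum(header)
    header[148:156] = f"{checksum:06o}".encode("ascii") + b"\0 "
    return bytes(header)


stored_size, real_size = 512, 1024 * 1024
archive = (
    old_gnu_sparse_header(b"sparse.bin", stored_size, real_size)
    + b"A" * stored_size      # the only data block actually stored
    + bytes(2 * BLOCKSIZE)    # end-of-archive marker
)

with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
    member = tar.getmembers()[0]

# An end offset derived from TarInfo.size overshoots the real archive.
naive_end = member.offset_data + member.size
print(member.size, len(archive), naive_end)
```

Here the archive is only a few blocks long, yet `member.size` reports the original file size, so any end offset derived from it lands well beyond the last byte of the archive.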

qkaiser · 2023-12-14T18:31:40Z

Found a better solution with @e3krisztian

qkaiser added the bug and format:archive labels Dec 1, 2023

qkaiser self-assigned this Dec 1, 2023

qkaiser marked this pull request as draft December 1, 2023 13:08

qkaiser force-pushed the 683-fix-sparse-tar branch from 623b4b7 to e30118f Compare December 14, 2023 12:57

qkaiser marked this pull request as ready for review December 14, 2023 12:58

qkaiser requested a review from e3krisztian December 14, 2023 12:58

qkaiser force-pushed the 683-fix-sparse-tar branch from e30118f to 7b22f30 Compare December 14, 2023 13:02

qkaiser force-pushed the 683-fix-sparse-tar branch from 7b22f30 to 7e81c74 Compare December 14, 2023 18:31

e3krisztian approved these changes Dec 14, 2023

fix(handler): improve tar handler to support sparse archives.

2b5a2fa

qkaiser force-pushed the 683-fix-sparse-tar branch from 7e81c74 to 2b5a2fa Compare December 16, 2023 13:07

qkaiser enabled auto-merge December 16, 2023 13:07

qkaiser merged commit ad76949 into main Dec 16, 2023
13 checks passed

qkaiser deleted the 683-fix-sparse-tar branch December 16, 2023 13:10
