BUG REPORT: Add buggy test case (files created by pbzip2) #7

luispedro · 2019-07-31T11:41:56Z

This is more of a bug report than a real PR in that it adds a test case that fails, but I have no fix ATM.

Originally reported as a bug in NGLess (see ngless-toolkit/ngless#116). After the original report, @unode provided the following analysis:

> If using pbzip2 the parallel version of bzip2 to create the files,
> ngless is able to consume the files up to a certain size. In the
> test-case I setup locally a Fastq file with 9724 lines, (266413 bytes
> compressed, 900170 uncompressed) causes ngless to fail with
> BZ2_bzDecompress: -1. Regular unix bzip2 is able to decompress the file
> without problems.
>
> On the other hand if using regular bzip2, tried as many as 90000 lines
> and ngless is still able to consume the files without error.

There is more detail at ngless-toolkit/ngless#116

@unode

snoyberg · 2019-08-01T06:34:42Z

It sounds likely that this is an issue stemming from the underlying C library, not from the Haskell code itself. Interested in trying to update to the latest version of the underlying library and see if that fixes this?

unode · 2019-08-01T11:26:52Z

Not much has changed in the underlying lib.
The patch at https://gist.github.com/unode/d4369136214c831bd66e628261049900 highlights the differences (split in two commits to highlight code vs boilerplate changes)

luispedro · 2019-08-02T07:57:18Z

I played a bit more with this. The cbits are only used on Windows; on Linux, the library is linked against the system's bz2, so the version of the bundled C code should not matter.

luispedro · 2019-08-02T13:41:04Z

I now think that this is caused by the Haskell interface not correctly handling files in which several bzip2 streams are concatenated (other tools appear to decompress such a file to the concatenation of the decompressed inputs).

Curiously enough, I had reported this exact phenomenon for the gzip conduit: snoyberg/conduit#254

luispedro · 2019-08-02T18:04:08Z

A further airport hacking session confirmed that it's about multiple concatenated streams in a single file.

Depending on how delayed my flight is, I will fix this in a bit or after the week-end.

Unlike with the gzip conduit, I propose to change the API so that a new decompress1 function extracts a single stream, and decompress/bunzip2 extracts all. This is because, unlike in the gzip case, there is no backwards-compatibility argument: currently it crashes on multiple streams.

sample5.bz2 is simply "cat sample1.bz2 sample1.bz2" (and sample5.ref is "cat sample1.ref sample1.ref"). This is handled by bzip2 tools (including the official tool and wrappers such as the Python one).
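The concatenation case can also be reproduced in memory. The following is only a sketch: it assumes a bzlib-conduit build that includes the multi-stream fix and the `Conduit` module from the conduit package, and the payload and the name `roundTripTwice` are made up for illustration. It compresses a payload once, appends the compressed bytes to themselves (the in-memory analogue of "cat sample1.bz2 sample1.bz2"), and decompresses the result with bunzip2.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import           Conduit              (runConduitRes, sinkLazy, sourceLazy, (.|))
import qualified Data.ByteString.Lazy as L
import           Data.Conduit.BZlib   (bunzip2, bzip2)

-- | Hypothetical helper: compress the payload, concatenate the compressed
-- stream with itself, and decompress the resulting two-stream input.
roundTripTwice :: L.ByteString -> IO L.ByteString
roundTripTwice payload = do
    compressed <- runConduitRes $ sourceLazy payload .| bzip2 .| sinkLazy
    let twoStreams = compressed <> compressed
    runConduitRes $ sourceLazy twoStreams .| bunzip2 .| sinkLazy

main :: IO ()
main = do
    let payload = "@read1\nACGT\n+\nIIII\n" :: L.ByteString
    out <- roundTripTwice payload
    -- With multi-stream support the output should be the payload twice.
    print (out == payload <> payload)
```

If bunzip2 extracts all streams as proposed, the comparison should hold; a wrapper that stops after (or fails on) the first stream would not produce the doubled payload.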

snoyberg · 2019-08-05T02:53:29Z

src/Data/Conduit/BZlib.hs

+            if ret == c'BZ_STREAM_END
+                then do
+                    dataIn <- liftIO $ peek $ p'bz_stream'next_in ptr
+                    unread <- liftIO $ S.packCStringLen (dataIn, fromEnum availIn)


fromIntegral is more idiomatic I think.

I actually like fromEnum as more clearly converting to Int, but I'll take your lead as it's your project.
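For readers comparing the two conversions, here is a small sketch. It assumes availIn is a CUInt (the diff above does not show its type), and the value 4096 and the names `viaFromEnum` and `viaFromIntegral` are purely illustrative. For an in-range value both functions give the same Int; the difference is that fromEnum can only return Int, while the result type of fromIntegral is chosen by the context.

```haskell
module Main (main) where

import Foreign.C.Types (CUInt)

-- Hypothetical stand-in for the avail_in value read from the bz_stream struct.
availIn :: CUInt
availIn = 4096

-- fromEnum always produces an Int.
viaFromEnum :: Int
viaFromEnum = fromEnum availIn

-- fromIntegral converts to whichever numeric type the context requests;
-- the type signature pins it to Int here.
viaFromIntegral :: Int
viaFromIntegral = fromIntegral availIn

main :: IO ()
main = print (viaFromEnum, viaFromIntegral, viaFromEnum == viaFromIntegral)
```

Either spelling satisfies the Int length expected by S.packCStringLen, so the choice is a matter of style, as the thread concludes.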

snoyberg · 2019-08-05T02:55:52Z

src/Data/Conduit/BZlib.hs

+    next <- await
+    case next of
+        Nothing -> return ()
+        Just bs -> do


Depending on the data source, it's theoretically possible that at the end of a stream we may receive an empty ByteString. This code would treat such a chunk as the start of a new stream. It may be better to integrate this logic more closely with the multi-stream detection above.

That's a fair catch. I'll change the logic to ignore empty ByteStrings.
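The idea of ignoring empty chunks can be modelled on a plain list, without any conduit machinery. This is a sketch of the decision only, not the code that was pushed to the PR; the name `nextStreamInput` is hypothetical, and the list stands in for the chunks that successive await calls would deliver after a stream has ended.

```haskell
module Main (main) where

import qualified Data.ByteString.Char8 as S8

-- | Decide whether another compressed stream follows the one that just ended.
-- Empty chunks carry no data, so they are skipped instead of being treated
-- as the beginning of a new stream.
nextStreamInput :: [S8.ByteString] -> Maybe [S8.ByteString]
nextStreamInput chunks =
    case dropWhile S8.null chunks of
        []   -> Nothing    -- nothing but empty chunks left: input is finished
        rest -> Just rest  -- real data left: another stream starts here

main :: IO ()
main = do
    print (nextStreamInput [S8.empty, S8.empty])
    print (nextStreamInput [S8.empty, S8.pack "BZh9"])
```

The first call reports that the input is finished, while the second keeps only the non-empty chunk as the start of the next stream.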

The bzip2 utilities accept files that are concatenations of bzip2 streams. Previously, the Haskell wrapper would throw an error in this case. This adds the decompress1 conduit which extracts just one stream if desired.

luispedro · 2019-08-05T13:05:33Z

Force pushed the requested changes.

snoyberg · 2019-08-05T14:35:03Z

Thanks!

TST Futher failing test

46bf610

luispedro mentioned this pull request Aug 2, 2019

Simplify output #8

Merged

snoyberg requested changes Aug 5, 2019

Fix handling of files with multiple streams

0f20d01

luispedro force-pushed the bzip2_chunked_bug branch from 6fed8c0 to 0f20d01 Compare August 5, 2019 13:03

snoyberg merged commit 584fad9 into snoyberg:master Aug 5, 2019

luispedro mentioned this pull request Aug 5, 2019

Reading .bz2 files fails to decompress or segfaults ngless-toolkit/ngless#116

Open

luispedro deleted the bzip2_chunked_bug branch August 5, 2019 16:41
