-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG REPORT: Add buggy test case (files created by pbzip2) #7
Conversation
Originally reported as a bug in NGLess (see ngless-toolkit/ngless#116). After the original report, @unode provided the following analysis: > If using pbzip2 the parallel version of bzip2 to create the files, > ngless is able to consume the files up to a certain size. In the > test-case I setup locally a Fastq file with 9724 lines, (266413 bytes > compressed, 900170 uncompressed) causes ngless to fail with > BZ2_bzDecompress: -1. Regular unix bzip2 is able to decompress the file > without problems. > > On the other hand if using regular bzip2, tried as many as 90000 lines > and ngless is still able to consume the files without error.
It sounds likely that this is an issue stemming from the underlying C library, not from the Haskell code itself. Interested in trying to update to the latest version of the underlying library and see if that fixes this? |
Not much has changed in the underlying lib. |
I played a bit more with this, the |
I now think that this is caused by the Haskell interface not correctly handling the case where bzip files are concatenated together (while other tools appear to produce the concatenation of the inputs). Curiously enough, I had reported this exact phenomenon on the gzip conduit snoyberg/conduit#254 |
A further airport hacking session confirmed that it's about multiple concatenated streams in a single file. Depending on how delayed my flight is, I will fix this in a bit or after the week-end. Unlike with the |
sample5.bz2 is simply "cat sample1.bz2 sample1.bz2" (and sample5.ref is "cat sample1.ref sample1.ref"). This is handled by bzip2 tools (including the official tool and wrapper such as the Python wrapper).
src/Data/Conduit/BZlib.hs
Outdated
if ret == c'BZ_STREAM_END | ||
then do | ||
dataIn <- liftIO $ peek $ p'bz_stream'next_in ptr | ||
unread <- liftIO $ S.packCStringLen (dataIn, fromEnum availIn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fromIntegral is more idiomatic I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually like fromEnum
as more clearly converting to Int
, but I'll take your lead as it's your project.
src/Data/Conduit/BZlib.hs
Outdated
next <- await | ||
case next of | ||
Nothing -> return () | ||
Just bs -> do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on data source, it's theoretically possible that at end of stream we may receive an empty ByteString. This code would treat such a chunk as the start of a new stream. It may be better to integrate this logic more closely with the multi stream detection above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a fair catch. I'll change the logic to ignore empty ByteString
s
The bzip2 utilities accept files that are concatenations of bzip2 streams. Previously, the Haskell wrapper would throw an error in this case. This adds the decompress1 conduit which extracts just one stream if desired.
6fed8c0
to
0f20d01
Compare
Force pushed the requested changes. |
Thanks! |
This is more of a bug report than a real PR in that it adds a test case that fails, but I have no fix ATM.
Originally reported as a bug in NGLess (see ngless-toolkit/ngless#116). After the original report, @unode provided the following analysis:
There is more detail at ngless-toolkit/ngless#116