-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Difference between falco and fastqc in reported polyA/polyG content #69
Comments
You're right @wm75 I'll see about fixing it this week. |
Did you have any time for this yet @andrewdavidsmith ? |
@wm75 No, but I can give a quick fix in a separate branch that will lose some speed. Otherwise I might be able to get it done in a week, but school is back in and I'm a professor. From my perspective, the issue is mostly that the fastest way to fix it (and keep the current efficiency) would take the code off a reasonable path of maintainability. However, since you're clearly still interested, that makes it worthwhile for me to do, even if I need to find a different solution down the line. So here's what I can do:
I would only do (1) above if you would use it from a clone of a particular branch, which means building from source. Number (3) would obviously only be noticeable to me or whomever might be maintaining the code in the future. |
Thanks @andrewdavidsmith for providing some context here, so lets do the same from our (the Galaxy Project's) side: On public Galaxy servers (usegalaxy.org, usegalaxy.eu, usegalaxy.org.au, etc.) FastQC is invariably among the handful of top tools with regard to total compute time (summed up over all users) that seems worth optimization. Now how does this background fit against your proposed solutions: The separate branch you propose in 1. could certainly be useful in terms of benchmarking, but we cannot offer an unreleased version on our servers. So if number 2 is within reach and you can imagine a proper release with that fix, I would say you can skip number 1 and we just wait a bit longer? |
@wm75 Could you test an unofficial release before I release it? I think I see a better way to do this than I had previously, so it might be quicker. I don't know how many test cases you might have lined up. Also, does the precision of the output matter (sorry if already asked)? I see above that the precision is higher in the FastQC output. I thought we had made falco equal, but it's a moving target obviously. |
Thanks so much @andrewdavidsmith! |
I was about to say perfect work here, @andrewdavidsmith, then decided to run a final test with BAM input. For fastq inputs, however, things look very good now and processing is still fast. |
@wm75 Thanks very much for this -- I'll check it out asap. |
@wm75 I have a quick fix in a separate branch bam-adapter-bugfix. I don't have time to test right now, but if you either send me your test data (if that's ok and you have the time) or if you can test it yourself, it would accelerate. (For documentation) The difference is in one line of code in the function that reads each record from BAM format. The indicator vector for "specific adapter found in current read" was not reset for BAM reads. Here a48cb66 you can see the change. It should be similar to line 440 in the same file highlighted here where the reset was done for non-BAM file formats. If the logic above is correct, we have a fix. Otherwise it's a deeper issue. I don't currently have BAM format among the regression tests, so I'll try to do that when I get a chance to test. If you test sooner, let me know; if it works I'll make a PR to merge the branch, then move to release. |
@andrewdavidsmith you were spot on with your change! Regarding test data for the bam case I can strip down some of my datasets and provide them to you. I don't think I'll get to it this week anymore, but I'll do it as soon as possible. |
@wm75 No worries on the dataset. If you find yourself with a small one that won't identify any project info, I'm happy to but it alongside the existing regression test cases. In any case: I appreciate your help with this! I'll close this issue and make a release hopefully today or tomorrow. |
The output of falco (1.2.3) and fastqc (0.12.1) is markedly different for polyA stats (and I assume would also be for polyG, but haven't tested).
I haven't looked into the actual code producing them, but compared to fastqc the values from falco seem to increase much faster, here's an example of the corresponding section of the reports:
First fastqc:
next falco:
My first guess would be that falco counts longer polyA runs multiple times?
The input data was https://raw.githubusercontent.com/galaxyproject/tools-iuc/6b50408a1ff7902575be37b2fa21aa80fe684e5c/tools/falco/test-data/1000trimmed.fastq and both tools where run with just default settings.
The text was updated successfully, but these errors were encountered: