Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Difference between falco and fastqc in reported polyA/polyG content #69

Closed
wm75 opened this issue Aug 19, 2024 · 12 comments
Closed

Difference between falco and fastqc in reported polyA/polyG content #69

wm75 opened this issue Aug 19, 2024 · 12 comments

Comments

@wm75
Copy link

wm75 commented Aug 19, 2024

The output of falco (1.2.3) and fastqc (0.12.1) is markedly different for polyA stats (and I assume would also be for polyG, but haven't tested).
I haven't looked into the actual code producing them, but compared to fastqc the values from falco seem to increase much faster, here's an example of the corresponding section of the reports:

First fastqc:

#Position	Illumina Universal Adapter	Illumina Small RNA 3' Adapter	Illumina Small RNA 5' Adapter	Nextera Transposase Sequence	PolyA	PolyG
1	0.0	0.0	0.0	0.0	0.020387359836901122	0.0
2	0.0	0.0	0.0	0.0	0.06116207951070336	0.0
3	0.0	0.0	0.0	0.0	0.06116207951070336	0.0
4	0.0	0.0	0.0	0.0	0.06116207951070336	0.0
5	0.0	0.0	0.0	0.0	0.12232415902140673	0.0
6	0.0	0.0	0.0	0.0	0.12232415902140673	0.0
7	0.0	0.0	0.0	0.0	0.12232415902140673	0.0
8	0.0	0.0	0.0	0.0	0.14271151885830785	0.0
9	0.0	0.0	0.0	0.0	0.14271151885830785	0.0
10-11	0.0	0.0	0.0	0.0	0.14271151885830785	0.0
12-13	0.0	0.0	0.0	0.0	0.1529051987767584	0.0
14-15	0.0	0.0	0.0	0.0	0.17329255861365955	0.0
16-17	0.0	0.0	0.0	0.0	0.21406727828746175	0.0
18-19	0.0	0.0	0.0	0.0	0.22426095820591233	0.0
20-21	0.12232415902140673	0.0	0.0	0.0	0.24464831804281345	0.0
22-23	0.12232415902140673	0.0	0.0	0.0	0.24464831804281345	0.0
24-25	0.12232415902140673	0.0	0.0	0.0	0.24464831804281345	0.0
26-27	0.1325178389398573	0.0	0.0	0.0	0.24464831804281345	0.0
28-29	0.2038735983690112	0.0	0.0	0.0	0.24464831804281345	0.0
30-31	0.22426095820591233	0.0	0.0	0.0	0.24464831804281345	0.0
32-33	0.22426095820591233	0.0	0.0	0.0	0.2650356778797146	0.0
34-35	0.24464831804281345	0.0	0.0	0.0	0.29561671763506625	0.0
36-37	0.3058103975535168	0.0	0.0	0.0	0.31600407747196735	0.0
38-39	0.4383282364933741	0.0	0.0	0.0	0.32619775739041795	0.0
40-41	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
42-43	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
44-45	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
46-47	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
48-49	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
50-51	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
52-53	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
54-55	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
56-57	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
58-59	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
60-61	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
62-63	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
64-65	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
66-67	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
68-69	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
70-71	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
72-73	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
74-75	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
76-77	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
78-79	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
80-81	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
82-83	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
84-85	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
86-87	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
88-89	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
90-91	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
92-93	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
94-95	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0
96-97	0.4689092762487258	0.0	0.0	0.0	0.32619775739041795	0.0

next falco:

#Position	Illumina Universal Adapter	Illumina Small RNA 3' Adapter	Illumina Small RNA 5' Adapter	Nextera Transposase Sequence	PolyA	PolyG
1	0	0	0	0	0.0203874	0
2	0	0	0	0	0.0815494	0
3	0	0	0	0	0.142712	0
4	0	0	0	0	0.183486	0
5	0	0	0	0	0.285423	0
6	0	0	0	0	0.38736	0
7	0	0	0	0	0.489297	0
8	0	0	0	0	0.591233	0
9	0	0	0	0	0.672783	0
10	0	0	0	0	0.754332	0
11	0	0	0	0	0.835882	0
12	0	0	0	0	0.917431	0
13	0	0	0	0	1.01937	0
14	0	0	0	0	1.1213	0
15	0	0	0	0	1.24363	0
16	0	0	0	0	1.34557	0
17	0	0	0	0	1.46789	0
18	0	0	0	0	1.59021	0
19	0	0	0	0	1.67176	0
20	0.122324	0	0	0	1.75331	0
21	0.122324	0	0	0	1.83486	0
22	0.122324	0	0	0	1.89602	0
23	0.122324	0	0	0	1.95719	0
24	0.122324	0	0	0	2.01835	0
25	0.122324	0	0	0	2.07951	0
26	0.122324	0	0	0	2.14067	0
27	0.142712	0	0	0	2.20183	0
28	0.183486	0	0	0	2.263	0
29	0.224261	0	0	0	2.32416	0
30	0.224261	0	0	0	2.38532	0
31	0.224261	0	0	0	2.44648	0
32	0.224261	0	0	0	2.50765	0
33	0.224261	0	0	0	2.60958	0
34	0.224261	0	0	0	2.71152	0
35	0.265036	0	0	0	2.81346	0
36	0.285423	0	0	0	2.89501	0
37	0.326198	0	0	0	2.99694	0
38	0.407747	0	0	0	3.09888	0
39	0.468909	0	0	0	3.20082	0
40	0.468909	0	0	0	3.30275	0
41	0.468909	0	0	0	3.40469	0
42	0.468909	0	0	0	3.48624	0
43	0.468909	0	0	0	3.56779	0
44	0.468909	0	0	0	3.60856	0
45	0.468909	0	0	0	3.62895	0
46	0.468909	0	0	0	3.64934	0
47	0.468909	0	0	0	3.66972	0
48	0.468909	0	0	0	3.69011	0
49	0.468909	0	0	0	3.7105	0
50	0.468909	0	0	0	3.7105	0
51	0.468909	0	0	0	3.7105	0
52	0.468909	0	0	0	3.7105	0
53	0.468909	0	0	0	3.7105	0
54	0.468909	0	0	0	3.7105	0
55	0.468909	0	0	0	3.7105	0
56	0.468909	0	0	0	3.7105	0
57	0.468909	0	0	0	3.7105	0
58	0.468909	0	0	0	3.7105	0
59	0.468909	0	0	0	3.73089	0
60	0.468909	0	0	0	3.75127	0
61	0.468909	0	0	0	3.77166	0
62	0.468909	0	0	0	3.81244	0
63	0.468909	0	0	0	3.85321	0
64	0.468909	0	0	0	3.89399	0
65	0.468909	0	0	0	3.93476	0
66	0.468909	0	0	0	3.97554	0
67	0.468909	0	0	0	4.01631	0
68	0.468909	0	0	0	4.05708	0
69	0.468909	0	0	0	4.09786	0
70	0.468909	0	0	0	4.13863	0
71	0.468909	0	0	0	4.17941	0
72	0.468909	0	0	0	4.22018	0
73	0.468909	0	0	0	4.26096	0
74	0.489297	0	0	0	4.32212	0
75	0.489297	0	0	0	4.38328	0
76	0.489297	0	0	0	4.42406	0
77	0.489297	0	0	0	4.46483	0
78	0.489297	0	0	0	4.50561	0
79	0.489297	0	0	0	4.54638	0
80	0.489297	0	0	0	4.58716	0
81	0.489297	0	0	0	4.62793	0
82	0.489297	0	0	0	4.66871	0
83	0.509684	0	0	0	4.70948	0
84	0.509684	0	0	0	4.75025	0
85	0.509684	0	0	0	4.79103	0
86	0.509684	0	0	0	4.8318	0
87	0.509684	0	0	0	4.91335	0
88	0.509684	0	0	0	4.9949	0
89	0.509684	0	0	0	5.05607	0
90	0.509684	0	0	0	5.09684	0
91	0.509684	0	0	0	5.158	0
92	0.570846	0	0	0	5.21916	0
93	0.632008	0	0	0	5.28033	0
94	0.632008	0	0	0	5.34149	0
95	0.632008	0	0	0	5.40265	0
96	0.632008	0	0	0	5.46381	0
97	0.632008	0	0	0	5.52497	0
98	0.632008	0	0	0	5.52497	0
99	0.632008	0	0	0	5.52497	0
100	0.632008	0	0	0	5.52497	0
101	0.632008	0	0	0	5.52497	0
102	0.632008	0	0	0	5.52497	0
103	0.632008	0	0	0	5.52497	0
104	0.632008	0	0	0	5.52497	0
105	0.632008	0	0	0	5.52497	0
106	0.632008	0	0	0	5.52497	0
107	0.632008	0	0	0	5.52497	0
108	0.632008	0	0	0	5.52497	0

My first guess would be that falco counts longer polyA runs multiple times?

The input data was https://raw.githubusercontent.com/galaxyproject/tools-iuc/6b50408a1ff7902575be37b2fa21aa80fe684e5c/tools/falco/test-data/1000trimmed.fastq and both tools where run with just default settings.

@andrewdavidsmith
Copy link
Collaborator

You're right @wm75 I'll see about fixing it this week.

@wm75
Copy link
Author

wm75 commented Sep 17, 2024

Did you have any time for this yet @andrewdavidsmith ?

@andrewdavidsmith
Copy link
Collaborator

@wm75 No, but I can give a quick fix in a separate branch that will lose some speed. Otherwise I might be able to get it done in a week, but school is back in and I'm a professor. From my perspective, the issue is mostly that the fastest way to fix it (and keep the current efficiency) would take the code off a reasonable path of maintainability. However, since you're clearly still interested, that makes it worthwhile for me to do, even if I need to find a different solution down the line. So here's what I can do:

  1. Make a separate branch that fixes this issue, but likely loses some speed (I don't know how much) (time: few days).
  2. Fix the issue in an ad hoc way that will only solve this particular problem and make the code less flexible (1-2 weeks?).
  3. Refactor so that the solution in 2 becomes maintainable generally (might take a while to get to this).

I would only do (1) above if you would use it from a clone of a particular branch, which means building from source. Number (3) would obviously only be noticeable to me or whomever might be maintaining the code in the future.

@wm75
Copy link
Author

wm75 commented Sep 18, 2024

Thanks @andrewdavidsmith for providing some context here, so lets do the same from our (the Galaxy Project's) side:

On public Galaxy servers (usegalaxy.org, usegalaxy.eu, usegalaxy.org.au, etc.) FastQC is invariably among the handful of top tools with regard to total compute time (summed up over all users) that seems worth optimization.
That's why we decided this summer to take a closer look at falco to see if we should advertise it more actively as a replacement for fastqc. Currently the tool is installed as one of thousands of tools on our servers and most users will never find it, but we could promote it in various trainings, through workflows that we are publishing and a couple of other measures. Before we do any of this, however, we need to be sure that falco can replace FastQC without side-effects, and that's why I have spent some time lately on reporting bugs and issues here.

Now how does this background fit against your proposed solutions:

The separate branch you propose in 1. could certainly be useful in terms of benchmarking, but we cannot offer an unreleased version on our servers.
Your 2. proposal sounds quite ok. We are in no particular hurry with the switch so a couple of weeks are totally acceptable. Of course, you have to judge how painful the ad hoc solution is for you (could imagine that FastQC also special-cases the polyX runs though I haven't checked).
Obviously, 3. sounds like the perfect solution in the long term.

So if number 2 is within reach and you can imagine a proper release with that fix, I would say you can skip number 1 and we just wait a bit longer?

@andrewdavidsmith
Copy link
Collaborator

@wm75 Could you test an unofficial release before I release it? I think I see a better way to do this than I had previously, so it might be quicker. I don't know how many test cases you might have lined up.

Also, does the precision of the output matter (sorry if already asked)? I see above that the precision is higher in the FastQC output. I thought we had made falco equal, but it's a moving target obviously.

@andrewdavidsmith
Copy link
Collaborator

@wm75 I think it is fixed in the master branch as of commit 99ea94f and PR #71. I will need to do more testing before I make a new release, but if you can run any tests using the master, it would accelerate that process.

@wm75
Copy link
Author

wm75 commented Sep 24, 2024

Thanks so much @andrewdavidsmith!
Checking things now.

@wm75
Copy link
Author

wm75 commented Sep 24, 2024

I was about to say perfect work here, @andrewdavidsmith, then decided to run a final test with BAM input.
For some reason when the input is in BAM format all the adapter content values are very different from what they are with fastqc.
This is independent on whether reads are mapped to the fw or to the rv strand and also on whether the reads have been mapped at all, so I have no clue this time why the difference exists.

For fastq inputs, however, things look very good now and processing is still fast.

@andrewdavidsmith
Copy link
Collaborator

@wm75 Thanks very much for this -- I'll check it out asap.

@andrewdavidsmith
Copy link
Collaborator

@wm75 I have a quick fix in a separate branch bam-adapter-bugfix.

I don't have time to test right now, but if you either send me your test data (if that's ok and you have the time) or if you can test it yourself, it would accelerate.

(For documentation) The difference is in one line of code in the function that reads each record from BAM format. The indicator vector for "specific adapter found in current read" was not reset for BAM reads. Here a48cb66 you can see the change. It should be similar to line 440 in the same file highlighted here where the reset was done for non-BAM file formats.

If the logic above is correct, we have a fix. Otherwise it's a deeper issue. I don't currently have BAM format among the regression tests, so I'll try to do that when I get a chance to test. If you test sooner, let me know; if it works I'll make a PR to merge the branch, then move to release.

@wm75
Copy link
Author

wm75 commented Sep 25, 2024

@andrewdavidsmith you were spot on with your change!
The bam-adapter-bugfix branch behaves just like fastqc also for bam input now. Thanks for the fast fix!

Regarding test data for the bam case I can strip down some of my datasets and provide them to you. I don't think I'll get to it this week anymore, but I'll do it as soon as possible.

@andrewdavidsmith
Copy link
Collaborator

@wm75 No worries on the dataset. If you find yourself with a small one that won't identify any project info, I'm happy to but it alongside the existing regression test cases. In any case: I appreciate your help with this! I'll close this issue and make a release hopefully today or tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants