parsing error in WGS single sample workflow #1331

Open
ekiernan opened this issue Jul 15, 2024 · 7 comments

@ekiernan
Contributor

ekiernan commented Jul 15, 2024

This was posted by an external user:

I am harmonizing WGS data for the GREGoR consortium using the WGS single sample workflow (in dragen_mode) on AnVIL. I've hit a parsing error when reprocessing a subset of the consortium data (see attached log file).

In brief, the data that throws the error was similarly pre-processed and uploaded to AnVIL in CRAM format. I converted these CRAMs to uBAMs (the output passed ValidateSam) and used them as input for the WGS single sample workflow. As you can see in the log, the error comes up during alignment with DRAGMAP. It first prints "When maskLen < 15, the function ssw_align doesn't return 2nd best alignment information." and then throws a parsing error.

Could you please take a look and let me know if this looks familiar, or if you have any additional insights on how to troubleshoot it?

Update:
Are there processing steps that are incompatible with reprocessing with WARP? For instance, I know that for these samples, reads were trimmed and duplicates dropped in the initial processing. Could this be causing the error?

@ekiernan
Contributor Author

ekiernan commented Jul 15, 2024

We asked the User to:

Focus on the line "[W::sam_read1] Parse error at line 204289". That error comes from samtools, and it suggests to me that something is wrong with the input file. It's impossible to say without having the data available, but given that it prints the line number, it shouldn't be too hard to figure out what's odd about that particular line. If they don't want to run the entire file, they can run it on a subset of the reads, with the problematic line included of course.
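
One way to inspect a specific line is to dump the stream to SAM text and run a few basic format checks on it. The sketch below is not what samtools does internally; it only checks the mandatory field count, a few integer fields, SEQ/QUAL length agreement, and the `TAG:TYPE:VALUE` shape of optional fields. The function names `check_sam_line` and `check_line_number` are made up for this illustration.

```python
MANDATORY_FIELDS = 11  # QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL


def check_sam_line(line):
    """Return a list of problems found in one SAM alignment line (empty if none)."""
    problems = []
    fields = line.rstrip("\n").split("\t")
    if len(fields) < MANDATORY_FIELDS:
        problems.append(
            f"only {len(fields)} tab-separated fields, expected at least {MANDATORY_FIELDS}"
        )
        return problems
    # FLAG, POS and MAPQ must be non-negative integers.
    for name, value in (("FLAG", fields[1]), ("POS", fields[3]), ("MAPQ", fields[4])):
        if not value.isdigit():
            problems.append(f"{name} is not a non-negative integer: {value!r}")
    seq, qual = fields[9], fields[10]
    if seq != "*" and qual != "*" and len(seq) != len(qual):
        problems.append(f"SEQ length {len(seq)} differs from QUAL length {len(qual)}")
    # Optional fields must look like TAG:TYPE:VALUE with a two-character tag.
    for tag in fields[11:]:
        parts = tag.split(":", 2)
        if len(parts) != 3 or len(parts[0]) != 2 or len(parts[1]) != 1:
            problems.append(f"malformed optional field: {tag!r}")
    return problems


def check_line_number(lines, line_number):
    """Check the 1-based line `line_number` of an iterable of SAM text lines."""
    for number, line in enumerate(lines, start=1):
        if number == line_number:
            if line.startswith("@"):
                return ["line is a header line, not an alignment record"]
            return check_sam_line(line)
    raise ValueError(f"input has fewer than {line_number} lines")
```

If the reported line passes these checks, the problem is more likely in how the stream was produced or truncated than in the record itself.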

The User replied and noticed:

"Read pairs are different lengths which is why I asked about trimming. I can dig to see if this happens consistently across all failures.

Parsing error was for read -

A01139:184:HK5MGDSX7:1:1101:28013:4131 A01139:184:HK5MGDSX7:1:1101:28013:4131 77 * 0 0 * * 0 0 CTTCTCCATCCAAAGGAATGTTCAGCTCTGTGTGTTAAACTCAATCATCACAAAGTATTTTCTGAGAATGCTTCTGTCTAGATTTTATGTGAAGCTCTTCCCTTTACTACCATAGGCCTCAAAGCGCTCCAAATCTCCACTAGCCGATTCT ?BDBC@?AA@?ABBAAABA?@B@AA@A@A@@@A@ACABC>BABCBABBAB?BCCBA;BCCCABABBBCBAABCABAAABABBBCCCABAAABCBABBBCAAACDCA@CA?@CCBBBBACACDDCB;BCBACDDCBCBAC@CBCBA;DDDBB RG:Z:RGL001 XS:i:151

A01139:184:HK5MGDSX7:1:1101:28013:4131 141 * 0 0 * * 0 0 GTAGAATCGGCTAGTGGAGATTTGGAGCGCTTTGAGGCCTATGGTAGTAAAGGGAAGAGCTTCACATAAAATCTAGACAGAAGCATTCTCAGAAAATACTTTGTGATGATTGAGTTTAACACACAGAGCTGAACATTCCTTTGGATGG >@ABC;@@9A?@?@@?AAA@@AA?@@A@9@AB;@AAA@?A@A@A@@A@@BBAAAA9AAA@AB@A>AA@BBBA@A@AA>AAABA@AAB@A@AAABBBA@=ABB@@@A8@@AA?AA@BB@B>@=@=@@@6?@?@A=@AB@>@BB@5@@@@ RG:Z:RGL001 XS:i:148
"

The User also mentioned that the mismatch in read pairs isn't consistent across parsing errors.
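
To see how often mates differ in length, a small helper can group records by read name and compare SEQ lengths. This is a minimal sketch that assumes the input is SAM text (for example a slice of the uBAM dumped to text) and that it fits in memory; `mate_lengths` and `unequal_pairs` are hypothetical names.

```python
def mate_lengths(sam_lines):
    """Map each read name to the list of SEQ lengths seen for it, in input order."""
    lengths = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        lengths.setdefault(fields[0], []).append(len(fields[9]))
    return lengths


def unequal_pairs(sam_lines):
    """Return only the read names whose records have differing SEQ lengths."""
    return {
        name: lens
        for name, lens in mate_lengths(sam_lines).items()
        if len(set(lens)) > 1
    }
```

Comparing this set with the reads named in the parse errors would show whether length mismatch and failure actually coincide.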

@michaelgatzen
Contributor

Are we sure that this is the line that causes issues? I don't know if "line 204289" refers to the line number with or without header lines. What happens if they only pass that read pair into the aligner? Does it fail?

If not, does it fail if they pass plus/minus 10,000 read pairs around line 204289 into the aligner?
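
A sketch of how such a slice could be cut from SAM text, covering both interpretations of the line number. The `count_header` switch and the function name `slice_around` are assumptions made for this example; the window is counted in lines, so 10,000 read pairs of unaligned, interleaved reads would be a window of about 20,000.

```python
def slice_around(lines, target_line, window, count_header=True):
    """Return (header, records) with the records within `window` lines of `target_line`.

    If count_header is True, target_line counts every line of the file,
    header included. If False, it counts alignment records only.
    """
    header = []
    records = []
    record_index = 0
    for file_line, line in enumerate(lines, start=1):
        if line.startswith("@"):
            header.append(line)  # keep the full header so the slice stays valid SAM
            continue
        record_index += 1
        position = file_line if count_header else record_index
        if abs(position - target_line) <= window:
            records.append(line)
        elif position > target_line + window:
            break  # past the window, no need to read further
    return header, records
```

Running the aligner on both variants of the slice would rule out an off-by-header mistake in locating the read.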

@kachulis
Contributor

The other thing I'd suggest is to split apart the individual parts of the DRAGMAP --> samtools --> MergeBamAlignment pipe in SamToFastqAndDragmapAndMba and run each piece separately. So run DRAGMAP on the whole file first, and if that completes successfully, run the resulting file through samtools view. Presumably that will hit this error, given the samtools view error message in the logs you emailed. At that point it should be easier to look at the read causing the issue and see if there is some obvious problem. Note that the reads mentioned above are pre-alignment, that is, the reads that went into DRAGMAP, not what actually went into samtools, so they are probably not in the exact form the data had when it caused the error.

If there's nothing obvious there, the fastest way to find the solution is probably to run the data through samtools in a debugger. Depending on the user's familiarity with the samtools source code and/or software debugging in general, that may or may not be feasible. If they get to the debugging step but don't feel that's something they can do, we could try to take a look if we can find the time, if they are able to share the dragmap output bam with us (that's a big "if").
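
The stage-by-stage run described above could be driven by a small wrapper like the one below. It is only a sketch: `run_stages` is a hypothetical helper, and the actual DRAGMAP, samtools, and MergeBamAlignment command lines are not reproduced here; they would need to be copied from the SamToFastqAndDragmapAndMba task.

```python
import subprocess


def run_stages(stages):
    """Run each stage on its own, writing its output to a file.

    `stages` is a list of (name, argv, stdin_path, stdout_path) tuples, where
    stdin_path may be None. Returns (failed_stage_name, stderr_text) for the
    first stage that exits non-zero, or (None, "") if all stages succeed.
    """
    for name, argv, stdin_path, stdout_path in stages:
        stdin = open(stdin_path, "rb") if stdin_path else None
        try:
            with open(stdout_path, "wb") as out:
                result = subprocess.run(
                    argv, stdin=stdin, stdout=out, stderr=subprocess.PIPE
                )
        finally:
            if stdin:
                stdin.close()
        if result.returncode != 0:
            # Intermediate files from earlier stages stay on disk for inspection.
            return name, result.stderr.decode(errors="replace")
    return None, ""
```

Because each stage writes to its own file instead of a pipe, the output of the last successful stage is left behind and can be examined directly.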

@kachulis
Contributor

Ultimately, though, this will probably end up pointing us to some bug in DRAGMAP, so it will likely need either a workaround or a fix from Illumina.

@mmwheel

mmwheel commented Jul 24, 2024

Thanks for the comments and suggestions. I passed only the read pair corresponding to line 204289 to the workflow, as well as slices of the uBAM (including one with X read pairs around line 204289). All of these attempts aligned successfully. Looking at log files across uBAM shards and across samples, the parsing error comes up around line 200K, which feels like a resource limitation to me.
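
To put numbers on that pattern, the failing line numbers can be pulled out of each shard's log. The sketch below assumes the message format matches the one line quoted earlier in this thread ("[W::sam_read1] Parse error at line 204289"); `parse_error_lines` and `summarize` are hypothetical names.

```python
import re

# Pattern taken from the single log line quoted in this thread.
PARSE_ERROR = re.compile(r"\[W::sam_read1\] Parse error at line (\d+)")


def parse_error_lines(log_text):
    """Return every line number reported in samtools parse-error messages."""
    return [int(match.group(1)) for match in PARSE_ERROR.finditer(log_text)]


def summarize(logs):
    """Map shard name to reported line numbers, for shards with at least one error.

    `logs` is a mapping of shard name to the full text of that shard's log.
    """
    summary = {}
    for shard, text in logs.items():
        found = parse_error_lines(text)
        if found:
            summary[shard] = found
    return summary
```

A tight cluster of line numbers across unrelated shards would support a buffer or resource cause over a problem with any particular read.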

@mmwheel

mmwheel commented Sep 3, 2024

@kachulis @michaelgatzen I am running into a similar error when reprocessing WGS data that was initially processed with dragen v3.7.8. The files initially fail the Picard RevertSam step of the CRAM-to-uBAM conversion because of the formatting of a user-defined tag (XQ: Picard expects the XQ tag to be a string, but it is an integer). To convert the files, I set the RevertSam [--RESTORE_HARDCLIPS] flag to 'false', which ignores the XQ tag; this successfully converts the CRAM to a validated uBAM, but it subsequently fails in the SamToFastqAndDragmapAndMba task. As stated above, I suspect this is a resource limitation and can adjust disk/memory for this particular task, but I was wondering if you had run into the XQ issue and had advice on how to handle it. Many thanks,
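
For reference, a quick way to confirm how XQ is typed in a given file is to count the type codes seen on that tag. This is a minimal sketch that assumes the CRAM has been dumped to SAM text first (for example with samtools view); `tag_types` is a hypothetical helper.

```python
from collections import Counter


def tag_types(sam_lines, tag="XQ"):
    """Count the SAM type codes (e.g. 'i' or 'Z') used for `tag` across records."""
    counts = Counter()
    prefix = tag + ":"
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        # Optional fields start after the 11 mandatory columns.
        for field in line.rstrip("\n").split("\t")[11:]:
            if field.startswith(prefix):
                parts = field.split(":", 2)
                if len(parts) == 3:
                    counts[parts[1]] += 1
    return counts
```

A result containing only `i` would confirm that every XQ value in the file is an integer rather than a string.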

@jessicaway
Member

@mmwheel is this still an issue for you?
