KMC abrubtly not working #222

Open
jrostudent opened this issue Sep 28, 2023 · 13 comments
Comments

@jrostudent

For most of the time I've been using kmc, it has worked with few issues. However, I recently had to start working with .fastq data stored in a directory in such a way that I need to use absolute paths (which I believe to be the root cause of the issue anyway).

This has caused me to get two kinds of errors, either:
A)
[jrosen5@c005:~/applied_proj/sandbox]$ bin/kmc -k27 -ci50 "/scratch/jrosen5/applied_proj/sandbox/data/PRCRreads/SRR5088929_1.fastq.gz" histogram . -sm


Stage 1: 94%Killed

or
B)
Error: unknown exception

@marekkokot
Contributor

Hi,

I know there are a few issues, like the irritating unknown exception (which in most cases is in fact not unknown, but wrongly propagated, so the user sees this nonsense message). We have fixed this, but the fix is not published yet. How much RAM does your machine have?
Also, the -sm switch should come before the input path, so:

bin/kmc -k27 -sm -ci50 "/scratch/jrosen5/applied_proj/sandbox/data/PRCRreads/SRR5088929_1.fastq.gz" histogram . 

Do you really need this switch? The dataset seems quite small, so KMC will probably not use more than the default 12GB of RAM anyway.
I don't think the absolute path could cause these issues. KMC has been used with absolute paths for quite a long time, and I have never encountered or heard of any problem arising from them (but of course I am not saying it is impossible).
Anyway, let me know how much RAM you have, or, just in case, try running it with a small amount of RAM using -m2 (2GB).

@jrostudent jrostudent reopened this Oct 4, 2023
@jrostudent
Author

Sorry, I didn't mean to close and reopen, just to reply. I fixed the issue where the process was being killed at 97%, but I have yet to find what is causing the unknown exception error. Interestingly, it only throws the unknown exception with one particular file. I am using fastp for filtering and paired-end merging: all reads that can't be merged are sent to two files reflecting the original files, while the reads that are merged are sent to a third file.

When I run kmc on either unmerged file it operates without issue, but when I run it on the file of merged reads it throws the unknown exception error. I used head -50 to manually inspect the files for differences in structure, but they appear to be the same. What steps would you suggest I take to solve this? Thank you for your response, by the way!

@marekkokot
Contributor

Hi, could you share these files? I will try to reproduce.

@jrostudent
Author

The unmerged one that works is a 10GB file, and I'd have to retrieve it from the HPC. Would you be OK with me just sharing the merged file that doesn't work? It's only about 67MB.

@marekkokot
Contributor

marekkokot commented Oct 6, 2023

Sure, a smaller file causing issues is even better :)

@jrostudent
Author

I had to zip it because GitHub said it doesn't support the file type, but it should be in .fastq format when unzipped. Thank you!
fastpPE12.zip

@marekkokot
Contributor

Thanks! I think I know the reason.
Here is the very first record:

@SRR5088818.367 HWI:1:X:1:1101:14228:2641 length=51 merged_51_16
CCTAACTTCAACTCACAGAAGATTGTGGCAAACACCCATTAACTTTTCTACACAACTACCATTTCAA
+SRR5088818.367 HWI:1:X:1:1101:14228:2641 length=51
@@@FFDDEHGFFHJEHIGGIEGAHIHHGGJIGGGIJEHJJJJJIIGCCHIBFIHHHFDHDDDDB?@@

Note that the quality header differs from the sequence header, which I believe is not allowed in the FASTQ format. The quality header should be either empty (just +) or the same as the sequence header. Wikipedia seems to confirm this, and other sources say the same: https://en.wikipedia.org/wiki/FASTQ_format: "Field 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again."
KMC checks whether this condition is met (I am not sure if every sequence and quality header pair is checked, but some of them certainly are). If it is not, KMC fails (of course, the error message should be different).
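
As a rough illustration of this kind of check (a sketch only, not KMC's actual code), the following shell function flags records whose quality header is non-empty and differs from the sequence header. The name `check_fastq_headers` is hypothetical, and the sketch assumes an uncompressed file with exactly four lines per record.

```shell
# Hypothetical helper (not part of KMC): report records whose '+' line
# carries a header that differs from the '@' line of the same record.
# Assumes uncompressed FASTQ with exactly four lines per record.
check_fastq_headers() {
  awk '
    BEGIN { bad = 0 }
    NR % 4 == 1 { seq_id = substr($0, 2) }   # text after "@"
    NR % 4 == 3 {
      qual_id = substr($0, 2)                # text after "+"
      if (qual_id != "" && qual_id != seq_id) {
        printf "record %d: quality header differs from sequence header\n", (NR + 1) / 4
        bad = 1
      }
    }
    END { exit bad }                         # non-zero if any mismatch was seen
  ' "$1"
}
```

Running `check_fastq_headers merged.fastq` (with `merged.fastq` standing in for your own file) prints one line per offending record and returns a non-zero exit status if any were found.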

I think it's best to keep only the + sign in the quality header line, avoiding redundancy in the data.
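
One possible way to do that, assuming an uncompressed FASTQ file with exactly four lines per record, is a small awk filter. The function name `bare_plus_headers` is made up for this sketch and is not part of KMC or fastp.

```shell
# Hypothetical helper (not part of KMC or fastp): rewrite every quality
# header (line 3 of each four-line record) to a bare "+".
# Reads uncompressed FASTQ on stdin and writes the result to stdout.
bare_plus_headers() {
  awk 'NR % 4 == 3 { print "+"; next } { print }'
}
```

For example, `bare_plus_headers < merged.fastq > merged.plus.fastq`, where both file names are placeholders.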

In summary, I think KMC's behaviour is OK (except for the error message).
Let me know what you think.

@jrostudent
Author

Thank you so much!

@jrostudent
Author

Hey, I wrote a sed command:

sed -i 's/merged_[0-9]_[0-9]//g' "$mergeReadout"

to delete the string that differentiated the two headers. For context, here is a snippet of the problematic .fastq file now.

@SRR5088929.119.1 119 length=51
GAAAGAACATAGTTTTATTTCCGTGAACTATACTTTTTCCCCAGAAGCTCTAATAATTGGCATTAAAAAA
+SRR5088929.119.1 119 length=51
CCCCCGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGEGGGGGGGGGGGGEGFGGGGGGGGGGGGGGCBBCC

As you can see the two headers are now identical, however kmc still throws the unknown exception error during processing.

@jrostudent
Author

Update: I modified the command to include unmerged reads (which basically increases the amount of data in the input file for kmc). This seemed like an improvement, because instead of throwing the unknown exception error it actually started stage 1, then threw the following error:

Stage 1: 84%
Stage 1: 85%Error: some error while reading fastq file, please contact authors (kmc_core/fastq_reader.cpp: 844)
Error: Cannot open file histogram.kmc_pre
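
Since the failure only appears partway through the file, a whole-file structural check may help locate the offending record. The sketch below is an assumption about what could be wrong, not a reproduction of the check in kmc_core/fastq_reader.cpp, and the function name `first_bad_fastq_record` is hypothetical. It reads uncompressed FASTQ on standard input and reports the first record that breaks the four-line layout or whose sequence and quality strings differ in length.

```shell
# Hypothetical helper (not part of KMC): print the first record that breaks
# the basic four-line FASTQ layout, then stop with a non-zero exit status.
first_bad_fastq_record() {
  awk '
    NR % 4 == 1 && $0 !~ /^@/ {
      printf "record %d: line 1 does not start with @\n", (NR + 3) / 4
      bad = 1; exit 1
    }
    NR % 4 == 2 { seq_len = length($0) }     # remember the sequence length
    NR % 4 == 3 && $0 !~ /^\+/ {
      printf "record %d: line 3 does not start with +\n", (NR + 1) / 4
      bad = 1; exit 1
    }
    NR % 4 == 0 && length($0) != seq_len {
      printf "record %d: sequence and quality lengths differ\n", NR / 4
      bad = 1; exit 1
    }
    END {
      if (bad) exit 1                        # a problem was already reported
      if (NR % 4 != 0) { print "file ends in the middle of a record"; exit 1 }
    }
  '
}
```

For a gzipped file, something like `gzip -dc reads.fastq.gz | first_bad_fastq_record` should work, where `reads.fastq.gz` is a placeholder.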

@jrostudent
Author

@marekkokot, hey I just wanted to update you on the status of the error:

  1. I took your advice and used a sed command to edit the fastq file so that it matches standard fastq format, by removing the merged_*_* string from the sequence header. Here is a code snippet including the sed command, the gzip after it, and the kmc command used.

sed -i 's/merged_[0-9]_[0-9]//g' "$mergeReadout"

gzip "$mergeReadout"

kmc -k27 -ci50 "$mergeReadout" histogram .

  2. I repeatedly got the error:
    Stage 1: 84%Error: some error while reading fastq file, please contact authors
    (kmc_core/fastq_reader.cpp: 844)

  3. So I made a bash script to inspect the region around 84/85% of the file and found it to meet standard fastq format. Unfortunately, after all these adjustments I am still unable

@marekkokot
Contributor

It seems you are not removing the space before "merged".
Try this:

sed -i 's/ merged_[0-9]*_[0-9]*//g' "$mergeReadout"
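
To illustrate the difference, the sketch below applies both patterns to the header of the first record quoted earlier in this thread. `[0-9]` matches exactly one digit, so the original pattern does not match `merged_51_16` at all, while `[0-9]*` matches digit runs of any length and the leading space removes the separator as well. The wrapper names `strip_merged_old` and `strip_merged_new` are invented for this example.

```shell
# Illustrative wrappers (hypothetical names) around the two sed patterns
# discussed in this thread. Both read from stdin and write to stdout.
strip_merged_old() { sed 's/merged_[0-9]_[0-9]//g'; }     # one digit on each side of "_"
strip_merged_new() { sed 's/ merged_[0-9]*_[0-9]*//g'; }  # leading space, any number of digits

header='@SRR5088818.367 HWI:1:X:1:1101:14228:2641 length=51 merged_51_16'

printf '%s\n' "$header" | strip_merged_old
printf '%s\n' "$header" | strip_merged_new
```

The first command prints the header unchanged; the second prints it without the ` merged_51_16` suffix, so it matches the text after `+` on the quality header line.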
