Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

seqkit grep count instances of each sequence in id.txt separately #495

Open
2 tasks done
padpadpadpad opened this issue Nov 22, 2024 · 3 comments
Open
2 tasks done

Comments

@padpadpadpad
Copy link

Please check the items below before submitting an issue.
They help to improve the communication efficiency between us.
Thanks!

Prerequisites

  • Make sure you are using the latest version by seqkit version -u.
  • Read the usage and examples for the specific subcommand.

Describe your issue in detail

I am trying to run seqkit grep to count the instances of specific sequences in a set of reads. I would really love for grep to count the instances of each line in id.txt separately, but this is not something I can work out how to do.

I am running:

check=~/Desktop/id.txt
seqs=~/Desktop/raw_seqs.fasta

seqkit grep -f $check --pattern-file $check $seqs -s | seqkit stats

And it just returns 1000. I would love to know the number of reads matched to each sequence in id.txt individually if that is possible. Data attached below.

Archive.zip

@shenwei356
Copy link
Owner

shenwei356 commented Nov 22, 2024

seqkit locate + csvtk freq should be faster, the reads file is read only once:

# `seqkit locate -f` needs a fasta file.
# `seqkit locate` search sequences on both strands.
$ seqkit locate -M -f <(csvtk mutate -Ht id.txt | seqkit tab2fx) raw_seqs.fasta \
    | csvtk freq -t -f 3 | csvtk pretty -t
pattern           frequency
---------------   ---------
ATTTAACAAGCGTGG   102      
CGTGGTACCGGGCCG   115      
GCTAAACGAGAGCTG   783 

seqkit grep + while loop:

$ cat id.txt | while read p; do \
        echo -e $p"\t"$(seqkit grep -s --count -p $p raw_seqs.fasta); \
  done
GCTAAACGAGAGCTG 783
ATTTAACAAGCGTGG 102
CGTGGTACCGGGCCG 115

If the reads file is big, parallelize it:

$ cat id.txt | rush 'echo -e {}"\t"$(seqkit grep -s --count -p {} raw_seqs.fasta)'
GCTAAACGAGAGCTG 783
ATTTAACAAGCGTGG 102
CGTGGTACCGGGCCG 115

@padpadpadpad
Copy link
Author

Thanks @shenwei356 this is great. If i wanted to pass file names as variables, as in:

check=~/id.txt
seqs=~/raw_seqs.fasta

cat $check | rush 'echo -e {}"\t"$(seqkit grep -s --count -p {} $seqs)'

I currently get the error [ERRO] fastx: stdin not detected but I am not sure how to troubleshoot this. Probably because of something in rush I do not understand.

@shenwei356
Copy link
Owner

shenwei356 commented Nov 23, 2024

Use -v to define a variable.

cat $check | rush -v "seqs=$seqs" 'echo -e {}"\t"$(seqkit grep -s --count -p {} {seqs})' --eta

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants