seqkit grep count instances of each sequence in id.txt separately #495

padpadpadpad · 2024-11-22T14:42:22Z

Describe your issue in detail

I am trying to use seqkit grep to count how many reads in a set contain each of several specific sequences. I would like grep to count the matches for each line in id.txt separately, but I cannot work out how to do this.

I am running:

check=~/Desktop/id.txt
seqs=~/Desktop/raw_seqs.fasta

seqkit grep -f $check --pattern-file $check $seqs -s | seqkit stats

This just reports 1000 sequences in total. I would like to know the number of reads matching each sequence in id.txt individually, if that is possible. The data are attached below.

Archive.zip

shenwei356 · 2024-11-22T15:00:19Z

seqkit locate + csvtk freq should be faster, because the reads file is read only once:

# `seqkit locate -f` needs a FASTA file.
# `seqkit locate` searches both strands.
$ seqkit locate -M -f <(csvtk mutate -Ht id.txt | seqkit tab2fx) raw_seqs.fasta \
    | csvtk freq -t -f 3 | csvtk pretty -t
pattern           frequency
---------------   ---------
ATTTAACAAGCGTGG   102      
CGTGGTACCGGGCCG   115      
GCTAAACGAGAGCTG   783

seqkit grep + while loop:

$ while read -r p; do \
        printf '%s\t%s\n' "$p" "$(seqkit grep -s --count -p "$p" raw_seqs.fasta)"; \
  done < id.txt
GCTAAACGAGAGCTG 783
ATTTAACAAGCGTGG 102
CGTGGTACCGGGCCG 115

If the reads file is big, parallelize it:

$ cat id.txt | rush 'echo -e {}"\t"$(seqkit grep -s --count -p {} raw_seqs.fasta)'
GCTAAACGAGAGCTG 783
ATTTAACAAGCGTGG 102
CGTGGTACCGGGCCG 115
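
The difference between one call with the whole pattern file and one call per pattern can be seen on a toy example. This is only a sketch: the reads and patterns are invented for illustration (they are not from Archive.zip), and plain grep stands in for seqkit, so it sees only the forward strand and assumes each sequence sits on a single line.

```shell
#!/bin/sh
# Toy data, invented for illustration.
tmp=$(mktemp -d)
printf '%s\n' AAAC GGGT > "$tmp/id.txt"
cat > "$tmp/reads.fasta" <<'EOF'
>r1
TTAAACTT
>r2
CCGGGTCC
>r3
AAACGGGT
>r4
TTTTTTTT
EOF

# One call with all patterns: counts reads that match ANY pattern.
# grep -v '^>' drops the header lines; -F treats patterns as fixed strings.
any=$(grep -v '^>' "$tmp/reads.fasta" | grep -c -F -f "$tmp/id.txt")

# One call per pattern: a separate count for each line of id.txt.
per=$(while read -r p; do
        printf '%s\t%s\n' "$p" "$(grep -v '^>' "$tmp/reads.fasta" | grep -c -F -e "$p")"
      done < "$tmp/id.txt")

echo "any pattern: $any"
printf '%s\n' "$per"
rm -rf "$tmp"
```

Read r3 contains both patterns, so the per-pattern counts (2 and 2) add up to more than the "any pattern" count (3). A single total therefore cannot be split back into per-pattern numbers.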

padpadpadpad · 2024-11-23T10:59:06Z

Thanks @shenwei356, this is great. If I wanted to pass the file names as variables, as in:

check=~/id.txt
seqs=~/raw_seqs.fasta

cat $check | rush 'echo -e {}"\t"$(seqkit grep -s --count -p {} $seqs)'

I currently get the error "[ERRO] fastx: stdin not detected", but I am not sure how to troubleshoot it. It is probably caused by something in rush that I do not understand.

shenwei356 · 2024-11-23T11:09:57Z

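Use -v to define a variable for rush. Inside the single quotes your shell does not expand $seqs, so seqkit grep gets no input file and falls back to stdin.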

cat $check | rush -v "seqs=$seqs" 'echo -e {}"\t"$(seqkit grep -s --count -p {} {seqs})' --eta
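
The quoting behaviour behind the error can be reproduced without rush or seqkit. This is a minimal sketch, assuming that `sh -c` is a fair stand-in for the child shell that runs each rush job. The path is the placeholder from this thread and does not need to exist.

```shell
#!/bin/sh
# Assumption: `sh -c` stands in for the child shell that runs each job.
seqs="$HOME/raw_seqs.fasta"   # assigned but NOT exported

# Single quotes: the parent shell leaves $seqs alone,
# and the child shell has no such variable, so it expands to nothing.
unexported=$(sh -c 'echo "file=[$seqs]"')

# Double quotes: the parent shell substitutes the value before the child starts.
expanded=$(sh -c "echo \"file=[$seqs]\"")

# Exported: the child shell inherits the variable and expands it itself.
export seqs
exported=$(sh -c 'echo "file=[$seqs]"')

echo "$unexported"
echo "$expanded"
echo "$exported"
```

The first line printed has an empty file name, which matches seqkit grep receiving no file argument. Exporting the variable or changing the outer quotes should also work, but -v keeps the single-quoted rush command unchanged.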
