SENSE

Siamese neural network for sequence embedding

Code Organization

This repo contains five major components:

siamese.ipynb
select_training_data
DNA_Align
tools
demo

siamese.ipynb is a notebook that contains the model definition and implementation in PyTorch. select_training_data is the C++ implementation of the active landmark selection algorithm, which prepares training data for SENSE. DNA_Align contains the source for the binary used to evaluate the embedding results. tools contains some useful Python utilities. demo contains data for demonstration.

Requirements

Clang
CMake
Boost
PyTorch
CUDA

CUDA is needed only for GPU acceleration, which we highly recommend. Installation instructions for these tools can be found on their official websites.

Compile

To compile /DNA_Align/:

cd DNA_Align
mkdir build && cd build
cmake .. && make

The executable binary should be under DNA_Align/build/src/.

To compile /select_training_data/:

cd select_training_data
mkdir build && cd build
cmake .. && make

The executable binary should be under select_training_data/build/src/.

Demo

The demo dataset contains 10,000 sequences sampled from the RT988 dataset. In this demo, we sample 500 of the 10,000 sequences and compute their pairwise distances for evaluation. This step may take a long time because it requires sequence alignment. To prepare the evaluation data:

python tools/sample.py -i demo/seqs.fa -o demo/eval.fa -s 0 -n 500
python tools/pair.py -i demo/eval.fa -o demo/eval_pair.fa
./DNA_Align/build/src/nw demo/eval_pair.fa demo/eval_aligned.fa
python tools/dist.py -i demo/eval_aligned.fa -o demo/eval_dist.txt
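
To give a feel for what the alignment and distance steps compute, here is a minimal pure-Python sketch. It assumes that the nw binary performs Needleman-Wunsch global alignment (suggested by its name) and that the distance of an aligned pair is the fraction of mismatched alignment columns. The scoring values and the function names needleman_wunsch and aligned_distance are hypothetical and are not taken from the repo. It also shows why this step is slow if all unordered pairs of the 500 sequences are aligned.

```python
from math import comb


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return the two gapped strings.

    The scoring values are illustrative assumptions, not the repo's.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def aligned_distance(x, y):
    """Fraction of alignment columns that differ (gaps count as differences)."""
    if len(x) != len(y) or not x:
        raise ValueError("aligned sequences must be non-empty and equal length")
    return sum(p != q for p, q in zip(x, y)) / len(x)


# Number of unordered pairs among 500 sequences.
num_pairs = comb(500, 2)

aligned_a, aligned_b = needleman_wunsch("ACGT", "AGT")
distance = aligned_distance(aligned_a, aligned_b)
```

Each alignment costs time proportional to the product of the two sequence lengths, so the number of pairs dominates the running time of this step.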

In this demo, we prepare 20 * 500 training sequence pairs and shuffle them. To select training data:

./select_training_data/build/src/select_training_data -f demo/seqs.fa -s demo/seqs_ids.txt -p demo/pair.fa -d demo/dist.txt -a 1 -t 20 -n 500
python tools/shuffle.py -p demo/pair.fa -d demo/dist.txt -s 0
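
The shuffle step has to keep every pair next to its distance. The sketch below illustrates one way to do that with a seeded permutation applied to both lists. It is an assumption about what tools/shuffle.py does, based only on its -p, -d and -s arguments; the function name shuffle_in_unison and the toy data are hypothetical.

```python
import random


def shuffle_in_unison(pairs, dists, seed=0):
    """Apply the same seeded permutation to pairs and dists."""
    if len(pairs) != len(dists):
        raise ValueError("pairs and dists must have the same length")
    order = list(range(len(pairs)))
    random.Random(seed).shuffle(order)  # a private RNG keeps the result reproducible
    return [pairs[i] for i in order], [dists[i] for i in order]


# Toy data: each pair is labelled with the index of its distance.
pairs = [("seq%d_a" % k, "seq%d_b" % k) for k in range(6)]
dists = [k / 10 for k in range(6)]
shuffled_pairs, shuffled_dists = shuffle_in_unison(pairs, dists, seed=0)

# With -t 20 and -n 500 the demo prepares 20 * 500 training pairs.
num_training_pairs = 20 * 500
```

Using a fixed seed means a rerun produces the same training order, which makes experiments easier to compare.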

Here is the help for the options:

options.add_options()
  ("f,fasta_file", "input fasta file", cxxopts::value<std::string>())
  ("s,seq_ids_file", "output seq ids file", cxxopts::value<std::string>())
  ("p,pairs_file", "output pairs file", cxxopts::value<std::string>())
  ("d,dists_file", "output dists file", cxxopts::value<std::string>())
  ("a,abundance_threshold", "abundance threshold", cxxopts::value<std::size_t>())
  ("t,target_num_landmarks", "target number of landmarks", cxxopts::value<std::size_t>())
  ("n,num_random_sample", "number of random sample", cxxopts::value<std::size_t>())

Run the Jupyter notebook (siamese.ipynb) to define, train, and evaluate the model.
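
For readers who want the general idea before opening the notebook, the following is a minimal PyTorch sketch of a Siamese setup for sequence embedding. It is not the architecture in siamese.ipynb. The one-hot encoding, the single convolution layer, the squared Euclidean distance, the mean squared error loss, and all names (one_hot, Encoder, Siamese) are assumptions chosen for illustration. The point it shows is that both sequences pass through the same encoder and the distance between the two embeddings is trained to match a target distance.

```python
import torch
import torch.nn as nn

ALPHABET = "ACGT"


def one_hot(seq, length):
    """Encode a DNA string as a (4, length) float tensor, zero-padded."""
    x = torch.zeros(len(ALPHABET), length)
    for pos, base in enumerate(seq[:length]):
        idx = ALPHABET.find(base)
        if idx >= 0:  # unknown characters stay all-zero
            x[idx, pos] = 1.0
    return x


class Encoder(nn.Module):
    """Maps a one-hot sequence to a fixed-size embedding vector."""

    def __init__(self, embed_dim=16):
        super().__init__()
        self.conv = nn.Conv1d(len(ALPHABET), 32, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(32, embed_dim)

    def forward(self, x):
        h = torch.relu(self.conv(x))
        h = self.pool(h).squeeze(-1)  # (batch, 32)
        return self.fc(h)


class Siamese(nn.Module):
    """Shares one encoder between both inputs and returns their distance."""

    def __init__(self, embed_dim=16):
        super().__init__()
        self.encoder = Encoder(embed_dim)

    def forward(self, a, b):
        ea, eb = self.encoder(a), self.encoder(b)
        return (ea - eb).pow(2).sum(dim=1)  # squared Euclidean distance


torch.manual_seed(0)
model = Siamese()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Two toy pairs with made-up target distances.
a = torch.stack([one_hot("ACGTACGT", 8), one_hot("AAAACCCC", 8)])
b = torch.stack([one_hot("ACGTACGA", 8), one_hot("GGGGTTTT", 8)])
target = torch.tensor([0.125, 1.0])

# One training step: predicted embedding distance vs. target distance.
pred = model(a, b)
loss = nn.functional.mse_loss(pred, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the encoder weights are shared, a trained encoder can embed each sequence once, after which distances are computed directly between embeddings without any alignment.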
