The volume of deposited sequence files is increasing dramatically. The submitter of a sequence file is mainly responsible for annotating it. Although submitters and public repositories take care to produce accurate metadata, mistakes can happen, and these mistakes can cause problems in downstream analysis. BMD-SRA tries to classify a given sequence file into one of four categories:
- Metagenomes
- Amplicons
- Single Amplified Genomes (SAGs)
- Isolated Genomes
Developing this model involved the following stages:
- Preparing Metadata
- Downloading Sequence Files
- Feature Extraction
- Outlier Detection
- Model Development
- Model Evaluation
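
To make the Feature Extraction stage more concrete, here is a minimal sketch of turning a FASTQ file into a few numeric features. The features shown (read count, mean read length, GC fraction) and the function names `read_fastq_sequences` and `basic_features` are illustrative assumptions, not the feature set BMD-SRA actually uses. The sketch also assumes the common four-line FASTQ layout.

```python
import io


def read_fastq_sequences(handle):
    """Yield the sequence line of each record, assuming 4-line FASTQ records."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # second line of every record holds the bases
            yield line.strip().upper()


def basic_features(sequences):
    """Compute illustrative summary features (hypothetical, not BMD-SRA's own)."""
    n_reads = 0
    total_bases = 0
    gc_bases = 0
    for seq in sequences:
        n_reads += 1
        total_bases += len(seq)
        gc_bases += seq.count("G") + seq.count("C")
    if total_bases == 0:
        # Avoid division by zero on empty input.
        return {"n_reads": n_reads, "mean_length": 0.0, "gc_fraction": 0.0}
    return {
        "n_reads": n_reads,
        "mean_length": total_bases / n_reads,
        "gc_fraction": gc_bases / total_bases,
    }


# Tiny in-memory FASTQ used only for demonstration.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCCAA\n+\nIIIIII\n")
features = basic_features(read_fastq_sequences(demo))
print(features)
```

For a real file, the same functions could be applied to an open file handle instead of the in-memory demo.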
There are two ways to use the outcomes of this study: generating your own model, or applying the generated model in your project.
Well-formed documentation on preparing the training data is available. You can use the extracted features to generate your own model.
The generated model is accessible here. To use it, create an object of the BMDSRA class, which takes just two parameters:
- The path of the model.
- The path of the scaler.
Note that BMD-SRA needs access to two files, FeatureExtraction and Preprocessing. The xgboost package is also required.
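
Before running the model, it may help to confirm that these dependencies can be found. The following is a small sketch using the standard library. The helper name `missing_modules` is hypothetical, and the module paths `Codes.FeatureExtraction` and `Codes.Preprocessing` are assumptions based on the `Codes.BMDSRA` import used in this README.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ImportError:
            # Raised for a dotted name when its parent package is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


# Assumed module paths; adjust them to match your checkout of the repository.
required = ["xgboost", "Codes.FeatureExtraction", "Codes.Preprocessing"]
print("Missing modules:", missing_modules(required))
```

An empty list means all three modules were located from the current working directory and Python path.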
```python
from Codes.BMDSRA import BMDSRA

model_path = "..\\..\\resource\\4-model\\model.json"
scaler_path = "..\\..\\resource\\4-model\\scaler.gz"
model = BMDSRA(model_path, scaler_path)

seq_path = "..\\..\\resource\\2-subsra\\SRR1588386.fastq"
res = model.predict(seq_path)
print(res)
```
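
If you have several sequence files, the single-file call can be wrapped in a loop. The sketch below assumes only what this README shows, namely an object with a `predict(path)` method. The helper `predict_directory` and its `pattern` parameter are hypothetical, and the return value of `predict` is stored as-is because its format is not described here.

```python
from pathlib import Path


def predict_directory(model, directory, pattern="*.fastq"):
    """Map each matching file name in `directory` to model.predict(path)."""
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        # predict() is called with a string path, as in the single-file usage.
        results[path.name] = model.predict(str(path))
    return results
```

With a `BMDSRA` object created from the model and scaler paths, calling `predict_directory(model, Path("..") / ".." / "resource" / "2-subsra")` would classify every `.fastq` file in that folder. Building the path with `pathlib` also avoids hard-coding Windows-style separators.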
For more examples of running the model, see here.