Overlap output #8

Merged
merged 16 commits into from
Oct 2, 2023
17 changes: 15 additions & 2 deletions Makefile
@@ -3,6 +3,7 @@
debug \
relwithdebinfo \
release \
python-dev \
clean

GEN:="Unix Makefiles"
@@ -11,6 +12,17 @@ ifdef NINJA_PATH
GEN:="Ninja"
endif

venv: requirements.txt
@python3 -m venv ./venv
@bash -c 'source venv/bin/activate; pip install -U pip -r requirements.txt';

python-dev: venv
@bash -c \
'source venv/bin/activate; pip install -U pip -r requirements-dev.txt';

clean-venv:
rm -rf venv;

build-debug: conanfile.txt
conan install . --output-folder=$@ \
--build=missing \
@@ -51,15 +63,16 @@ build: conanfile.txt
-DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
-DCMAKE_TOOLCHAIN_FILE=conan_toolchain.cmake;

release: build
release: venv build
cd build; \
cmake --build .;

clean-release:
rm -rf build;

clean: clean-debug clean-relwithdebinfo clean-release
clean: clean-venv clean-debug clean-relwithdebinfo clean-release
@:


all: debug relwithdebinfo release
@:
59 changes: 20 additions & 39 deletions README.md
@@ -6,40 +6,11 @@ Heuristic tool for pairing up reverse complement reads in fasta/fastq files.

## Methods

Sniff loads sequences in batches displaying progress along the way. Once all sequences have been loaded into memory they are again processed in batches. Each batch is used for constructing a target index from reverse complemented reads. Original reads are then mapped against the constructed index
using a seed and chain approach with minor modifications. For each read we remember the strongest matching reverse complement read and output reverse complement pairs. Each read appears in at most one pair.

## Usage

After building:

```bash
pair up potential reverse complement reads
Usage:
sniff [OPTION...] <reads>

general options:
-h, --help print help
-v, --version print version
-t, --threads arg number of threads to use (default: 1)

heuristic options:
-a, --alpha arg shorter read length as percentage of longer read length
in pair (default: 0.10)
-b, --beta arg minimum required coverage on each read (default: 0.98)

input options:
--input arg input fasta/fastq file

mapping options:
-k, --kmer-length arg kmer length used in mapping (default: 15)
-w, --window-length arg window length used in mapping (default: 5)
-f, --frequent arg filter f most frequent kmers (default: 0.0002)

```
Sniff loads sequences in batches, displaying progress along the way. Once all sequences have been loaded into memory, they are processed again in batches. Each batch is used to construct a target index from reverse complemented reads. Original reads are then mapped against the constructed index using a seed and chain approach with minor modifications. For each read we remember the strongest matching reverse complement read and output reverse complement pairs as overlaps. Here we define an overlap as a tuple `query_name, query_start, query_end, target_name, target_start, target_end`. Those overlaps are then processed with a pretrained machine learning model, which outputs the final result in csv/tsv format for later use.
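
The overlap tuple described above can be sketched as a small Python record. This is only an illustration: the name `OverlapRecord`, the helper `parse_overlap`, and the assumption that each output line holds the six fields in that order separated by commas are all hypothetical and are not taken from the sniff sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class OverlapRecord:
    """Hypothetical mirror of the README overlap tuple."""
    query_name: str
    query_start: int
    query_end: int
    target_name: str
    target_start: int
    target_end: int


def parse_overlap(line: str, sep: str = ",") -> OverlapRecord:
    """Parse one delimited line, ASSUMING six fields in README order."""
    q_name, q_start, q_end, t_name, t_start, t_end = (
        line.rstrip("\n").split(sep)
    )
    return OverlapRecord(
        q_name, int(q_start), int(q_end),
        t_name, int(t_start), int(t_end),
    )
```

Passing `sep="\t"` would cover a tab-separated variant under the same assumptions.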

## Dependencies

### C++
- linux kernel 2.6.32 or higher
- gcc 11 or higher
- clang 11 or higher
@@ -50,24 +21,34 @@ Usage:
- earlier version should do just fine
- git is required for cmake to fetch part of internal dependencies

### Test (optional) dependencies
#### Test (optional) dependencies

- Catch2
- fetched via cmake if missing

## Build
### Python
conan2==0.0.4
joblib==1.3.2
lightgbm==4.1.0
polars==0.18.5
psutil==5.9.5
pydantic==1.10.9
scikit-learn==1.3.1

After running git clone:
## Build

```bash
git clone git@github.com:tbrekalo/sniff.git
cd sniff
make release
```

Run the following:
## Usage

From sniff root directory:

```bash
cmake -S ./sniff -B ./sniff/build -DCMAKE_BUILD_TYPE=Release -G Ninja
cmake --build build
source ./venv/bin/activate
sniff -t 32 path_to_reads.fasta > /tmp/sniff-ovlps.csv
python ./scripts/inference/lgbm_filter.py -m resources/sniff-lgbm-model.pkl -o /tmp/sniff.csv > pairs.csv
```

You can omit the `-G Ninja` option in case your host system doesn't support the `Ninja` build system.
10 changes: 2 additions & 8 deletions include/sniff/algo.h
@@ -5,22 +5,16 @@
#include <vector>

#include "sniff/config.h"
#include "sniff/overlap.h"

namespace biosoup {
class NucleicAcid;
}

namespace sniff {

struct RcPair {
std::string lhs;
std::string rhs;

auto operator<=>(RcPair const&) const noexcept = default;
};

auto FindReverseComplementPairs(
Config const& cfg, std::vector<std::unique_ptr<biosoup::NucleicAcid>> reads)
-> std::vector<RcPair>;
-> std::vector<OverlapNamed>;

} // namespace sniff
20 changes: 20 additions & 0 deletions include/sniff/overlap.h
@@ -2,22 +2,42 @@

#include <compare>
#include <cstdint>
#include <string>

namespace sniff {

struct Overlap {
std::uint32_t query_id;
std::uint32_t query_length;
std::uint32_t query_start;
std::uint32_t query_end;

std::uint32_t target_id;
std::uint32_t target_length;
std::uint32_t target_start;
std::uint32_t target_end;

double score;

friend constexpr auto operator<=>(const Overlap& lhs,
const Overlap& rhs) = default;
};

struct OverlapNamed {
std::string query_name;
std::uint32_t query_length;
std::uint32_t query_start;
std::uint32_t query_end;

std::string target_name;
std::uint32_t target_length;
std::uint32_t target_start;
std::uint32_t target_end;

friend auto operator<=>(const OverlapNamed& lhs,
const OverlapNamed& rhs) = default;
};

auto ReverseOverlap(Overlap const& ovlp) -> Overlap;
auto OverlapLength(Overlap const& ovlp) -> std::uint32_t;
auto OverlapRatio(Overlap const& ovlp) -> double;
11 changes: 6 additions & 5 deletions requirements-dev.txt
@@ -1,5 +1,6 @@
autopep8
isort
mypy
pylint
pyright
autopep8==2.0.2
isort==5.12.0
mypy==1.4.0
mypy-extensions==1.0.0
pylint==2.17.4
pyright==1.1.315
10 changes: 7 additions & 3 deletions requirements.txt
@@ -1,3 +1,7 @@
polars
pydantic
psutil
conan2==0.0.4
joblib==1.3.2
lightgbm==4.1.0
polars==0.18.5
psutil==5.9.5
pydantic==1.10.9
scikit-learn==1.3.1
Binary file added resources/sniff-lgbm-model.pkl
Binary file not shown.
File renamed without changes.
File renamed without changes.
49 changes: 44 additions & 5 deletions eval/run_sniff.py → scripts/eval/run_sniff.py
@@ -2,7 +2,7 @@
import pathlib
import sys
from time import perf_counter
from typing import List, Tuple
from typing import List

import polars as pl
from psutil import Popen
@@ -46,30 +46,69 @@ class RunInfo(BaseModel):
)


def format_sniff_args(sniff_args: SniffArgs, reads_path: str) -> List[str]:
def format_sniff_args(
sniff_args: SniffArgs, reads_path: str | pathlib.Path
) -> List[str]:
"""Transform a SniffArgs model into a flat list of consecutive key-value pairs.
E.g.
class KwArgs(BaseModel):
a: int
b: str
args = KwArgs(a=1, b='a')

Here the args object would be transformed into:
['--a', '1', '--b', 'a']

Args:
sniff_args: SniffArgs model representing cli arguments
reads_path: path to fasta/fastq files forwarded to sniff
"""
dst = [
val for k, v in sniff_args.dict().items()
for val in ('--' + k.replace('_', '-'), str(v))
if k != 'minhash'
]

dst.append(reads_path)
dst.append(str(reads_path))
return dst
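
The flattening performed by `format_sniff_args` can be tried in isolation with a plain dictionary standing in for `SniffArgs.dict()`. The sketch below is a simplified stand-in, not the PR's code: the function name `format_cli_args` and the sample option values are assumptions made for illustration.

```python
from typing import Dict, List


def format_cli_args(options: Dict[str, object], reads_path: str) -> List[str]:
    """Flatten options into ['--key', 'value', ...] followed by the reads path.

    Underscores in keys become dashes, and the 'minhash' key is skipped,
    mirroring the filter in the diff above.
    """
    dst = [
        val
        for key, value in options.items()
        if key != "minhash"
        for val in ("--" + key.replace("_", "-"), str(value))
    ]
    dst.append(str(reads_path))
    return dst


# Hypothetical option values, chosen only for the example.
print(format_cli_args({"kmer_length": 15, "threads": 4}, "reads.fasta"))
```

Keeping keys and values as separate list items, rather than joining them into one string, lets the result be passed directly to a `Popen`-style call without shell quoting.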


def create_sniff_spawn_list(
sniff_path: str | pathlib.Path,
sniff_args: SniffArgs,
reads_path: str | pathlib.Path) -> List[str]:
"""Create a list of strings passed to a Popen-like function for running sniff.

Args:
sniff_path: string path to sniff executable
sniff_args: SniffArgs model representing cli arguments
reads_path: path to fasta/fastq reads forwarded to sniff

Returns:
A list of strings in format:
[sniff_path_str, *sniff_cli_args, reads]
"""
return [
str(sniff_path), *format_sniff_args(sniff_args, str(reads_path))
]


def run_sniff(
sniff_path: str,
sniff_path: str | pathlib.Path,
sniff_args: SniffArgs,
reads_path: str | pathlib.Path) -> pl.DataFrame:
"""Runs the sniff executable, monitoring its runtime and memory consumption.
Sniff stderr and stdout are neither piped nor captured in any form.

Args:
sniff_path: string path to sniff executable
sniff_args: SniffArgs model representing cli arguments
reads_path: path to fasta/fastq reads forwarded to sniff

Returns:
A polars DataFrame with runtime information.
See DF_COLS for more details.
"""

with Popen(create_sniff_spawn_list(
sniff_path, sniff_args, reads_path)
@@ -92,7 +131,7 @@ def run_sniff(
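
The body of `run_sniff` is collapsed in this view, so the following is not the PR's implementation. It is a standard-library sketch of the same idea (time a child process and read its peak memory), swapping psutil for `subprocess` and `resource.getrusage`. The name `run_and_measure` is hypothetical, `resource` is Unix-only, and `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS.

```python
import resource
import subprocess
import sys
from time import perf_counter
from typing import List, Tuple


def run_and_measure(spawn_list: List[str]) -> Tuple[int, float, int]:
    """Run a command, returning (return code, seconds elapsed, peak RSS).

    stdout and stderr are inherited from the parent, so nothing is
    piped or captured.
    """
    start = perf_counter()
    with subprocess.Popen(spawn_list) as proc:
        return_code = proc.wait()
    elapsed_s = perf_counter() - start

    # Peak resident set size over all child processes waited for so far;
    # units are platform dependent (see the note above).
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return return_code, elapsed_s, peak_rss


if __name__ == "__main__":
    print(run_and_measure([sys.executable, "-c", "pass"]))
```

Unlike periodic polling, `getrusage` only reports a single peak after the child exits, so it cannot produce a memory-over-time profile.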

if __name__ == "__main__":
parser = argparse.ArgumentParser(
prog='rc_stats',
prog='run_sniff',
description='run sniff and record runtime information'
)
