Modified from https://github.com/AlexOrlek/figleaf_fasta
Modifications:
- Relaxed the constraints on hard-mask letters; "?" can now be used.
- Fixed bugs that occurred when using a FASTA file with more than one sequence, with --task='exclude' or with --inverse_mask=False.
figleaf_fasta applies hard/soft masking to a FASTA file or excludes/extracts sub-sequences from a FASTA file.
- hard_mask: replace sequence with Ns or Xs
- soft_mask: convert sequence to lowercase
- exclude: exclude sub-sequences and concatenate non-excluded remainder
- extract: extract and concatenate sub-sequences
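
The four tasks can be illustrated on a plain string. The following is a minimal sketch, not figleaf_fasta's own implementation: the function name `apply_task` and its parameters are hypothetical, and it assumes 0-indexed, end-exclusive ranges, with overlapping ranges treated as a single covered region.

```python
def apply_task(seq, ranges, task="hard_mask", hard_mask_letter="N"):
    """Apply one task to a sequence string (illustrative helper, not the library API)."""
    # Collect every position covered by at least one range, clamped to the sequence.
    covered = set()
    for start, end in ranges:
        covered.update(range(max(start, 0), min(end, len(seq))))

    if task == "hard_mask":
        # Replace covered positions with the mask letter.
        return "".join(hard_mask_letter if i in covered else c for i, c in enumerate(seq))
    if task == "soft_mask":
        # Convert covered positions to lowercase.
        return "".join(c.lower() if i in covered else c for i, c in enumerate(seq))
    if task == "exclude":
        # Drop covered positions and concatenate the remainder.
        return "".join(c for i, c in enumerate(seq) if i not in covered)
    if task == "extract":
        # Keep only covered positions, concatenated in sequence order.
        return "".join(c for i, c in enumerate(seq) if i in covered)
    raise ValueError(f"unknown task: {task}")


seq = "ACGTACGTAC"
ranges = [(2, 4), (6, 8)]
for task in ("hard_mask", "soft_mask", "exclude", "extract"):
    print(task, apply_task(seq, ranges, task))
```

With these inputs, the masking tasks keep the sequence length unchanged, while exclude and extract return shorter sequences.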
Other tools for handling FASTA files (e.g. bedtools maskfasta, bedtools getfasta, pybedtools) require sequence name(s), corresponding to FASTA header names, to be specified (in addition to range information); sequence name specification allows different masking operations to be applied to different records in a multi-FASTA file.
figleaf_fasta is a simple lightweight tool that takes as input a (multi-)FASTA and range start, end positions; masking/exclusion/extraction will be applied to sequence(s) within the (multi-)FASTA, regardless of FASTA header names. This is useful if a user wants to apply the same masking to all FASTA files or all records of a multi-FASTA. A common use case is when handling reference-aligned (same-length) consensus FASTAs.
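
The idea of applying the same ranges to every record, regardless of header name, can be sketched as follows. This is an illustration under stated assumptions rather than the tool's actual code: `read_fasta` and `hard_mask_all` are hypothetical helpers, and ranges are taken to be 0-indexed and end-exclusive.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def hard_mask_all(records, ranges, letter="N"):
    """Hard-mask the same ranges in every record; headers are never consulted."""
    masked = []
    for header, seq in records:
        chars = list(seq)
        for start, end in ranges:
            for i in range(start, min(end, len(chars))):
                chars[i] = letter
        masked.append((header, "".join(chars)))
    return masked


fasta_text = ">sample1\nACGTAC\nGT\n>sample2\nTTTTGGGG\n"
for header, seq in hard_mask_all(read_fasta(fasta_text), [(0, 3)]):
    print(f">{header}\n{seq}")
```

Because the ranges are positional only, this approach makes most sense when all records share a coordinate system, as with reference-aligned consensus sequences of equal length.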
pip3 install figleaf_fasta
git clone https://github.com/AlexOrlek/figleaf_fasta.git
cd figleaf_fasta
pip3 install .
figleaf_fasta can be run from a Linux command-line as follows:
figleaf [arguments...]
figleaf_fasta can be used within a Python script as follows:
from figleaf_fasta.figleaf import figleaf
figleaf([arguments...])
Running figleaf -h on the command-line produces a summary of the command-line options:
usage: figleaf [-h] -fi FASTA_INPUT -r RANGES_PATH -fo FASTA_OUTPUT [--task TASK] [--hard_mask_letter HARD_MASK_LETTER] [--inverse_mask]
figleaf_fasta: apply hard/soft mask to FASTA file or exclude/extract sub-sequences
optional arguments:
-h, --help show this help message and exit
Input:
-fi FASTA_INPUT, --fasta_input FASTA_INPUT
Filepath to input fasta file to be masked (required)
-r RANGES_PATH, --ranges_path RANGES_PATH
Two-column tsv file with rows containing 0-indexed end-exclusive ranges to be masked/excluded/extracted (required)
Output:
-fo FASTA_OUTPUT, --fasta_output FASTA_OUTPUT
Filepath for masked output fasta file (required)
Task:
--task TASK "hard_mask","soft_mask","exclude","extract" (default: hard_mask)
Mask:
--hard_mask_letter HARD_MASK_LETTER
Letter to represent hard_mask regions (N or X) (default: N)
--inverse_mask If flag is provided, all except mask ranges will be masked
The same arguments are required when using the figleaf function within a Python script, except that start, end positions can be provided either as a filepath (ranges_path), OR as a Python list (ranges_list).
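
To show how a two-column ranges file relates to an in-memory list of positions, and what an inverse mask covers, here is a small sketch. The helper names `parse_ranges` and `invert_ranges` are hypothetical, and representing each range as a `(start, end)` tuple is an assumption; the exact element type expected by `ranges_list` is not specified here. The sketch also assumes all ranges lie within the sequence.

```python
import csv
import io


def parse_ranges(tsv_text):
    """Read two-column TSV text into a list of (start, end) integer tuples."""
    ranges = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        ranges.append((int(row[0]), int(row[1])))
    return ranges


def invert_ranges(ranges, seq_length):
    """Return the 0-indexed, end-exclusive ranges NOT covered by the input ranges."""
    inverse = []
    cursor = 0  # first position not yet accounted for
    for start, end in sorted(ranges):
        if start > cursor:
            inverse.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < seq_length:
        inverse.append((cursor, seq_length))
    return inverse


ranges = parse_ranges("2\t4\n6\t8\n")
print(ranges)
print(invert_ranges(ranges, 10))
```

For a sequence of length 10, the ranges (2, 4) and (6, 8) leave three uncovered stretches, which are the regions an inverse mask would affect.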
To generate example output in the example/ directory, run:
python figleaf_fasta.py
or bash figleaf_fasta.sh