Home
borf is a command-line Python tool for translating RNA sequences into open reading frames (ORFs). The defaults for borf are set to provide the best-fitting ORF translations from de novo assembled transcripts, such as those generated by Trinity. In addition to the ORF predictions, borf reports the location of each ORF within its transcript in a separate file.
To install from PyPI:
pip install borf
To run, call borf and provide it with a fasta file as the first argument.
borf test.fa
This will run borf with the default settings and produce two output files: test.pep and test.txt.
The .pep file contains the predicted ORF sequences in fasta format, and the .txt file contains details about the predicted ORFs in the following columns:
column name | description |
---|---|
orf_id | id assigned to the predicted orf sequences in the corresponding .pep file |
transcript_id | transcript id (from the input .fa file) |
frame | ORF reading frame (1-3) |
strand | ORF strand (+/-) |
seq_length_nt | length of the ORF in nt |
start_site_nt | position of the first nucleotide of the first predicted amino acid |
stop_site_nt | position of the last nucleotide of the last predicted amino acid |
utr3_length_nt | length of the 3' UTR in nt |
start_site_aa | position of the ORF's first predicted amino acid |
stop_site_aa | position of the ORF's stop site (*) or the last predicted amino acid (when no stop codon is found) |
orf_length_aa | length of the ORF in aa |
first_aa_MET | is the first amino acid of the ORF a Methionine (M/MET)? (M/ALT) |
final_aa_stop | is the last amino acid of the ORF a STOP (*)? (STOP/ALT) |
orf_class | orf class. One of 'complete' (first aa is M, and last is *); 'incomplete_5prime' (first aa is not M, and last is *); 'incomplete_3prime' (first aa is M, and last is not *); or 'incomplete' (first aa is not M, and last is not *) |
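
For downstream filtering, the .txt file can be loaded with Python's standard library. The sketch below assumes the file is tab-delimited with a header row matching the column names above; the delimiter is an assumption, the function names are hypothetical, and the sample rows are made up for illustration.

```python
import csv
import io


def read_borf_table(handle, delimiter="\t"):
    """Read a borf .txt table into a list of dicts keyed by column name.

    Assumption: the file is delimited text with a header row. The tab
    delimiter is a guess, so adjust it if your file differs.
    """
    return list(csv.DictReader(handle, delimiter=delimiter))


def complete_orfs(rows, min_length_aa=100):
    """Keep rows classed as 'complete' with at least min_length_aa residues."""
    return [
        row for row in rows
        if row["orf_class"] == "complete"
        and int(row["orf_length_aa"]) >= min_length_aa
    ]


# Made-up example rows using a subset of the documented columns.
example = io.StringIO(
    "orf_id\ttranscript_id\torf_length_aa\torf_class\n"
    "tx1.orf1\ttx1\t250\tcomplete\n"
    "tx2.orf1\ttx2\t120\tincomplete_5prime\n"
)
rows = read_borf_table(example)
kept = complete_orfs(rows)
```

With a real file, pass an open file handle (for example `open("test.txt", newline="")`) instead of the in-memory example.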
We provide ORF classes because transcript annotations, particularly for de novo assembled transcripts, may be incomplete and missing part of the 5' or 3' end. Reporting the class allows ORFs that consist of an uninterrupted string of amino acids, but do not necessarily have a start or stop codon, to still be returned and used in downstream applications such as functional domain annotation.
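The four classes follow directly from the first and last characters of the translated sequence. A minimal sketch of that rule, assuming the protein is given as a string with `*` marking a stop (the function name `classify_orf` is hypothetical and not part of borf):

```python
def classify_orf(protein):
    """Return the ORF class for a translated sequence ('*' marks a stop)."""
    if not protein:
        raise ValueError("empty sequence")
    starts_with_met = protein[0] == "M"
    ends_with_stop = protein[-1] == "*"
    if starts_with_met and ends_with_stop:
        return "complete"
    if ends_with_stop:
        # Has a stop but no start: the 5' end is missing.
        return "incomplete_5prime"
    if starts_with_met:
        # Has a start but no stop: the 3' end is missing.
        return "incomplete_3prime"
    return "incomplete"
```

For example, `classify_orf("MKT*")` gives `'complete'`, while `classify_orf("KT*")` gives `'incomplete_5prime'`.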
borf has several options that can be changed to suit your data. To display them all, use the -h or --help flag.
$ borf --help
usage: borf [-h] [-o OUTPUT_PATH] [-s] [-a] [-l ORF_LENGTH]
            [-u UPSTREAM_INCOMPLETE_LENGTH]
            fasta_file

Get orf predicitions from a nucleotide fasta file

positional arguments:
  fasta_file            fasta file to predict ORFs

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT_PATH, --output_path OUTPUT_PATH
                        path to write output files. [OUTPUT_PATH].pep and
                        [OUTPUT_PATH].txt (default: input .fa file name)
  -s, --strand          Predict orfs for both strands
  -a, --all_orfs        Return all ORFs for each sequence longer than the
                        cutoff
  -l ORF_LENGTH, --orf_length ORF_LENGTH
                        Minimum ORF length (AA). (default: 100)
  -u UPSTREAM_INCOMPLETE_LENGTH, --upstream_incomplete_length UPSTREAM_INCOMPLETE_LENGTH
                        Minimum length (AA) of uninterupted sequence upstream
                        of ORF to be included for incomplete_5prime
                        transcripts (default: 50)
To change the default output file locations (same as the input file, with .fa replaced by .pep or .txt), use the -o flag to provide a base file name (and path). For example, the command
borf test.fa -o test_borf
produces test_borf.pep and test_borf.txt.
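When borf is run from a script or pipeline, it can help to assemble the command line programmatically. The sketch below uses only the flags listed in the help text above; the helper `build_borf_command` and its parameter names are hypothetical, and actually running the command requires borf to be installed.

```python
import shlex


def build_borf_command(fasta_file, output_path=None, both_strands=False,
                       all_orfs=False, orf_length=None,
                       upstream_incomplete_length=None):
    """Build a borf argument list from the options in the help text."""
    cmd = ["borf", fasta_file]
    if output_path is not None:
        cmd += ["-o", output_path]
    if both_strands:
        cmd.append("-s")
    if all_orfs:
        cmd.append("-a")
    if orf_length is not None:
        cmd += ["-l", str(orf_length)]
    if upstream_incomplete_length is not None:
        cmd += ["-u", str(upstream_incomplete_length)]
    return cmd


cmd = build_borf_command("test.fa", output_path="test_borf",
                         both_strands=True, all_orfs=True)
print(shlex.join(cmd))  # borf test.fa -o test_borf -s -a

# To run it (requires borf on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting issues with file paths that contain spaces.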
To return all predicted ORFs longer than the minimum ORF length, use the -a flag.
To predict ORFs on both strands, use the -s flag. Note that unless the -a flag is also set, only the single longest ORF is reported for each transcript, not one prediction per strand.
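The interaction between -s and -a can be pictured as a filter over the candidate ORFs found on both strands of one transcript. This is an illustrative sketch, not borf's own code: the function name, the tuple layout, and the choice to exclude the trailing `*` from the length are all assumptions.

```python
def select_orfs(candidates, all_orfs=False, min_length_aa=100):
    """Pick which candidate ORFs to report for one transcript.

    candidates: list of (strand, frame, protein) tuples.
    Assumption: ORF length is counted without a trailing stop ('*').
    """
    def length(candidate):
        return len(candidate[2].rstrip("*"))

    passing = [c for c in candidates if length(c) >= min_length_aa]
    if all_orfs or not passing:
        # With -a, every ORF above the cutoff is returned.
        return passing
    # Without -a, only the single longest ORF is kept, whichever strand
    # it comes from.
    return [max(passing, key=length)]
```

With candidates on both strands, the default returns one ORF in total, so a shorter ORF on the other strand is dropped unless `all_orfs=True`.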
The default minimum ORF length is 100 amino acids. This can be changed using the -l argument and providing an integer.
The default upstream incomplete length is 50 amino acids. This can be changed using the -u argument and providing an integer.
NOTE: We do not recommend setting this lower than 50 AA. A large proportion of transcripts have uninterrupted upstream AAs up to 40 AA long (average ~25 AA in well-annotated human transcripts) that do not code for protein sequence.
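To see why a low threshold is risky, consider one plausible reading of the -u option: within a stop-free run of amino acids, the ORF starts at the first M unless at least `upstream_incomplete_length` residues precede it, in which case the whole run is kept as an incomplete_5prime ORF. The sketch below encodes that reading; it is an assumption about the behaviour, not borf's implementation, and `choose_orf_start` is a hypothetical name.

```python
def choose_orf_start(run, upstream_incomplete_length=50):
    """Return the index in a stop-free run where the ORF is taken to start.

    Assumed rule (not confirmed against borf's source):
    - if at least upstream_incomplete_length residues precede the first M
      (or there is no M at all), keep the run from index 0, which gives an
      incomplete_5prime ORF;
    - otherwise start at the first M;
    - return None if there is no M and the run is too short.
    """
    first_met = run.find("M")
    upstream = len(run) if first_met == -1 else first_met
    if upstream >= upstream_incomplete_length:
        return 0
    return first_met if first_met != -1 else None


# 25 non-coding residues upstream of the true start, as is common.
run = "A" * 25 + "M" + "K" * 10
default_start = choose_orf_start(run)                               # starts at the M
low_threshold_start = choose_orf_start(run, upstream_incomplete_length=20)  # keeps the upstream residues
```

Under this reading, a threshold of 20 would extend the ORF into the 25 upstream residues and mislabel it as incomplete_5prime, whereas the default of 50 starts it at the methionine.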