Skip to content

walkthrough manual

MartenHoogeveen edited this page May 27, 2019 · 42 revisions

This documentation page takes you by the hand and walks with you through the code from input to output. Reading this documentation can help you to easier find bugs or make adjustments. The reason that blastn is "wrapped" is because this repo is specially made for the use in galaxy. This page only describes the python scripts itself.

Blasting with blastn_wrapper.py

It all starts with executing blastn_wrapper.py

blastn_wrapper.py -i [input] -it [input type] -of [output folder] -bt [blast task] -bm [max_target_seqs] -dbt [blast database type] -db [database] -id [identity treshold] -cov [coverage treshold] -outfmt [output format]

Parameter explanation:

param Description
-i [input] Input file in fasta format or a zip file containing fasta files
-it [input type] fasta or zip
-of [output folder] name of the folder to store the output and temp files, automatically created
-bt [blast task] blastn or megablast
-bm [max_target_seqs] maximum number hits to write to the output (-max_target_seqs or -num_alignments)
-dbt [blast database type] user (fasta file) or local (already indexed fasta file)
-db [database] path to the blast database
-id [identity treshold] identity treshold for blast (-perc_identity)
-cov [coverage treshold] coverage treshold
-outfmt [output format] blast outfmt format, ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11','custom_taxonomy']

Example:
A "real life" command to execute this script would look like something like this

blastn_wrapper.py -i test.fasta -it fasta -of /home/ubuntu/testsample_output -bt megablast -bm 1 -dbt local -db my/databases/nt -id 98 -cov 80 -outfmt custom_taxonomy

More info about -outfmt options can be found here: https://www.ncbi.nlm.nih.gov/books/NBK279684/ this piece of documentation describes only the custom_taxonomy option which is the power of this tool and makes it unique.

From here on we walk trough the code:

1. Output and temp folders are being created
This function is called in the main(), it simply creates some folders. Based on the example command the following folder will be made:

/home/ubuntu/testsample_output
/home/ubuntu/testsample_output/files
/home/ubuntu/testsample_output/fasta

def make_output_folders():
    call(["mkdir", "-p", args.out_folder.strip()])
    call(["mkdir", args.out_folder.strip() + "/files"])
    call(["mkdir", args.out_folder.strip() + "/fasta"])

2. Unpack or copy
This is the second function called in the main(). If the user choose to input a zip file it will be unpacked and the containing files will be written to /home/ubuntu/testsample_output/fasta. If the user choose to input a single fasta file it will be copied to the /home/ubuntu/testsample_output/fasta folder. The function looks like this:

def unpack_or_cp():
    if args.input_type == "zip":
        zip_out, zip_error = Popen(["unzip", args.input, "-d", args.out_folder.strip() + "/fasta"], stdout=PIPE,stderr=PIPE).communicate()
        admin_log(zip_out, zip_error)
    else:
        cp_out, cp_error = Popen(["cp", args.input, args.out_folder.strip() + "/fasta"], stdout=PIPE,stderr=PIPE).communicate()
        admin_log(cp_out, cp_error)

3. Extension check and rename
The third step is executing the extension_check_and_rename() function in main(). This function does a quick check if the input is in fasta format and renames the files. Some users tend to give file names that are not compatible with other downstream analyses. Below a simple break-down:

Create list of files present in /home/ubuntu/testsample_output/fasta and loop trough that list

files = [os.path.basename(x) for x in sorted(glob.glob(args.out_folder.strip() + "/fasta/*"))]
for x in files:

Check if the file is a valid fasta and rename. Renaming means that the characters dash, dot and space are replaced if present with an underscore. Non valid fasta files will be removed.

if check_if_fasta(args.out_folder.strip() + "/fasta/" + x):
   fastafile = os.path.splitext(x)[0].translate((string.maketrans("-. ", "___"))) + ".fa"
   if x != fastafile:
       call((["mv", args.out_folder.strip() + "/fasta/" + x, args.out_folder.strip() + "/fasta/" + fastafile]))
else:
    admin_log(error="Problems with fasta file, file will be ignored: " + x,function="extension_check")
    call(["rm", args.out_folder.strip() + "/fasta/" + x])

If the /home/ubuntu/testsample_output/fasta folder is empty the script stops

if not os.listdir(args.out_folder.strip() + "/fasta/"):
    admin_log(error="No fasta file found", function="extension_check")
    sys.exit("No fasta file found")

4. Write header to output file
Step 4 is a simple function that writes the header to the outputfile. This function is called make_head_line() and is executed from the main()

5. Create blast database if necessary
If the user used the parameter -dbt the user wants to use a fasta file as reference. If so, before blasting a blast database need to be created. This is done by the make_user_database() function.

6. blasting

The actual blasting is done by the last function in main() called blast_fasta(). Below a simple break-down:

Loop trough the fasta files, if the user's input was a fasta file the list will contain only that one file

files = [x for x in sorted(glob.glob(args.out_folder.strip() + "/fasta/*.fa"))]
#just an extra check
file_count_check(files)
for query in files:

Some log text is created, a output name is set and the blast command is created
admin_log(out="BLAST: blasting " +str(os.path.basename(query)), error=None, function="blast")
output_name = "blast_" + os.path.splitext(os.path.basename(query))[0] + ".tabular"
blast_command = create_blast_command(query, output_name)

This executes the created blast command, the output files of blast are written to /home/ubuntu/testsample_output/files.
blast_out, blast_error = Popen(blast_command, stdout=PIPE,stderr=PIPE).communicate()
admin_log(blast_out, blast_error, "blasting:"+str(os.path.basename(query)))

If the user used the parameter -cov and -outfmt custom_taxonomy there will be filtered on a coverage threshold. This is a separate function because blast does not have a coverage parameter.

if args.coverage and args.outfmt == "custom_taxonomy":
    coverage_filter(args.out_folder.strip() + "/files/" + output_name.strip())

This code gives the output files a header

        cat_out, cat_error = Popen("cat "+ args.out_folder.strip() + "/files/head_line.txt "+ args.out_folder.strip() + "/files/" + output_name.strip()+ " > "+ args.out_folder.strip() + "/files/head_" + output_name.strip(), stdout=PIPE, stderr=PIPE, shell=True).communicate()
        admin_log(cat_out, cat_error, "cat:" + str(os.path.basename(query)))
        mv_out, mv_error = Popen(["mv", args.out_folder.strip() + "/files/head_" + output_name.strip(), args.out_folder.strip() + "/files/" + output_name.strip()], stdout=PIPE, stderr=PIPE).communicate()
        admin_log(mv_out, mv_error, "mv:" + str(os.path.basename(query)))

The blasting part itself is now done. You can find the outputfiles in /home/ubuntu/testsample_output/files but we are not finished yet.

Adding taxonomy with blastn_add_taxonomy.py

After blasting you can add taxonomy information with the python script blastn_add_taxonomy.py. This script is written with a more object oriented approach. You can find the classes in the add_taxonomy_scripts folder. Let me first show you two example outputs.
Genbank (and genbank sub-selections):


More to come...