Thomas M A Shafee, Andrew J Robinson, Nicole van der Weerden, Marilyn A Anderson
Department of Biochemistry, La Trobe Institute for Molecular Science, La Trobe University, Melbourne, Australia
College of Science, Health and Engineering, La Trobe University, Melbourne, Australia
Life Sciences Computation Centre, Victorian Life Sciences Computation Initiative, Melbourne, Australia
Barcoding function:
cysbar.py -b POS1[:BARCODE1] [-b POS2[:BARCODE2] [-b ...]] [-B BARCODE.txt] [-S SEPARATOR] [INPUT.fa [INPUT2.fa ...]] >OUTPUT.fa
Reconstructing function:
cysbar.py -r -b [POS1:][BARCODE1] [-b [POS2:][BARCODE2] [-b ...]] [-B BARCODE.txt] [-V VARIABLES.csv] [-s OUTPUT.csv] [-S SEPARATOR] [INPUT.fa [INPUT2.fa ...]] >OUTPUT.fa
Arguments:
-h, --help Print this help information
-r Operate in reconstruction mode (i.e. restoring residues)
-b POS[:BARCODE] Specify the POSition to replace with BARCODE
POS Residue index (1-based number) to replace in each sequence
BARCODE Custom barcode to use at this position (or see -B option)
-B BARCODE.txt Use file named BARCODE.txt to source barcodes from (one per line)
-V VARIABLES.csv Use file named VARIABLES.csv to source hydropathy and charge
variables. Contains Residue, Hydropathy and Charge columns
(no headers)
-S SEPARATOR String to enclose replaced residues when placed in FastA ID
(Default: '__BC__')
INPUT.fa Input fasta file (default = standard input)
> OUTPUT.fa Output fasta file (default = standard output)
This readme describes in detail how to use the python script cysbar.py
locally and how to use the data analysing excel spreadsheet loopproperties.xlsx
. For simple online use of the scripts a webtool is available at CysBar.science.latrobe.edu.au.
The script is in the python programming language and runs on linux, or mac machine. Windows users need to install python. A short tutorial of the necessary basics to run pyton scripts is here:
www.python.org
The script is annotated and can be opened by any text editor to read and edit.
Both barcoding and reconstructing functions need the user to define some basic parameters (arguments). These are entered after cysbar.py
with -x
commands.
The input alignment is barcoded using cysbar.py
, which replaces columns of a fasta file (specified by -b
arguments). Removed residues are stored in the fasta ID between __BC__
separators. Custom barcodes can be used either by providing a list of sequences (-B
argument) or by defining a custom barcode to be used at a specific position (-b POS:BARCODE
argument).
Command for barcoding:
cysbar.py -b POS1 -b POS2 ...
Optional arguments:
-b POS:BARCODE Use a custom barcode at this position
-B BARCODE.txt Use file named BARCODE.txt to source barcodes from (one per line)
-S SEPARATOR String to enclose replaced residues when placed in FastA ID
(Default: '__BC__')
INPUT.fa Input fasta file (default = standard input)
> OUTPUT.fa Output fasta file (default = standard output)
Example:
bin/cysbar.py -b 11 -b 22 -b 28 -b 32 -b 43 -b 50 -b 52 -b 56 -B example_datasets/default_barcodes.txt example_datasets/example1.fa >example1_barcoded.fa
The above command will barcode the 11th, 29th... etc. columns of the example1.fa
alignment using the barcode set provided in example_barcodes.txt
, creating an output alignment called example1_barcoded.fa
. The example1_barcoded.fa
file can then be re-aligned by any standard alignment program (e.g. ClustalΩ, Probcons, Muscle).
Notes:
When several sub-group alignments need to be combined into a single alignment barcoded alignment, the individual sub-group alignments should be barcoded independently with cysbar.py
(as in the example figures used in the manuscript).
The barcoded alignments should then be concatenated into a single alignment file. This single combined alignment can be re-aligned with any standard alignment program. The reconstruction function can then be run on the re-aligned, combined alignment.
Once the barcoded alignment has been re-aligned by an appropriate program, cysbar.py -r
replaces the barcoded columns of a fasta file with the original columns that were removed and saved in the fasta ID by barcoding function. Additionally, it can provide summary file of the length, hydrophobicity and charge of each inter-cysteine region (-s
argument).
Command for reconstructing:
cysbar.py -r
Optional arguments:
-s OUTPUT.csv Produce (and save) summary statistics file named OUTPUT.csv
-b POS:BARCODE Define the custom barcode used at this position
-B BARCODE.txt Use file named BARCODE.txt to source used barcodes, one per line
-V VARIABLES.csv Use file named VARIABLES.csv to source hydropathy and charge
variables. Contains Residue, Hydropathy and Charge columns
(no headers)
-S SEPARATOR String that was used to enclose replaced residues in the FastA ID
(Default: '__BC__')
INPUT.fa Input fasta file (default = standard input)
> OUTPUT.fa Output fasta file (default = standard output)
Example:
bin/cysbar.py -r -s example_datasets/example2_stats.csv -V example_datasets/default_hydropathy_charge.csv example2.fa >example2_reconstructed.fa
The above command will restore the original sequences of the example2.fa
alignment, creating an output alignment called example2_reconstructed.fa
and use the hydropathy and charge variable set provided in example_hydropathy_charge.csv
to generate a summary of statistics called example2_stats.csv
.
Notes:
If the default barcodes was used, -b
arguments are unnecessary since the script will search for the default barcode sequences. If custom barcode sequences were used, the same sequences must be defined in the cysbar.py -r
command. For calculating loop properties, all cysteines are assumed to be oxidised (i.e. involved in disulphide bridges).
If cysbar.py -r
has been commanded to generate a statistics summary csv file (-s
argument), the data from this can be copy-pasted into loopproperties.xlsx
which will provide summary graphs. To use it follow these steps:
- In Excel, open both the statistics csv file and
loopproperties.xlsx
- Copy the whole spreadsheet of the statistics csv file
- Paste the copied contents into the first worksheet page of
loopproperties.xlsx
- The subsequent worksheets of
loopproperties.xlsx
give the following information:
a. Length – amino acid length
b. Hydropathy – Doolittle index of hydrophobicity
c. Charge – net coulomb charge
d. Figures – a summary of bar charts and histograms from the other sheets
Notes:
This spreadsheet only functions for a maximum of 8 barcoded columns. Parameters with no data (e.g. the charge of a loop with length zero) are marked with 1E-100 to distinguish from true zero values (e.g. the charge of a neutral loop). These 1E-100 values are ignored when generating the subsequent charts.