This tutorial is meant for members of the Kimball Lab at the University of Minnesota (Northern Wild Rice Conservation and Breeding) to understand how I made synteny plots using MCscan. The original tutorial by the author (Haibao Tang) can be found here. This is an excellent resource (and is where I learned how to use the program). The tutorial given here is more focused on the needs of our program.
- Zizania palustris vs. Oryza sativa
- Zizania palustris vs. Zizania latifolia (2015 version)
- Zizania palustris vs. Zizania latifolia (2022 version)
- Three species comparison
- Coloring specific links
- Microsynteny
- Claudia microsynteny
Depending on when you return to this tutorial, you may find that a script no longer works. When I came back to these scripts in January 2022 to put this tutorial together, the scripts (which previously worked) failed. The problem was that the system could not find the module files. The reason was that module load python
was too generic. The default version became python 3.8 and I had installed the software under version 3.7. I fixed it by specifying module load python3/3.7.4_anaconda2019.10
in the script. Once I did that, I had no issues whatsoever.
There are three lines in the script that use sed
. They are there to remove superfluous trailing strings in the bed
and cds
files that cause identical gene names to not be recognized as identical.
sed -i 's/\.MSUv7.0//g' oryza.bed
sed -i 's/\..*$//g' oryza.cds
sed -i 's/\-.*$//g' wild_rice.cds
The last line of the script called run_jcvi.sh is what ultimately generates the figure. Note that seqids
and layout
refer to files that provide vital information for how the figure appears.
python -m jcvi.graphics.karyotype seqids layout --format png
Here are some details about the layout
file specifications:
Note: As of this writing (14 January 2022), GitHub does not support colorizing text using HTML or other various methods that I found to insert color. The method that finally worked was to use placeholder.com. If previews of the color are unavaible, I apologize for that but the reason is likely a broken link. It's a minor thing, but feel free to report it to me if you encounter that issue.
When you're done, the finished figure should look like this:
This figure was made using the 2015 version of the Z. latifolia genome. You can find that paper here or using the following citation:
Guo L., Qiu J., Han Z., Ye Z., Chen C., Liu C., Xin X., et al. (2015) A host plant genome (Zizania latifolia) after a century-long endophyte infection. Plant J. 83, 600-609
Pay special attention to the seqids file because it is different than the one used in the comparison with O. sativa owing the the fact that Z. latifolia genome has a different number of chromosomes than O. sativa and the first version of the Z. latifolia genome is more fragmented than the Z. palustris genome. That is, there are more scaffolds than chromosomes/pseudochromosomes. Look at the seqids_latifolia file to see what I mean. This is why there are two tracks for Z. latifolia: one above the Z. palustris track and the other below it. Note: When I ran this in my MSI account, the file was just called seqids
but I renamed it here to avoid a file duplication conflict with the version used in the comparison with O. sativa. The same principle is true for the layout_latifolia file.
Colors are specified in the layout
file. We chose our color scheme after seeking consensus within the group on colors we liked and to be internally consistent. That means we use the same color for the same species across multiple figures within the same paper.
When you're done, the finished figure should look like this:
The data were downloaded from here using the following commands:
wget https://download.cncb.ac.cn/gwh/Plants/Zizania_latifolia_Zlat_genome_v1_GWHBFHI00000000/GWHBFHI00000000.gff.gz --no-check-certificate
Note: The --no-check-certificate
was added because the MSI system prompted me to add it.
After retrieving the necessary data, I converted (renamed) the GFF
file to GFF3
format. This is because I was having problems with the next steps (finding anchors) and thought this had something to do with it. The problem, I thought, was that the proper gene names (beginning with "Zla") were not being inserted into the BED
file; instead, the gene names all began with "gene_" followed by some number. Ultimately, this did not resolve my issue, but since it fixed the problem of bizarre gene names being inserted into the BED
file, I decided to leave it.
mv GWHBFHI00000000.gff.gz GWHBFHI00000000.gff3.gz
The next step is to convert the GFF
file to BAM
format:
python -m jcvi.formats.gff bed GWHBFHI00000000.gff.gz -o latifolia_version_2.bed
Next:
python -m jcvi.formats.fasta format GWHBFHI00000000.CDS.fasta.gz latifolia_version_2.cds
The major roadblock in completing the analysis was that the primary gene names in the .cds
file did not match up with the "Zla" gene names in the .bed
file (or the "gene_" names for that matter). The "Zla" names were included further along in the .cds
file headers so all that was needed to grab them was to remove a bunch of the extraneous information in the header of each gene. Once all of that was removed, the "Zla" gene names would be recognized as the primary gene name. In order to fix the .cds
file, you will need to use the following sed one-liner:
sed -i 's/^.*OriGeneID=/>/g' latifolia_version_2.cds
Before making the karyotype figure, we need to change (simplify) the chromosome names in the Z. latifolia .bed
file:
sed -i 's/GWHBFHI00000001/Chr1/g' latifolia_version_2.bed
Note: Regarding the order of the Zizania latifolia chromsome order: the authors assigned chromosome numbers based on size in decreasing order. This explains why our chromosome numbers do not match up despite the close evolutionary relationship of Zizania latifolia and Zizania palustris. We assigned our chromosome numbers based on their synteny with Oryza sativa. At the time of our publication, the updated Zizania latifolia genome was not available.
If we wanted to make another plot showing synteny between all three species (Zizania palustris, Zizania latifolia, and Oryza sativa), we only need to modify small aspects of the code we've already run. The .bed
and .cds
files already exist. The layout
and seqids
files only need slight modifications. To keep things clean, I created a new subdirectory to work in: /home/jkimball/haasx092/other_synteny_figures/three_species_plot
and copied all of the relevant .bed
and .cds
files to this directory. I also copied the existing layout
and seqids
files to this directory before modifying them. The basic layout
file looks like this:
#y, xstart, xend, rotation, color, label, va, bed, label_va
0.8, 0.18, 0.98, 0, #235e39, \it Z. latifolia \space -, top, latifolia_version_2.bed, center
0.6, 0.18, 0.98, 0, #4b0082, \it Z. palustris \space -, top, wild_rice.bed, center
0.4, 0.18, 0.98, 0, #ff7f00, \it O. sativa \space -, top, oryza.bed, center
# edges
e, 0, 1, wild_rice.latifolia_version_2.anchors.simple
e, 1, 2, wild_rice.oryza.anchors.simple
Note: The syntax for the labels are a little weird (for lack of a better word in my opinion). The \it
aspect is LaTeX syntax for converting the text that follows it to italics (which is necessary for a Latin binomial) and the \space
LaTeX syntax represents an attempt to add space after the species name because the text was running into the graphical representation of the first chromosome for each species. I'm not convinced that it was actually doing anything because even when I added 5 of these, I still had this problem. Other LaTeX solutions for adding spaces resulted in errors, but this completed without error so I left it in. Anyway, I think it was the inclusion of the hyphen (-
) that actually soved my issue. It's also not really visible in the final figure unless you choose to rotate the position of the chromosomes. This is also why the genus names are not completely spelled out-to shorten the length of the name to avoid having it run into the graphical representation of the chromosomes. I did not previously have this issue, so I don't know why I am having it now-but I am. I also tried to play around with the xstart
position, but that just resulted in the label running off the left side of the page. You might also notice that the y
positions denote where along the y-axis each track is drawn (e.g., 0.4, 0.6, and 0.8). This is why there will be superfluous white space at the bottom of each figure. I would recommend cropping that white space out using any editor of your choice. Especially if you want to put the final figure into a paper where it won't be visible, except that it could affect formatting in a multi-panel figure or if it would need to be shrunk to avoid overlapping with text.
Now for the seqids
file:
Chr1,Chr2,Chr3,Chr4,Chr5,Chr6,Chr7,Chr8,Chr9,Chr10,Chr11,Chr12,Chr13,Chr14,Chr15,Chr16,Chr17
Chr1,Chr2,Chr3,Chr4,Chr5,Chr6,Chr7,Chr8,Chr9,Chr10,Chr11,Chr12,Chr13,Chr14,Chr15
Chr1,Chr2,Chr3,Chr4,Chr5,Chr6,Chr7,Chr8,Chr9,Chr10,Chr11,Chr12
All this file really does is specify the chromosome labels for each species. The top line (track 0) represents Zizania latifolia, the middle line (track 1) represents Zizania palustris, and the third line (track 2) represents Oryza sativa. They can be in whichever order you like, but the layout
and seqids
files need to agree with each other.
Note: Since we are reusing exising files (because why reinvent the wheel?), you might forget that you need to load the required programs. So, remember to load all of the necessary programs before you try to run any code 😃
export PATH=$PATH:/home/jkimball/haasx092/synteny_figure/last-1060/src
export PATH=$PATH:/home/jkimball/haasx092/synteny_figure/install-tl-20200505/1/bin/x86_64-linux
module load python
Then, all you need to run is:
python -m jcvi.graphics.karyotype seqids layout --format png
This "default" figure will look like this:
You can rotate whichever chromosome track you like by altering the the "rotation" parameter in the layout
file:
#y, xstart, xend, rotation, color, label, va, bed, label_va
0.8, 0.18, 0.98, 15, #235e39, \it Z. latifolia \space -, top, latifolia_version_2.bed, center
0.6, 0.18, 0.98, 0, #4b0082, \it Z. palustris \space -, top, wild_rice.bed, center
0.4, 0.18, 0.98, -15, #ff7f00, \it O. sativa \space -, top, oryza.bed, center
# edges
e, 0, 1, wild_rice.latifolia_version_2.anchors.simple
e, 1, 2, wild_rice.oryza.anchors.simple
After modifying the layout
file, run the same one line of code as you did before:
python -m jcvi.graphics.karyotype seqids layout --format png
If you would rather have straight lines showing connections between syntenic regions (rather than Bezier curves), simply add --shadestyle=line
to the code that creates the figure.
python -m jcvi.graphics.karyotype seqids layout --format png --shadestyle=line
At some point, you may want to add color between links to draw your audience's attention to one or a few specific links. You can do this by making small edits to the .simple
file(s). Also just for the record, the default grey is called "gainsboro" in the python named color universe. (I didn't know where else to mention this or if anyone would care, but I found it while investigating how the karyotype.py
script works, so I thought I would mention it.
You add color by adding r*
before the genes that you want to highlight. Below is a sample excerpt from a .simple
file. I am not including a full .simple
file in this README
document because it would take up unnecessary space and I think this gets the point across. The "r" will make it red. If you want, you can pick other colors like blue (b), green (g), cyan (c), magenta (m), yellow (y), black (k), or white (w). You can also specify any color you want using its HEX code. More on that below.
FUN_001229 FUN_001281 LOC_Os09g27940 LOC_Os09g28560 52 -
FUN_001289 FUN_001454 LOC_Os09g24540 LOC_Os09g27590 205 -
FUN_001457 FUN_001632 LOC_Os09g36140 LOC_Os09g39870 247 -
FUN_001647 FUN_001719 LOC_Os09g33510 LOC_Os09g36040 103 -
FUN_002470 FUN_002503 r*LOC_Os01g39890 LOC_Os01g40700 48 -
FUN_002506 FUN_002599 r*LOC_Os01g36070 LOC_Os01g39630 146 -
FUN_002675 FUN_002796 r*LOC_Os01g16090 LOC_Os01g19970 173 +
FUN_002804 FUN_002841 r*LOC_Os01g20970 LOC_Os01g22954 66 +
FUN_002883 FUN_002979 r*LOC_Os01g13100 LOC_Os01g15290 130 +
FUN_002984 FUN_003152 r*LOC_Os01g09890 LOC_Os01g13040 212 +
FUN_003189 FUN_003284 r*LOC_Os01g07810 LOC_Os01g09770 127 +
FUN_003318 FUN_003365 r*LOC_Os01g05430 LOC_Os01g06640 71 -
FUN_003391 FUN_003453 r*LOC_Os01g01890 LOC_Os01g03980 103 +
FUN_003488 FUN_003588 LOC_Os08g01054 LOC_Os08g02860 127 +
FUN_003593 FUN_003761 LOC_Os08g03240 LOC_Os08g06640 213 +
FUN_003768 FUN_003798 LOC_Os08g07400 LOC_Os08g08210 46 +
Using the complete version of the above .simple
file will result in the following figure:
Using HEX codes, the .simple
file would look something like this:
FUN_014647 FUN_014795 LOC_Os05g50490 LOC_Os05g51860 132 -
FUN_014851 FUN_014891 LOC_Os05g49580 LOC_Os05g50130 47 -
FUN_015145 FUN_015178 LOC_Os02g42406 LOC_Os02g42850 34 -
#235e39*FUN_015215 FUN_015367 LOC_Os01g39770 LOC_Os01g42820 195 -
#235e39*FUN_015373 FUN_015488 LOC_Os01g36580 LOC_Os01g39380 146 -
#235e39*FUN_015503 FUN_015653 LOC_Os01g31110 LOC_Os01g35230 196 -
#235e39*FUN_015708 FUN_015846 LOC_Os01g18420 LOC_Os01g21440 158 +
#235e39*FUN_015892 FUN_016018 LOC_Os01g22336 LOC_Os01g25600 162 +
#235e39*FUN_016056 FUN_016149 LOC_Os01g15470 LOC_Os01g16230 75 +
#235e39*FUN_016190 FUN_016243 LOC_Os01g14310 LOC_Os01g14870 50 -
#235e39*FUN_016235 FUN_016350 LOC_Os01g16240 LOC_Os01g18100 109 -
#235e39*FUN_016439 FUN_016539 LOC_Os01g09590 LOC_Os01g10520 92 +
#235e39*FUN_016564 FUN_016713 LOC_Os01g10680 LOC_Os01g12690 162 +
#235e39*FUN_016715 FUN_016919 LOC_Os01g07080 LOC_Os01g08860 181 -
#235e39*FUN_016923 FUN_017105 LOC_Os01g02860 LOC_Os01g06836 243 -
#235e39*FUN_017109 FUN_017164 LOC_Os01g01010 LOC_Os01g02200 76 -
#235e39*FUN_017179 FUN_018308 LOC_Os01g58080 LOC_Os01g74570 1230 -
#235e39*FUN_018309 FUN_019085 LOC_Os01g45140 LOC_Os01g57940 894 -
#235e39*FUN_019090 FUN_019163 LOC_Os01g42840 LOC_Os01g44360 95 -
FUN_019632 FUN_019675 LOC_Os04g08060 LOC_Os04g09260 50 -
FUN_019707 FUN_019766 LOC_Os04g01070 LOC_Os04g03100 88 -
Note: I used #235e39
because I think it is a particularly pretty shade of green. #235e39
=
You can also choose to use any legal HTML color that you like (e.g., "burlywood" or "chartreuse").
In addition to color, you can choose various shades of grey by specifying a number between 0 and 1 (e.g., 0.5 as in the example below).
FUN_014647 FUN_014795 LOC_Os05g50490 LOC_Os05g51860 132 -
FUN_014851 FUN_014891 LOC_Os05g49580 LOC_Os05g50130 47 -
FUN_015145 FUN_015178 LOC_Os02g42406 LOC_Os02g42850 34 -
0.5*FUN_015215 FUN_015367 LOC_Os01g39770 LOC_Os01g42820 195 -
0.5*FUN_015373 FUN_015488 LOC_Os01g36580 LOC_Os01g39380 146 -
0.5*FUN_015503 FUN_015653 LOC_Os01g31110 LOC_Os01g35230 196 -
0.5*FUN_015708 FUN_015846 LOC_Os01g18420 LOC_Os01g21440 158 +
0.5*FUN_015892 FUN_016018 LOC_Os01g22336 LOC_Os01g25600 162 +
0.5*FUN_016056 FUN_016149 LOC_Os01g15470 LOC_Os01g16230 75 +
0.5*FUN_016190 FUN_016243 LOC_Os01g14310 LOC_Os01g14870 50 -
0.5*FUN_016235 FUN_016350 LOC_Os01g16240 LOC_Os01g18100 109 -
0.5*FUN_016439 FUN_016539 LOC_Os01g09590 LOC_Os01g10520 92 +
0.5*FUN_016564 FUN_016713 LOC_Os01g10680 LOC_Os01g12690 162 +
0.5*FUN_016715 FUN_016919 LOC_Os01g07080 LOC_Os01g08860 181 -
0.5*FUN_016923 FUN_017105 LOC_Os01g02860 LOC_Os01g06836 243 -
0.5*FUN_017109 FUN_017164 LOC_Os01g01010 LOC_Os01g02200 76 -
0.5*FUN_017179 FUN_018308 LOC_Os01g58080 LOC_Os01g74570 1230 -
0.5*FUN_018309 FUN_019085 LOC_Os01g45140 LOC_Os01g57940 894 -
0.5*FUN_019090 FUN_019163 LOC_Os01g42840 LOC_Os01g44360 95 -
FUN_019632 FUN_019675 LOC_Os04g08060 LOC_Os04g09260 50 -
FUN_019707 FUN_019766 LOC_Os04g01070 LOC_Os04g03100 88 -
A note regarding using RGB (red, green, blue) color notation:
Another way of specifying colors is with the RGB format or RGBA format where a=alpha, a value indicating how transparent a color should be where 0=fully transparent and 1=fully opaque). In RGB/RGBA format, each value needs to be a float value between 0 and 1 so if you have integers (such as 35,94,57 that matches HEX code #235e39 above) you could convert them to floats by dividing each value by 255. I do not think it is possible to specify a specific color in this particular way. I tried every variation of syntax that I thought would work, but every single one of them resulted in a traceback error. I did some digging through source code and I think I understand why RGB notation cannot be used. The code for MCscan is written in python and as such the colors are ultimately derived from matplotlib. The links are drawn using code contained within the synteny.py
script from MCscan. In that script, an object of class Shade() is defined. There are several optional parameters including two relevant parameters called edge color (ec) and face color (fc) with the default set to black (k). In the karyotype.py
script, the default value for both ec and fc parameters is changed to "gainsboro" which explains why the default links are grey and not black.
I think the thing that is key here is that the parameters expect strings as input. This is why single-color codes, HEX codes, named colors, and even float values (if you desire greyscale) all work: they can all be interpreted as string values. Even though the greyscale values are technically floats, they can be interpreted as strings before being converted to floats. The RGB notation would be interpreted as a tuple. With this in mind, there is probably a solution to enable the usage of RGB values involving editing the synteny.py
source code, but it's probably just easier to translate your RGB values to a HEX code using the internet. 🌐
In some cases, you may want to look at a specific region more closely, such as the one surrounding the shattering4 (sh4) gene.
The purpose of this project was to visualize a cluster of genes involved in phytoalexin synthesis in NWR. Claudia created a nice PowerPoint presentation including a comparative genomics slide looking at microsynteny between Zizania palustris, Zizania latifolia, and Oryza sativa. The genomic interval of interest was on Zizania palustris chromosome 6 (ZPchr0006) between 33,535,817 bp and 33,594,019 bp.
Using the start/end values of this interval as a starting point, I found several genes in this interval (FUN_030362, FUN_030363, FUN_030364, novel_gene_1089_5e3b9126, novel_gene_140_5e3b9126, FUN_030365, FUN_030366, FUN_030367, FUN_030368) in the file wild_rice_oryza.bed
. I did this by searching for the staring point (33535817
). You do this by opening the file (less wild_rice_oryza.bed
), then hitting the esc
(escape) key, then the shift
(shift) and :
(colon) keys (together). You should see a colon appear in the bottom left corner of the terminal window. Hit the forward slash key (/
), then the starting number of the interval (33535817
), then hit enter. The first hit is in fact on chromosome 6, so I know that I am getting the right genes/on the right chromosome. The start position is in column 2 and the end position is in column 3. In fact, you can see the end position of the interval (33594019
) a few rows down. The genes of interest (in this interval) are in column 4.
In order to identify which genes from the other species correspond to these NWR genes, I had to search the pre-existing anchor files (e.g., wild_rice_filt.latifolia_version_2.lifted.anchors
). These are MC-scan specific files that show homologous genes. This is how I created (manually) the helper text file claudia_blocks_ordered
.
In order to get the coordinates for each gene, you'll need to use a bed
file. I created the file 3species_micro-collinearity_copy.bed
by concatenating the component species' bed
files:
cat latifolia_version_2.bed wild_rice_oryza.bed > 3species_micro-collinearity_copy.bed
Note: The file wild_rice_oryza.bed
already existed and was in the correct order (Zizania palustris then Oryza sativa). If that wasn't the case, you would need to concatenate the individual files (latifolia_version_2.bed
, wild_rice.bed
, and oryza.bed
) in the order that you want them to appear in the bed
file.
One thing to pay attention to is that the species in the bed file must occur in the order that you want them to appear in the figure.
After the first attempt at generating this figure, the image looks like this:
Note: The output of the MCscan code is a PDF file. Before uploading it here, I first converted it to PNG format (as I do for all of my GitHub images), but I also cropped out excess whitespace because there is often excess whitespace around MCscan figures (especially the karyotype figures like this one). Also for this reason, some of the example figures shown below might look like they are different sizes. This is because when I cropped out the excess white space, I didn't pay super close attention to ensuring that I cropped out exactly the same amount of white space each time. I figured it isn't really necessary for purposes of this tutorial (vs. preparing a figure for publication in a manuscript).
In the initial figure, you'll notice that the Zizania latifolia and Oryza sativa chromosomes have gene names from other species overlapping with gene names from the appropriate species. This should not happen. I think the reason this happened is because MCscan was pulling information about which chromosomes to use from the blocks.layout
file. Because both Zizania latifolia and Oryza sativa have similar genomic intervals (22.55-22.62 Mb and 21.66-21.86 Mb, respectively) despite being on different chromosomes (Chr08 and Chr02, respectively), MCscan pulled information from Oryza sativa chromosome 8 in the bed
file and plotted it along Zizania latifolia chromosome 8 genes. It did the same with Zizania latifolia chromosome 2 genes--plotting them alongside Oryza sativa chromosome 2 genes. The Zizania palustris chromosome 6 did not plot gene names from either of the other species because its interval (33.54-33.59 Mb) must not have any genes in the bed
file from those species.
I thought that filtering out all of the unused/irrelevant information from the bed
file would solve this problem. After trying it to see if I was right, I found that I was indeed right, so I will describe the steps that I took to achieve this next.
Filtering the bed
file:
awk '$4~/^Zla08/' 3species_micro-collinearity.bed >> 3species_micro-collinearity_filtered.bed
awk '$1~/Chr6/ && $4~/^FUN/' 3species_micro-collinearity.bed 3species_micro-collinearity.bed >> 3species_micro-collinearity_filtered.bed
awk '$4~/^LOC_Os02/' 3species_micro-collinearity.bed >> 3species_micro-collinearity_filtered.bed
Note: I used the >>
symbol so I could iteratively add (append) selections for each specific species/chromosome pair without overwriting the previous work, which would have happened if I used the >
symbol (writes the "standard output" from a Unix command to a file). Here's the logic of the AWK code: Field 4 ($4
) in the bed
file contains the gene names and both the Zizania latifolia and Oryza sativa gene names indicate both species (e.g.,"Zla" or "LOC_Os") and chromosome number (e.g., "08" or "02"), so I was able to select on both factors simultaneously. Because the gene names for Zizania palustris are the original gene names from the FUNannotate pipeline, they have the prefix "FUN" and so do not contain any reference to species or chromosome number. This is why I also had to select the first field/column ($1
) to select for the chromosome that we wanted.
There was another change that I had to make to improve upon the initial figure. In the initial figure, you'll notice that the links go from the Zizania latifolia chromosome directly to that of Oryza sativa. There are no links from Zizania palustris to Oryza sativa. That's because in the initial blocks.layout
file, I the line e, 0, 1
repeated twice under #edges
. Changing the second line to e, 1, 2
fixed this. Here is an excerpt from the MCscan tutorial:
Like the layout file in the macro-synteny section, edges stanza say connecting grape (column 0) with peach (column 1) and grape (column 0) with cacao (column 2).
That figure has one chromosome (grape) on top and two chromsomes (peach and cacao) next to each other on the . In any case, I think you should interpret the blocks.layout
file shown below as "connect Zizania latifolia to Zizania palustris (e.g., 0,1) and connect Zizania palustris to Oryza sativa (e.g., 1,2)."
The top three lines (apart from the header line) indicate where the chromosome lines themselves are oriented. I think the x-axis value (0.6) for all ends up being the center of the chromosome while the y-axis values (0.8, 0.6, and 0.4) orient the chromosomes vertically. I set horizontal alignment ("ha") to be "left", but I think the lengths of some of the chromosomes (especially Oryza sativa) impact this in practice.
The blocks.layout
file should look like this:
# x, y, rotation, ha, va, color, ratio, label, chr
0.6, 0.8, 0, left, center, #235e39, 1, Zizania\ latifolia, chr08
0.6, 0.6, 0, left, center, #4b0082, 1, Zizania\ palustris, chr06
0.6, 0.4, 0, left, center, #ff7f00, 1, Oryza\ sativa, chr02
# edges
e, 0, 1
e, 1, 2
(The species names are currently not in italics like I think they should be, but that's a problem for another day.)
Another thing that we might want to do is change the color of the genes to show homology between species. The default colors are blue and green which I think represent the strand that the gene is on. In order to change the color of the genes, we simply add --glyphcolor=orthogroup
to the script so that it now looks like this:
python -m jcvi.graphics.synteny claudia_blocks_ordered 3species_micro-collinearity_filtered.bed blocks.layout --genelabelsize=5 --glyphcolor=orthogroup
Now, the figure looks like this:
Now that some of the bigger issues with the figure have been addressed, it is time to figure out how to italicize the species names (to fit with convention). Simplly adding \it
in front of each species name (and removing the \
after the genus names is enough to do achieve this.
The blocks.layout
file should now look like this:
# x, y, rotation, ha, va, color, ratio, label, chr
0.6, 0.8, 0, left, center, #235e39, 1, \it Zizania latifolia, chr08
0.6, 0.6, 0, left, center, #4b0082, 1, \it Zizania palustris, chr06
0.6, 0.4, 0, left, center, #ff7f00, 1, \it Oryza sativa, chr02
# edges
e, 0, 1
e, 1, 2
Now, the figure looks like this:
If you want to change the color of the species names to black, that's also very easy to do. The HEX codes usesd so far in the blocks.layout
file were used to match the colors used to represent each species in the genome paper for consistency.
The blocks.layout
file will look like this:
# x, y, rotation, ha, va, color, ratio, label, chr
0.6, 0.8, 0, left, center, black, 1, \it Zizania latifolia, chr08
0.6, 0.6, 0, left, center, black, 1, \it Zizania palustris, chr06
0.6, 0.4, 0, left, center, black, 1, \it Oryza sativa, chr02
# edges
e, 0, 1
e, 1, 2
Now, the figure will look like this:
Another issue with the figure is that it still uses the "FUN" gene names that were the original gene names from the FUNannotate pipeline. We've since adopted a "ZPchrXXXX" nomenclature. This wasn't done for a variety of reasons and the nuances might be too detailed to mention here, but the steps to change the names in the figure are simple.
The gene names need to be switched in two file:
claudia_blocks_ordered
and
3species_micro-collinearity_filtered.bed
I made a copy of claudia_blocks_ordered
and named it claudia_blocks_ordered_ZP_names
. I manually edited this file with vi
because it is so small. There's only 10 gene names that needed to be switched. I looked up the "FUN" gene names in the file 201003_updated_annotation_file.csv
in Excel and used the corresponding "ZP" gene to replace the "FUN" gene in claudia_blocks_ordered_ZP_names
. So, FUN_030362
became ZPchr0006g40673
, FUN_030363
became ZPchr0006g45140
, and so forth. One thing to note is that the "ZP" genes are not in the same numeric order as the "FUN" genes. I'm not sure why, but I don't think it matters that much. Probably just an artefact of how the gene names were changed.
Editing the bed
file is a bit trickier since it is so much larger. It also contains far more genes than we actually need to change. The file claudia_blocks_ordered_ZP_names
basically tells MCscan which genes we are interested in looking at while the bed
file has the coordinates that are important for mapping. So, we only really need to change the "FUN" gene names that appear in the file claudia_blocks_ordered_ZP_names
.
To achieve this, I used the following code on the command line:
sed -i 's/FUN_030362/ZPchr0006g40673/g' 3species_micro-collinearity_filtered_copy.bed
sed -i 's/FUN_030363/ZPchr0006g45140/g' 3species_micro-collinearity_filtered_copy.bed
sed -i 's/FUN_030364/ZPchr0006g41229/g' 3species_micro-collinearity_filtered_copy.bed
sed -i 's/novel_gene_1089_5e3b9126/ZPchr0006g45307/g' 3species_micro-collinearity_filtered_copy.bed
sed -i 's/novel_gene_140_5e3b9126/ZPchr0006g43296/g' 3species_micro-collinearity_filtered_copy.bed
sed -i 's/FUN_030365/ZPchr0006g43754/g' 3species_micro-collinearity_filtered_copy.bed
sed -i 's/FUN_030366/ZPchr0006g43762/g' 3species_micro-collinearity_filtered_copy.bed
sed -i 's/FUN_030367/ZPchr0006g43262/g' 3species_micro-collinearity_filtered_copy.bed
sed -i 's/FUN_030368/ZPchr0006g43948/g' 3species_micro-collinearity_filtered_copy.bed
This will do a search-and-replace for the "FUN" genes and replace them with their "ZP" gene equivalent.
Now when we run MCscan, we can use these slightly altered files and we will get the same figure as above, but with "ZP" names:
python -m jcvi.graphics.synteny claudia_blocks_ordered_ZP_names 3species_micro-collinearity_filtered_copy.bed blocks.layout --genelabelsize=5 --glyphcolor=orthogroup
Another thing (maybe the last thing?) to do to improve the figure is to add the chromosome number next to the species name. This is easy enough to achieve. For example, all you need to do is add "Chr08" after Zizania latifolia or "Chr06" after Zizania palustris in the blocks.layout
file. However, the font is still italicized and I think it would look better if the chromosome label were in standard font. The way to achieve this is to add \normalfont
after the species name, but before the chromosome number. So now your blocks.layout
file should look like this:
# x, y, rotation, ha, va, color, ratio, label, chr
0.6, 0.8, 0, left, center, black, 1, \it Zizania latifolia \normalfont Chr08, chr08
0.6, 0.6, 0, left, center, black, 1, \it Zizania palustris \normalfont Chr06, chr06
0.6, 0.4, 0, left, center, black, 1, \it Oryza sativa \normalfont Chr02, chr02
# edges
e, 0, 1
e, 1, 2
And the figure should look like:
Ok, so Claudia would ideally like the chromosome label on the same line at the genomic interval (given in Mb). Theoretically, this should be possible but it would require edited the source code (like I've done for a few different figures in the past).
Depending on which version of Python you are using, the source code might either be in:
/home/jkimball/haasx092/.local/lib/python3.7/site-packages/jcvi/graphics
For Python 3.7
or
/home/jkimball/haasx092/.local/lib/python3.8/site-packages/jcvi/graphics
For Python 3.8
I'm making every effort to make sure everyone in the lab will have permission to access these files in my system, but if you can't access them, you can always download your own instance of MCscan.
Back to the main problem at hand. Based on the code/commands used to create the figure, I think the script that needs to be edited is synteny.py
. I found a note that might be relevant on lines 339-344 (in the Python 3.8 version).
# TODO: I spent several hours on trying to make this work - with no
# good solutions. To generate labels on multiple lines, each line
# with a different style is difficult in matplotlib. The only way,
# if you can tolerate an extra dot (.), is to use the recipe below.
# chr_label = r"\noindent " + markup(chr) + r" \\ ." if chr_label else None
# loc_label = r"\noindent . \\ " + label if loc_label else None
I made the changes suggested to lined 346 and 347 (just below this comment), but it didn't seem to alter the behavior. Actually, the authors might have intended for users to change the behavior of the program by commenting out the code on line 346 and 347 and removing the pound signs on lines 343-344. (Note: line numbers are different in the Python 3.7 version --lines 283-288--. It's possible that a different (dependent) script actually has the code that needs to be modified. What I'm looking for is code that reads the position information from the bed
file and prints it the graphics device. The chromosome IDs are also in the bed
file, so it should be simple to extract that information from the file and print it at the same time. If I can find the part of the code that does that.