Skip to content

MatthewHaas/synteny_plot_tutorial

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 

Repository files navigation

Synteny plot (comparative genomics) tutorial

This tutorial is meant for members of the Kimball Lab at the University of Minnesota (Northern Wild Rice Conservation and Breeding) to understand how I made synteny plots using MCscan. The original tutorial by the author (Haibao Tang) can be found here. This is an excellent resource (and is where I learned how to use the program). The tutorial given here is more focused on the needs of our program.

Directory

  1. Zizania palustris vs. Oryza sativa
  2. Zizania palustris vs. Zizania latifolia (2015 version)
  3. Zizania palustris vs. Zizania latifolia (2022 version)
  4. Three species comparison
  5. Coloring specific links
  6. Microsynteny
  7. Claudia microsynteny

Zizania palustris vs Oryza sativa

Depending on when you return to this tutorial, you may find that a script no longer works. When I came back to these scripts in January 2022 to put this tutorial together, the scripts (which previously worked) failed. The problem was that the system could not find the module files. The reason was that module load python was too generic. The default version became python 3.8 and I had installed the software under version 3.7. I fixed it by specifying module load python3/3.7.4_anaconda2019.10 in the script. Once I did that, I had no issues whatsoever.

There are three lines in the script that use sed. They are there to remove superfluous trailing strings in the bed and cds files that cause identical gene names to not be recognized as identical.

sed -i  's/\.MSUv7.0//g' oryza.bed
sed -i  's/\..*$//g' oryza.cds
sed -i 's/\-.*$//g' wild_rice.cds

The last line of the script called run_jcvi.sh is what ultimately generates the figure. Note that seqids and layout refer to files that provide vital information for how the figure appears.

python -m jcvi.graphics.karyotype seqids layout --format png

Here are some details about the layout file specifications:

  • #4b0082 #4b0082 is for Zizania palustris
  • #ff7f00 #ff7f00 is for Oryza sativa

Note: As of this writing (14 January 2022), GitHub does not support colorizing text using HTML or other various methods that I found to insert color. The method that finally worked was to use placeholder.com. If previews of the color are unavaible, I apologize for that but the reason is likely a broken link. It's a minor thing, but feel free to report it to me if you encounter that issue.

When you're done, the finished figure should look like this:

Zizania palustris vs Zizania latifolia (2015 version)

This figure was made using the 2015 version of the Z. latifolia genome. You can find that paper here or using the following citation:

Guo L., Qiu J., Han Z., Ye Z., Chen C., Liu C., Xin X., et al. (2015) A host plant genome (Zizania latifolia) after a century-long endophyte infection. Plant J. 83, 600-609

Pay special attention to the seqids file because it is different than the one used in the comparison with O. sativa owing the the fact that Z. latifolia genome has a different number of chromosomes than O. sativa and the first version of the Z. latifolia genome is more fragmented than the Z. palustris genome. That is, there are more scaffolds than chromosomes/pseudochromosomes. Look at the seqids_latifolia file to see what I mean. This is why there are two tracks for Z. latifolia: one above the Z. palustris track and the other below it. Note: When I ran this in my MSI account, the file was just called seqids but I renamed it here to avoid a file duplication conflict with the version used in the comparison with O. sativa. The same principle is true for the layout_latifolia file.

Colors are specified in the layout file. We chose our color scheme after seeking consensus within the group on colors we liked and to be internally consistent. That means we use the same color for the same species across multiple figures within the same paper.

  • #235e39 #235e39 is for Zizania latifolia
  • #4b0082 #4b0082 is for Zizania palustris

When you're done, the finished figure should look like this:

Zizania palustris vs Zizania latifolia (2022 version)

The data were downloaded from here using the following commands:

wget https://download.cncb.ac.cn/gwh/Plants/Zizania_latifolia_Zlat_genome_v1_GWHBFHI00000000/GWHBFHI00000000.gff.gz --no-check-certificate

Note: The --no-check-certificate was added because the MSI system prompted me to add it.

After retrieving the necessary data, I converted (renamed) the GFF file to GFF3 format. This is because I was having problems with the next steps (finding anchors) and thought this had something to do with it. The problem, I thought, was that the proper gene names (beginning with "Zla") were not being inserted into the BED file; instead, the gene names all began with "gene_" followed by some number. Ultimately, this did not resolve my issue, but since it fixed the problem of bizarre gene names being inserted into the BED file, I decided to leave it.

mv GWHBFHI00000000.gff.gz GWHBFHI00000000.gff3.gz

The next step is to convert the GFF file to BAM format:

python -m jcvi.formats.gff bed GWHBFHI00000000.gff.gz -o latifolia_version_2.bed

Next:

python -m jcvi.formats.fasta format GWHBFHI00000000.CDS.fasta.gz latifolia_version_2.cds

The major roadblock in completing the analysis was that the primary gene names in the .cds file did not match up with the "Zla" gene names in the .bed file (or the "gene_" names for that matter). The "Zla" names were included further along in the .cds file headers so all that was needed to grab them was to remove a bunch of the extraneous information in the header of each gene. Once all of that was removed, the "Zla" gene names would be recognized as the primary gene name. In order to fix the .cds file, you will need to use the following sed one-liner:

sed -i 's/^.*OriGeneID=/>/g' latifolia_version_2.cds

Before making the karyotype figure, we need to change (simplify) the chromosome names in the Z. latifolia .bed file:

sed -i 's/GWHBFHI00000001/Chr1/g' latifolia_version_2.bed

Note: Regarding the order of the Zizania latifolia chromsome order: the authors assigned chromosome numbers based on size in decreasing order. This explains why our chromosome numbers do not match up despite the close evolutionary relationship of Zizania latifolia and Zizania palustris. We assigned our chromosome numbers based on their synteny with Oryza sativa. At the time of our publication, the updated Zizania latifolia genome was not available.

Three species comparison

If we wanted to make another plot showing synteny between all three species (Zizania palustris, Zizania latifolia, and Oryza sativa), we only need to modify small aspects of the code we've already run. The .bed and .cds files already exist. The layout and seqids files only need slight modifications. To keep things clean, I created a new subdirectory to work in: /home/jkimball/haasx092/other_synteny_figures/three_species_plot and copied all of the relevant .bed and .cds files to this directory. I also copied the existing layout and seqids files to this directory before modifying them. The basic layout file looks like this:

#y, xstart, xend, rotation, color, label, va, bed, label_va
0.8, 0.18, 0.98, 0, #235e39, \it Z. latifolia \space -, top, latifolia_version_2.bed, center
0.6, 0.18, 0.98, 0, #4b0082, \it Z. palustris \space -, top, wild_rice.bed, center
0.4, 0.18, 0.98, 0, #ff7f00, \it O. sativa \space -, top, oryza.bed, center
# edges
e, 0, 1, wild_rice.latifolia_version_2.anchors.simple
e, 1, 2, wild_rice.oryza.anchors.simple

Note: The syntax for the labels are a little weird (for lack of a better word in my opinion). The \it aspect is LaTeX syntax for converting the text that follows it to italics (which is necessary for a Latin binomial) and the \space LaTeX syntax represents an attempt to add space after the species name because the text was running into the graphical representation of the first chromosome for each species. I'm not convinced that it was actually doing anything because even when I added 5 of these, I still had this problem. Other LaTeX solutions for adding spaces resulted in errors, but this completed without error so I left it in. Anyway, I think it was the inclusion of the hyphen (-) that actually soved my issue. It's also not really visible in the final figure unless you choose to rotate the position of the chromosomes. This is also why the genus names are not completely spelled out-to shorten the length of the name to avoid having it run into the graphical representation of the chromosomes. I did not previously have this issue, so I don't know why I am having it now-but I am. I also tried to play around with the xstart position, but that just resulted in the label running off the left side of the page. You might also notice that the y positions denote where along the y-axis each track is drawn (e.g., 0.4, 0.6, and 0.8). This is why there will be superfluous white space at the bottom of each figure. I would recommend cropping that white space out using any editor of your choice. Especially if you want to put the final figure into a paper where it won't be visible, except that it could affect formatting in a multi-panel figure or if it would need to be shrunk to avoid overlapping with text.

Now for the seqids file:

Chr1,Chr2,Chr3,Chr4,Chr5,Chr6,Chr7,Chr8,Chr9,Chr10,Chr11,Chr12,Chr13,Chr14,Chr15,Chr16,Chr17
Chr1,Chr2,Chr3,Chr4,Chr5,Chr6,Chr7,Chr8,Chr9,Chr10,Chr11,Chr12,Chr13,Chr14,Chr15
Chr1,Chr2,Chr3,Chr4,Chr5,Chr6,Chr7,Chr8,Chr9,Chr10,Chr11,Chr12

All this file really does is specify the chromosome labels for each species. The top line (track 0) represents Zizania latifolia, the middle line (track 1) represents Zizania palustris, and the third line (track 2) represents Oryza sativa. They can be in whichever order you like, but the layout and seqids files need to agree with each other.

Note: Since we are reusing exising files (because why reinvent the wheel?), you might forget that you need to load the required programs. So, remember to load all of the necessary programs before you try to run any code 😃

export PATH=$PATH:/home/jkimball/haasx092/synteny_figure/last-1060/src
export PATH=$PATH:/home/jkimball/haasx092/synteny_figure/install-tl-20200505/1/bin/x86_64-linux

module load python

Then, all you need to run is:

python -m jcvi.graphics.karyotype seqids layout --format png

This "default" figure will look like this:

You can rotate whichever chromosome track you like by altering the the "rotation" parameter in the layout file:

#y, xstart, xend, rotation, color, label, va, bed, label_va
0.8, 0.18, 0.98, 15, #235e39, \it Z. latifolia \space -, top, latifolia_version_2.bed, center
0.6, 0.18, 0.98, 0, #4b0082, \it Z. palustris \space -, top, wild_rice.bed, center
0.4, 0.18, 0.98, -15, #ff7f00, \it O. sativa \space -, top, oryza.bed, center
# edges
e, 0, 1, wild_rice.latifolia_version_2.anchors.simple
e, 1, 2, wild_rice.oryza.anchors.simple

After modifying the layout file, run the same one line of code as you did before:

python -m jcvi.graphics.karyotype seqids layout --format png

Results in:

If you would rather have straight lines showing connections between syntenic regions (rather than Bezier curves), simply add --shadestyle=line to the code that creates the figure.

python -m jcvi.graphics.karyotype seqids layout --format png --shadestyle=line

Results in:

Coloring specific links

At some point, you may want to add color between links to draw your audience's attention to one or a few specific links. You can do this by making small edits to the .simple file(s). Also just for the record, the default grey is called "gainsboro" in the python named color universe. (I didn't know where else to mention this or if anyone would care, but I found it while investigating how the karyotype.py script works, so I thought I would mention it.

You add color by adding r* before the genes that you want to highlight. Below is a sample excerpt from a .simple file. I am not including a full .simple file in this README document because it would take up unnecessary space and I think this gets the point across. The "r" will make it red. If you want, you can pick other colors like blue (b), green (g), cyan (c), magenta (m), yellow (y), black (k), or white (w). You can also specify any color you want using its HEX code. More on that below.

FUN_001229      FUN_001281      LOC_Os09g27940  LOC_Os09g28560  52      -
FUN_001289      FUN_001454      LOC_Os09g24540  LOC_Os09g27590  205     -
FUN_001457      FUN_001632      LOC_Os09g36140  LOC_Os09g39870  247     -
FUN_001647      FUN_001719      LOC_Os09g33510  LOC_Os09g36040  103     -
FUN_002470      FUN_002503      r*LOC_Os01g39890  LOC_Os01g40700  48      -
FUN_002506      FUN_002599      r*LOC_Os01g36070        LOC_Os01g39630  146     -
FUN_002675      FUN_002796      r*LOC_Os01g16090        LOC_Os01g19970  173     +
FUN_002804      FUN_002841      r*LOC_Os01g20970        LOC_Os01g22954  66      +
FUN_002883      FUN_002979      r*LOC_Os01g13100        LOC_Os01g15290  130     +
FUN_002984      FUN_003152      r*LOC_Os01g09890        LOC_Os01g13040  212     +
FUN_003189      FUN_003284      r*LOC_Os01g07810        LOC_Os01g09770  127     +
FUN_003318      FUN_003365      r*LOC_Os01g05430        LOC_Os01g06640  71      -
FUN_003391      FUN_003453      r*LOC_Os01g01890        LOC_Os01g03980  103     +
FUN_003488      FUN_003588      LOC_Os08g01054  LOC_Os08g02860  127     +
FUN_003593      FUN_003761      LOC_Os08g03240  LOC_Os08g06640  213     +
FUN_003768      FUN_003798      LOC_Os08g07400  LOC_Os08g08210  46      +

Using the complete version of the above .simple file will result in the following figure:

Using HEX codes, the .simple file would look something like this:

FUN_014647      FUN_014795      LOC_Os05g50490  LOC_Os05g51860  132     -
FUN_014851      FUN_014891      LOC_Os05g49580  LOC_Os05g50130  47      -
FUN_015145      FUN_015178      LOC_Os02g42406  LOC_Os02g42850  34      -
#235e39*FUN_015215      FUN_015367      LOC_Os01g39770  LOC_Os01g42820  195     -
#235e39*FUN_015373      FUN_015488      LOC_Os01g36580  LOC_Os01g39380  146     -
#235e39*FUN_015503      FUN_015653      LOC_Os01g31110  LOC_Os01g35230  196     -
#235e39*FUN_015708      FUN_015846      LOC_Os01g18420  LOC_Os01g21440  158     +
#235e39*FUN_015892      FUN_016018      LOC_Os01g22336  LOC_Os01g25600  162     +
#235e39*FUN_016056      FUN_016149      LOC_Os01g15470  LOC_Os01g16230  75      +
#235e39*FUN_016190      FUN_016243      LOC_Os01g14310  LOC_Os01g14870  50      -
#235e39*FUN_016235      FUN_016350      LOC_Os01g16240  LOC_Os01g18100  109     -
#235e39*FUN_016439      FUN_016539      LOC_Os01g09590  LOC_Os01g10520  92      +
#235e39*FUN_016564      FUN_016713      LOC_Os01g10680  LOC_Os01g12690  162     +
#235e39*FUN_016715      FUN_016919      LOC_Os01g07080  LOC_Os01g08860  181     -
#235e39*FUN_016923      FUN_017105      LOC_Os01g02860  LOC_Os01g06836  243     -
#235e39*FUN_017109      FUN_017164      LOC_Os01g01010  LOC_Os01g02200  76      -
#235e39*FUN_017179      FUN_018308      LOC_Os01g58080  LOC_Os01g74570  1230    -
#235e39*FUN_018309      FUN_019085      LOC_Os01g45140  LOC_Os01g57940  894     -
#235e39*FUN_019090      FUN_019163      LOC_Os01g42840  LOC_Os01g44360  95      -
FUN_019632      FUN_019675      LOC_Os04g08060  LOC_Os04g09260  50      -
FUN_019707      FUN_019766      LOC_Os04g01070  LOC_Os04g03100  88      - 

Note: I used #235e39 because I think it is a particularly pretty shade of green. #235e39 = #ff7f00

You can also choose to use any legal HTML color that you like (e.g., "burlywood" or "chartreuse").

In addition to color, you can choose various shades of grey by specifying a number between 0 and 1 (e.g., 0.5 as in the example below).

FUN_014647      FUN_014795      LOC_Os05g50490  LOC_Os05g51860  132     -
FUN_014851      FUN_014891      LOC_Os05g49580  LOC_Os05g50130  47      -
FUN_015145      FUN_015178      LOC_Os02g42406  LOC_Os02g42850  34      -
0.5*FUN_015215      FUN_015367      LOC_Os01g39770  LOC_Os01g42820  195     -
0.5*FUN_015373      FUN_015488      LOC_Os01g36580  LOC_Os01g39380  146     -
0.5*FUN_015503      FUN_015653      LOC_Os01g31110  LOC_Os01g35230  196     -
0.5*FUN_015708      FUN_015846      LOC_Os01g18420  LOC_Os01g21440  158     +
0.5*FUN_015892      FUN_016018      LOC_Os01g22336  LOC_Os01g25600  162     +
0.5*FUN_016056      FUN_016149      LOC_Os01g15470  LOC_Os01g16230  75      +
0.5*FUN_016190      FUN_016243      LOC_Os01g14310  LOC_Os01g14870  50      -
0.5*FUN_016235      FUN_016350      LOC_Os01g16240  LOC_Os01g18100  109     -
0.5*FUN_016439      FUN_016539      LOC_Os01g09590  LOC_Os01g10520  92      +
0.5*FUN_016564      FUN_016713      LOC_Os01g10680  LOC_Os01g12690  162     +
0.5*FUN_016715      FUN_016919      LOC_Os01g07080  LOC_Os01g08860  181     -
0.5*FUN_016923      FUN_017105      LOC_Os01g02860  LOC_Os01g06836  243     -
0.5*FUN_017109      FUN_017164      LOC_Os01g01010  LOC_Os01g02200  76      -
0.5*FUN_017179      FUN_018308      LOC_Os01g58080  LOC_Os01g74570  1230    -
0.5*FUN_018309      FUN_019085      LOC_Os01g45140  LOC_Os01g57940  894     -
0.5*FUN_019090      FUN_019163      LOC_Os01g42840  LOC_Os01g44360  95      -
FUN_019632      FUN_019675      LOC_Os04g08060  LOC_Os04g09260  50      -
FUN_019707      FUN_019766      LOC_Os04g01070  LOC_Os04g03100  88      - 

A note regarding using RGB (red, green, blue) color notation:
Another way of specifying colors is with the RGB format or RGBA format where a=alpha, a value indicating how transparent a color should be where 0=fully transparent and 1=fully opaque). In RGB/RGBA format, each value needs to be a float value between 0 and 1 so if you have integers (such as 35,94,57 that matches HEX code #235e39 above) you could convert them to floats by dividing each value by 255. I do not think it is possible to specify a specific color in this particular way. I tried every variation of syntax that I thought would work, but every single one of them resulted in a traceback error. I did some digging through source code and I think I understand why RGB notation cannot be used. The code for MCscan is written in python and as such the colors are ultimately derived from matplotlib. The links are drawn using code contained within the synteny.py script from MCscan. In that script, an object of class Shade() is defined. There are several optional parameters including two relevant parameters called edge color (ec) and face color (fc) with the default set to black (k). In the karyotype.py script, the default value for both ec and fc parameters is changed to "gainsboro" which explains why the default links are grey and not black.

I think the thing that is key here is that the parameters expect strings as input. This is why single-color codes, HEX codes, named colors, and even float values (if you desire greyscale) all work: they can all be interpreted as string values. Even though the greyscale values are technically floats, they can be interpreted as strings before being converted to floats. The RGB notation would be interpreted as a tuple. With this in mind, there is probably a solution to enable the usage of RGB values involving editing the synteny.py source code, but it's probably just easier to translate your RGB values to a HEX code using the internet. 🌐

Microsynteny

In some cases, you may want to look at a specific region more closely, such as the one surrounding the shattering4 (sh4) gene.

Claudia microsynteny

The purpose of this project was to visualize a cluster of genes involved in phytoalexin synthesis in NWR. Claudia created a nice PowerPoint presentation including a comparative genomics slide looking at microsynteny between Zizania palustris, Zizania latifolia, and Oryza sativa. The genomic interval of interest was on Zizania palustris chromosome 6 (ZPchr0006) between 33,535,817 bp and 33,594,019 bp.

That image looked like this:

Using the start/end values of this interval as a starting point, I found several genes in this interval (FUN_030362, FUN_030363, FUN_030364, novel_gene_1089_5e3b9126, novel_gene_140_5e3b9126, FUN_030365, FUN_030366, FUN_030367, FUN_030368) in the file wild_rice_oryza.bed. I did this by searching for the staring point (33535817). You do this by opening the file (less wild_rice_oryza.bed), then hitting the esc (escape) key, then the shift (shift) and : (colon) keys (together). You should see a colon appear in the bottom left corner of the terminal window. Hit the forward slash key (/), then the starting number of the interval (33535817), then hit enter. The first hit is in fact on chromosome 6, so I know that I am getting the right genes/on the right chromosome. The start position is in column 2 and the end position is in column 3. In fact, you can see the end position of the interval (33594019) a few rows down. The genes of interest (in this interval) are in column 4.

In order to identify which genes from the other species correspond to these NWR genes, I had to search the pre-existing anchor files (e.g., wild_rice_filt.latifolia_version_2.lifted.anchors). These are MC-scan specific files that show homologous genes. This is how I created (manually) the helper text file claudia_blocks_ordered.

In order to get the coordinates for each gene, you'll need to use a bed file. I created the file 3species_micro-collinearity_copy.bed by concatenating the component species' bed files:

cat latifolia_version_2.bed wild_rice_oryza.bed > 3species_micro-collinearity_copy.bed

Note: The file wild_rice_oryza.bed already existed and was in the correct order (Zizania palustris then Oryza sativa). If that wasn't the case, you would need to concatenate the individual files (latifolia_version_2.bed, wild_rice.bed, and oryza.bed) in the order that you want them to appear in the bed file.

One thing to pay attention to is that the species in the bed file must occur in the order that you want them to appear in the figure.

After the first attempt at generating this figure, the image looks like this:

Note: The output of the MCscan code is a PDF file. Before uploading it here, I first converted it to PNG format (as I do for all of my GitHub images), but I also cropped out excess whitespace because there is often excess whitespace around MCscan figures (especially the karyotype figures like this one). Also for this reason, some of the example figures shown below might look like they are different sizes. This is because when I cropped out the excess white space, I didn't pay super close attention to ensuring that I cropped out exactly the same amount of white space each time. I figured it isn't really necessary for purposes of this tutorial (vs. preparing a figure for publication in a manuscript).

In the initial figure, you'll notice that the Zizania latifolia and Oryza sativa chromosomes have gene names from other species overlapping with gene names from the appropriate species. This should not happen. I think the reason this happened is because MCscan was pulling information about which chromosomes to use from the blocks.layout file. Because both Zizania latifolia and Oryza sativa have similar genomic intervals (22.55-22.62 Mb and 21.66-21.86 Mb, respectively) despite being on different chromosomes (Chr08 and Chr02, respectively), MCscan pulled information from Oryza sativa chromosome 8 in the bed file and plotted it along Zizania latifolia chromosome 8 genes. It did the same with Zizania latifolia chromosome 2 genes--plotting them alongside Oryza sativa chromosome 2 genes. The Zizania palustris chromosome 6 did not plot gene names from either of the other species because its interval (33.54-33.59 Mb) must not have any genes in the bed file from those species.

I thought that filtering out all of the unused/irrelevant information from the bed file would solve this problem. After trying it to see if I was right, I found that I was indeed right, so I will describe the steps that I took to achieve this next.

Filtering the bed file:

awk '$4~/^Zla08/' 3species_micro-collinearity.bed >> 3species_micro-collinearity_filtered.bed
awk '$1~/Chr6/ && $4~/^FUN/' 3species_micro-collinearity.bed 3species_micro-collinearity.bed >> 3species_micro-collinearity_filtered.bed
awk '$4~/^LOC_Os02/' 3species_micro-collinearity.bed >> 3species_micro-collinearity_filtered.bed

Note: I used the >> symbol so I could iteratively add (append) selections for each specific species/chromosome pair without overwriting the previous work, which would have happened if I used the > symbol (writes the "standard output" from a Unix command to a file). Here's the logic of the AWK code: Field 4 ($4) in the bed file contains the gene names and both the Zizania latifolia and Oryza sativa gene names indicate both species (e.g.,"Zla" or "LOC_Os") and chromosome number (e.g., "08" or "02"), so I was able to select on both factors simultaneously. Because the gene names for Zizania palustris are the original gene names from the FUNannotate pipeline, they have the prefix "FUN" and so do not contain any reference to species or chromosome number. This is why I also had to select the first field/column ($1) to select for the chromosome that we wanted.

There was another change that I had to make to improve upon the initial figure. In the initial figure, you'll notice that the links go from the Zizania latifolia chromosome directly to that of Oryza sativa. There are no links from Zizania palustris to Oryza sativa. That's because in the initial blocks.layout file, I the line e, 0, 1 repeated twice under #edges. Changing the second line to e, 1, 2 fixed this. Here is an excerpt from the MCscan tutorial:

Like the layout file in the macro-synteny section, edges stanza say connecting grape (column 0) with peach (column 1) and grape (column 0) with cacao (column 2).

That figure has one chromosome (grape) on top and two chromsomes (peach and cacao) next to each other on the . In any case, I think you should interpret the blocks.layout file shown below as "connect Zizania latifolia to Zizania palustris (e.g., 0,1) and connect Zizania palustris to Oryza sativa (e.g., 1,2)."

The top three lines (apart from the header line) indicate where the chromosome lines themselves are oriented. I think the x-axis value (0.6) for all ends up being the center of the chromosome while the y-axis values (0.8, 0.6, and 0.4) orient the chromosomes vertically. I set horizontal alignment ("ha") to be "left", but I think the lengths of some of the chromosomes (especially Oryza sativa) impact this in practice.

The blocks.layout file should look like this:

# x, y, rotation, ha, va, color, ratio, label, chr
0.6, 0.8, 0, left, center, #235e39, 1, Zizania\ latifolia, chr08
0.6, 0.6, 0, left, center, #4b0082, 1, Zizania\ palustris,  chr06
0.6, 0.4, 0, left, center, #ff7f00, 1, Oryza\ sativa, chr02
# edges
e, 0, 1
e, 1, 2

(The species names are currently not in italics like I think they should be, but that's a problem for another day.)

Another thing that we might want to do is change the color of the genes to show homology between species. The default colors are blue and green which I think represent the strand that the gene is on. In order to change the color of the genes, we simply add --glyphcolor=orthogroup to the script so that it now looks like this:

python -m jcvi.graphics.synteny claudia_blocks_ordered 3species_micro-collinearity_filtered.bed blocks.layout --genelabelsize=5 --glyphcolor=orthogroup

Now, the figure looks like this:

Now that some of the bigger issues with the figure have been addressed, it is time to figure out how to italicize the species names (to fit with convention). Simplly adding \it in front of each species name (and removing the \ after the genus names is enough to do achieve this.

The blocks.layout file should now look like this:

# x, y, rotation, ha, va, color, ratio, label, chr
0.6, 0.8, 0, left, center, #235e39, 1, \it Zizania latifolia, chr08
0.6, 0.6, 0, left, center, #4b0082, 1, \it Zizania palustris,  chr06
0.6, 0.4, 0, left, center, #ff7f00, 1, \it Oryza sativa, chr02
# edges
e, 0, 1
e, 1, 2

Now, the figure looks like this:

If you want to change the color of the species names to black, that's also very easy to do. The HEX codes usesd so far in the blocks.layout file were used to match the colors used to represent each species in the genome paper for consistency.

The blocks.layout file will look like this:

# x, y, rotation, ha, va, color, ratio, label, chr
0.6, 0.8, 0, left, center, black, 1, \it Zizania latifolia, chr08
0.6, 0.6, 0, left, center, black, 1, \it Zizania palustris,  chr06
0.6, 0.4, 0, left, center, black, 1, \it Oryza sativa, chr02
# edges
e, 0, 1
e, 1, 2

Now, the figure will look like this:

Another issue with the figure is that it still uses the "FUN" gene names that were the original gene names from the FUNannotate pipeline. We've since adopted a "ZPchrXXXX" nomenclature. This wasn't done for a variety of reasons and the nuances might be too detailed to mention here, but the steps to change the names in the figure are simple.

The gene names need to be switched in two file: claudia_blocks_ordered and 3species_micro-collinearity_filtered.bed

I made a copy of claudia_blocks_ordered and named it claudia_blocks_ordered_ZP_names. I manually edited this file with vi because it is so small. There's only 10 gene names that needed to be switched. I looked up the "FUN" gene names in the file 201003_updated_annotation_file.csv in Excel and used the corresponding "ZP" gene to replace the "FUN" gene in claudia_blocks_ordered_ZP_names. So, FUN_030362 became ZPchr0006g40673, FUN_030363 became ZPchr0006g45140, and so forth. One thing to note is that the "ZP" genes are not in the same numeric order as the "FUN" genes. I'm not sure why, but I don't think it matters that much. Probably just an artefact of how the gene names were changed.

Editing the bed file is a bit trickier since it is so much larger. It also contains far more genes than we actually need to change. The file claudia_blocks_ordered_ZP_names basically tells MCscan which genes we are interested in looking at while the bed file has the coordinates that are important for mapping. So, we only really need to change the "FUN" gene names that appear in the file claudia_blocks_ordered_ZP_names.

To achieve this, I used the following code on the command line:

sed -i 's/FUN_030362/ZPchr0006g40673/g' 3species_micro-collinearity_filtered_copy.bed
sed -i 's/FUN_030363/ZPchr0006g45140/g' 3species_micro-collinearity_filtered_copy.bed
sed -i 's/FUN_030364/ZPchr0006g41229/g' 3species_micro-collinearity_filtered_copy.bed
sed -i 's/novel_gene_1089_5e3b9126/ZPchr0006g45307/g' 3species_micro-collinearity_filtered_copy.bed
sed -i 's/novel_gene_140_5e3b9126/ZPchr0006g43296/g' 3species_micro-collinearity_filtered_copy.bed
sed -i 's/FUN_030365/ZPchr0006g43754/g' 3species_micro-collinearity_filtered_copy.bed
sed -i 's/FUN_030366/ZPchr0006g43762/g' 3species_micro-collinearity_filtered_copy.bed
sed -i 's/FUN_030367/ZPchr0006g43262/g' 3species_micro-collinearity_filtered_copy.bed
sed -i 's/FUN_030368/ZPchr0006g43948/g' 3species_micro-collinearity_filtered_copy.bed

This will do a search-and-replace for the "FUN" genes and replace them with their "ZP" gene equivalent.

Now when we run MCscan, we can use these slightly altered files and we will get the same figure as above, but with "ZP" names:

python -m jcvi.graphics.synteny claudia_blocks_ordered_ZP_names 3species_micro-collinearity_filtered_copy.bed blocks.layout --genelabelsize=5 --glyphcolor=orthogroup

Another thing (maybe the last thing?) to do to improve the figure is to add the chromosome number next to the species name. This is easy enough to achieve. For example, all you need to do is add "Chr08" after Zizania latifolia or "Chr06" after Zizania palustris in the blocks.layout file. However, the font is still italicized and I think it would look better if the chromosome label were in standard font. The way to achieve this is to add \normalfont after the species name, but before the chromosome number. So now your blocks.layout file should look like this:

# x, y, rotation, ha, va, color, ratio, label, chr
0.6, 0.8, 0, left, center, black, 1, \it Zizania latifolia \normalfont Chr08, chr08
0.6, 0.6, 0, left, center, black, 1, \it Zizania palustris \normalfont Chr06,  chr06
0.6, 0.4, 0, left, center, black, 1, \it Oryza sativa \normalfont Chr02, chr02
# edges
e, 0, 1
e, 1, 2

And the figure should look like:

Ok, so Claudia would ideally like the chromosome label on the same line at the genomic interval (given in Mb). Theoretically, this should be possible but it would require edited the source code (like I've done for a few different figures in the past).

Depending on which version of Python you are using, the source code might either be in:

/home/jkimball/haasx092/.local/lib/python3.7/site-packages/jcvi/graphics

For Python 3.7
or

/home/jkimball/haasx092/.local/lib/python3.8/site-packages/jcvi/graphics

For Python 3.8

I'm making every effort to make sure everyone in the lab will have permission to access these files in my system, but if you can't access them, you can always download your own instance of MCscan.

Back to the main problem at hand. Based on the code/commands used to create the figure, I think the script that needs to be edited is synteny.py. I found a note that might be relevant on lines 339-344 (in the Python 3.8 version).

   # TODO: I spent several hours on trying to make this work - with no
   # good solutions. To generate labels on multiple lines, each line
   # with a different style is difficult in matplotlib. The only way,
   # if you can tolerate an extra dot (.), is to use the recipe below.
   # chr_label = r"\noindent " + markup(chr) + r" \\ ." if chr_label else None
   # loc_label = r"\noindent . \\ " + label if loc_label else None

I made the changes suggested to lined 346 and 347 (just below this comment), but it didn't seem to alter the behavior. Actually, the authors might have intended for users to change the behavior of the program by commenting out the code on line 346 and 347 and removing the pound signs on lines 343-344. (Note: line numbers are different in the Python 3.7 version --lines 283-288--. It's possible that a different (dependent) script actually has the code that needs to be modified. What I'm looking for is code that reads the position information from the bed file and prints it the graphics device. The chromosome IDs are also in the bed file, so it should be simple to extract that information from the file and print it at the same time. If I can find the part of the code that does that.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages