Whole Genome Assembly and Annotation of Northern Wild Rice (Zizania palustris L.), a North American Grain Supports a Whole Genome Duplication Event within the Species
This repository supports the work that went into characterizing the Northern Wild Rice (Zizania palustris L.) genome.
The manuscript was posted to bioRxiv on 12 March 2021 and can be found here. Upon publication, that link will be shared as well.
The annotation of the NWR genome was performed by Marissa Machietto and Thomas Kono from the Minnesota Supercomputing Institute - Research Informatics Support (MSI-RIS). Those scripts are located here.
There are no scripts for the genome assembly itself because that work was performed by Dovetail Genomics. The genome has been deposited at the United States National Center for Biotechnology Information (NCBI) under BioProject number PRJNA600525.
Please use the directory to navigate this README to find the scripts for a particular analysis or figure with ease. This README is best viewed in Light Mode.
- Figure 1
- Figure 2
- Figure 3
- Figure 4
- Supporting Figure S1
- Supporting Figure S2
- Supporting Figure S3
- Supporting Figure S4
- Supporting Figure S5
- Supporting Figure S6
- Supporting Figure S7
- Supporting Figure S8
- Other scripts
This figure shows the phylogenetic relationship between 20 species used in the initial OrthoFinder analysis. The data to make the tree comes from OrthoFinder and the tree was created using Dendroscope. This is not a command-line program, so there is not code to share, but the file we used as input fo the program was SpeciesTree_rooted.txt. The version of the file with node labels is called SpeciesTree_rooted_node_labels.txt. Divergence times from OrthoFinder were added manually and are in units of million years ago (MYA).
Related to this figure, we estimated the divergence time between Z. palustris and O. sativa. To estimate the divergence time, we used the program mcmctree
from Phylogenetic Analysis by Maximum Liklihood (PAML) version 4. The script to do this is run_paml.sh.
This figure shows the number of orthogroups in common with (and private to) NWR and four other major grass species (Oryza sativa, Zea mays, Sorghum bicolor, and Brachypodium distachyon). The data come from an independent run of OrthoFinder so that the orthogroup counts shown in the figure would only include orthogroup shared by these species and none from the larger set of 20 species shown in the species tree. (Note: each orthogroup may contain one or more genes.) The native venn.diagram
function from the VennDiagram package does not use a comma as a separator for the thousands place, so we modified the function and saved the script as modified_venn_diagram.R to force the function to use a comma separator. The script used to generate this figure is called venn_diagram_orthogroups.R. It uses the source()
function to call the modified version of the venn.diagram
function. You must include the modified_venn_diagram.R script in your working directory (or give the path to its location) in order for it to function properly.
Figure 1C was created using the MCscan program (run_jcvi.sh). I called the program JCVI in the script name rather than MCscan because the scripts are found in a directory called jcvi in the GitHub repository for the MCscan code. You should use the seqids and layout files in conjuntion with run_jcvi.sh, but note that you should just call them seqids and layout; and sequester them away from the other seqids and layout files used for the comparison with Z. latifolia. I changed the names here only so that the files would not have the same name which would overwrite the other.
The karyotype.py script was originally written by Haibao Tang and can be found here. I am including the script here because I modified it in order to create my plots.
- Line 40 was altered so that
arg[5]
(the name we assign to each track in the layout file) is printed in italics. - Line 239 was also changed (dividing vpad by 2 was removed) to make extra room on the margin so that Zizania palustris could be fully written out (versus abbreviating it as Z. palustris--which also didn't fit initally--it ran into the representations of the chromosomes.
This figure was created using the MCscan program. The script that I ran on the server was run_jcvi.sh. I called the program JCVI in the script name rather than MCscan because the scripts are found in a directory called jcvi in the GitHub repository for the MCscan code.
The dotplot.py script was originally written by Haibao Tang and can be found here. I am including the script here because I modified it in order to create my plots. The following changes were made by hard-coding my desired output into the original script:
- The font color of the chromosome labels and positions were changed from grey to black
- The labels for the x and y axes were changed to Zizania palustris and Oryza sativa (respectively) rather than
wild_rice
andoryza
(which are theBED
file names) - The xlimit was slightly increased (along with the length of chr 15, scf 16, and scf 458 (in order to make the chromosome labels legible).
The synteny.py script was originally written by Haibao Tang and can be found here. I am including the script here because I modified it in order to create my plots. The shell script that I wrote to generate the figure is called run_micro-collinearity.sh. The blocks.layout file should be included with run_micro-collinearity.sh to work properly.
- Line 61 was modified so that the species label
args[7]
will be printed in italics. - I also added another argument
args[8]
so that the chromosome label will not be in italics.
This figure shows one of the genes important for shattering called shattering4 (sh4) in O. sativa. sh4 is in the center and we opted to plot 10 genes on other side of sh4 in O. sativa as well as its putative ortholog in Z. palustris. The green/blue colors indicate which strand the gene exists on.
Like Figure 1C, Figure 1F was created using the MCscan program (run_jcvi_with_latifolia.sh). The script is eseentially the same, except that it compares NWR to Z. latifolia rather than O. sativa. I called the program JCVI in the script name rather than MCscan because the scripts are found in a directory called jcvi in the GitHub repository for the MCscan code. The layout and seqids files are used by the script and must be in the same directory as run_jcvi_with_latifolia.sh to work properly. Like the case for the seqids and layout files for showing synteny between Z. palustris and O. sativa in Figure 2C, just call the files seqids and layout--but make sure they are kept separate.
The karyotype.py script was originally written by Haibao Tang and can be found here. I am including the script here because I modified it in order to create my plots.
- Line 40 was altered so that
arg[5]
(the name we assign to each track in the layout file) is printed in italics. - Line 239 was also changed (dividing vpad by 2 was removed) to make extra room on the margin so that Zizania palustris could be fully written out (versus abbreviating it as Z. palustris--which also didn't fit initally--it ran into the representations of the chromosomes.
This script generated the Circos plot shown in Figure 1. The figure shows the genome-wide distribution of genes and repetitive elements. The legend was added in PowerPoint. This shell script run_main_circos.sh is used in conjunction with the Circos configuration file which you can find here.
This figure shows the distribution of synonymous substitution rates and ratios of orthologs between Z. palustris and O. sativa; and between Z. palustris and Z. latifolia.
This script generated the Circos plot shown in Figure 4. The figure shows SNP density at 2-, 4-, and 8-fold downsampling levels. The legend was added in PowerPoint. This shell script run_downsampled_circos.sh is used in conjunction with the Circos configuration file which you can find here. The scripts used to find SNPs (and thus allow us to calculate their density) can be found here.
The purpose of this code was to add a scale bar to the image of tissues collected for the RNA-seq portion of the study. This work was done in a Jupyter Notebook using Python. The Jupyter Notebook file is add_scalebar_to_annotation_photo.ipynb. The letters were added in PowerPoint.
The script get_pacbio_seqids.sh extracts the headers/SeqIDs from each FASTQ
file containing the PacBio reads and writes the headers/SeqIDs to a plain text file and concatenates all eight of these files into a single text file called all_headers_concat.txt
. The python script get_pacbio_read_lengths.py is launched using the shell script run_get_pacbio_read_lengths.sh. The python script get_pacbio_read_lengths.py not only extracts the length of each PacBio read from the file all_headers_concat.txt
, but it plots them using a histogram (shown below). The file name of the plot has been hard-coded into the python script.
Plots show basic genome assembly statistics. There is no code for these figures. They are output figures from QUAST (Quality Assessment Tool for Genome Assemblies).
The script WR_repeats_karyoplot.R was used to generate these figures.
This figure came from Marissa & Tom. The code used to generate it can be found here. The letters were added in PowerPoint.
This combined figure came from multiple R scripts used to parse gene ontology (GO) data and plot the most abundant GO terms. The R scripts used to generate this figure are located in the gene_ontology subdirectory. They include a modified version of the pie chart function. That script (modified_pie_function.R) must be loaded when using these scripts using the source()
function. You should make sure that the modified version of the script is in your working directory, otherwise it will not work properly.
The point of this figure was to create a secondary version of the venn diagram in Figure 2B using rice relatives instead of major grass species. O. sativa is the exception because, like Z. palustris, it is featured in both venn diagrams. O. glaberrima and O. rufipogon were included instead of O. barthii and O. nivara because those pairs of species (O. glaberrima + O. barthii and O. rufipogon + O. nivara) are not too distantly related so we get the same evolutionary relationship and we wanted to avoid a more cluttered figure.
Like the venn diagram in Figure 2B, we used the modified venn.diagram
fuction found in the modified_venn_diagram.R script to use commas as a thousands separator. The script used to create this figure is called venn_diagram_orthogroups_with_rice_relatives.R. The script uses the source()
function to call the modified version of the venn.diagram
function. Ensure that the modified_venn_diagram.R script is in your working directory (or give the path to its location) in order for it to function properly.
We were interested in investigating the types of genes that are unique to Z. palustris so we used the script get_unique_NWR_genes.py to extract the gene names based on their orthogroup IDs. The orthogroup IDs were obtained by parsing the Orthogroup.GeneCount.tsv file so that we would get orthogroups for which O. sativa, O. glbarrima, O. rufipogon, and Z. latifolia had a count of "0" while Z. palustris had a count greater than "0". This resulted in 1,731 orthogroup IDs as expected from the venn diagram. Our script get_unique_NWR_genes.py uses the text file NWR_unique_orthogroup_list.txt containing these 1,731 orthogroup IDs (one per line) and extracts SeqIDs (using BioPython) from the orthogroup protein FASTA
files located in the OrthoFinder output Orthogroup_Sequences
directory. The output contains 6,624 gene names which are found in this file. There is no associated shell script to launch this python script. Instead, I just ran it on the command line. First, I loaded python 3 with the command module load python3
. Second, I ran the script with the following command:
python get_unique_NWR_genes.py NWR_unique_orthogroup_list.txt list_of_NWR_unique_genes.txt
Note: the file names for sys.argv[1]
and sys.argv[2]
can really be anything you choose, but since sys.argv[1]
is the input for the python script, it must exist. The file name for sys.argv[2]
is somewhat arbitrary, but it should be meaningful.
This figure is for the tissue specificity work. The code to generate these figures was written by Tom Kono and can be found here. Figures were combined and letters added in PowerPoint to make the compound figure in the manuscript.
The file blast_for_shattering_genes.txt contains the commands used to detect shattering genes in NWR using sequences from Oryza species. These commands were carried out on the command line and not submitted to run through the scheduling system - so there is no PBS or SLURM header. The input FASTA
files and BLAST output files are also located in the blast_for_shattering subdirectory.