KEGG Update in RAVEN
This is the step-by-step protocol for developers to update the KEGG database that comes bundled with the RAVEN Toolbox 2. Since the database is updated from KEGG FTP dump files, a KEGG FTP subscription is required. The protocol gives a detailed overview of how to generate a new KEGG database and KEGG Orthology (KO)-specific Hidden Markov Model (HMM) sets from the KEGG FTP content. In addition, the key KEGG FTP dump files are saved for later use, so the full KEGG update for RAVEN can always be re-run from these files. This is helpful if major inconsistencies are later found in the KEGG `mat` files or HMM sets.
NOTE: Individual users who only aim to reconstruct a model from KEGG FTP dump files should use this guide instead.
NOTE: It is unlikely that a single user has access to all three systems (macOS, Unix/Linux, Windows), so some collaboration between multiple users may be necessary.
- I-a. Go to https://www.genome.jp/kegg/docs/relnote.html and find the latest KEGG Release number.
- I-b. Create the following nested structure of directories somewhere in your personal Box Sync: `keggUpdate/euk90_keggXX/keggdb`. In this protocol, `XX` indicates the release number; use the number identified in Step I-a. It is sufficient to include the major release number, e.g., 88.
- I-c. Log in to KEGG FTP via `ftp.bioinformatics.jp` using SysBio or personal credentials. It is strongly recommended to use FTP client software such as `FileZilla`, as several of the files to be downloaded are large.
- I-d. Download all the source files to the `keggdb` directory from the following locations in KEGG FTP:
  - `/kegg/ligand/reaction.tar.gz`
  - `/kegg/ligand/compound.tar.gz`
  - `/kegg/ligand/glycan.tar.gz`
  - `/kegg/genes/ko.tar.gz`
  - `/kegg/genes/fasta/eukaryotes.pep.gz`
  - `/kegg/genes/fasta/prokaryotes.pep.gz`
  - `/kegg/genes/misc/taxonomy`
- I-e. Extract all the downloaded archives. Only specific files from each archive are required, and they must be placed directly in the `keggdb` directory. This step, including the clean-up of temporary files, can be accomplished in `Terminal` (in Windows, use `Cygwin Terminal` instead):

      for f in *.tar.gz; do tar xf "$f" && rm "$f"; done
      gunzip *.gz
      mv reaction/reaction reaction_raw
      mv reaction/reaction.lst reaction.lst
      mv reaction/reaction_mapformula.lst reaction_mapformula.lst
      mv compound/compound compound_raw
      mv compound/compound.inchi compound.inchi
      mv glycan/glycan glycan_raw
      mv ko/ko ko_raw
      rm -rf reaction && rm -rf compound && rm -rf glycan && rm -rf ko
      mv reaction_raw reaction && mv ko_raw ko

- I-f. Concatenate the compound files and the multi-fasta protein files:

      cat compound_raw glycan_raw > compound
      cat eukaryotes.pep prokaryotes.pep > genes.pep
      rm *_raw && rm *tes.pep

- I-g. Ensure that the `keggdb` folder contains only the following 8 files:
  - `compound`
  - `compound.inchi`
  - `genes.pep`
  - `ko`
  - `reaction`
  - `reaction_mapformula.lst`
  - `reaction.lst`
  - `taxonomy`
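
The file check in Step I-g can also be scripted. The following is a minimal sketch and not part of RAVEN: the function name `check_keggdb` is hypothetical, and it simply compares the directory listing against the eight file names given above.

```shell
# check_keggdb DIR  (hypothetical helper, not part of RAVEN)
# Succeeds only if DIR contains exactly the eight files listed in Step I-g.
check_keggdb() {
    kd_expected=$(printf '%s\n' compound compound.inchi genes.pep ko \
        reaction reaction.lst reaction_mapformula.lst taxonomy | LC_ALL=C sort)
    # ls -A also lists hidden files (e.g. .DS_Store), which count as extras
    kd_actual=$(ls -A "$1" | LC_ALL=C sort)
    if [ "$kd_expected" = "$kd_actual" ]; then
        echo "OK: $1 contains exactly the 8 expected files"
    else
        echo "MISMATCH: $1 contains:" >&2
        echo "$kd_actual" >&2
        return 1
    fi
}

# Usage, from the directory that holds keggUpdate:
#   check_keggdb keggUpdate/euk90_keggXX/keggdb
```

Because hidden files are included in the comparison, a stray `.DS_Store` left behind by macOS is reported as a mismatch rather than silently ignored.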
- II-a. Install the latest RAVEN version and its dependencies using the instructions online. Download RAVEN to the local machine using `GitHub Desktop`, `GitKraken` or simply `Git` in `Terminal`.
- II-b. Make sure that the GitHub user who is going to commit the changes and create a Pull Request has write permissions in the RAVEN repository.
- II-c. In the local RAVEN repository, change the current branch from `main` to `devel`.
- II-d. Create a new branch based on `devel` and name it e.g., `fix/keggUpdate`. Then change the current branch to this newly created branch.
- II-e. Run `checkInstallation` to ensure that RAVEN is working correctly.
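
For those using plain `Git` in `Terminal`, Steps II-c and II-d can be sketched as below. This is an illustration only: the function name `start_kegg_branch` is hypothetical, and it assumes that the local clone already has a `devel` branch (or can create one from the remote on checkout).

```shell
# start_kegg_branch REPO_DIR  (hypothetical helper)
# Switches the clone in REPO_DIR to devel (Step II-c), then creates and
# switches to a new branch fix/keggUpdate based on it (Step II-d).
start_kegg_branch() {
    git -C "$1" checkout devel &&
    git -C "$1" checkout -b fix/keggUpdate
}

# Usage:
#   start_kegg_branch /path/to/RAVEN
#   git -C /path/to/RAVEN status    # should report the branch fix/keggUpdate
```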
NOTE: During this step, we only consider updating the external binaries required for KEGG-based GEM reconstruction: `MAFFT`, `CD-HIT` and `HMMER`. While the update is simple for `MAFFT`, it takes additional effort and time to prepare the updates for `CD-HIT` and `HMMER`. The latter is complicated by the need to prepare the binaries for all three systems (macOS, Unix/Linux and Windows).
- III-a. Update `MAFFT`: As of version 7.427, the developer distributes standalone all-in-one packages for macOS, Unix and Windows systems:
  - III-a-i. Go to the official `MAFFT` website and identify the latest version that is available as an all-in-one package for macOS, Unix/Linux and Windows simultaneously. For Unix/Linux, do not consider the `rpm` and `deb` versions, but look for the Portable package instead. If the newest version is not available for all three systems, use the most recent older version that is. For instance, in August 2019 the newest `MAFFT` version for Windows was 7.429, but since it was not available for the other two systems, the older version 7.427, which was available for all three systems, was used.
  - III-a-ii. Remove all the files from the `RAVENdir/software/mafft` directory.
  - III-a-iii. Download the three `MAFFT` all-in-one/portable packages identified in Step III-a-i. Also, download the license file (e.g., version 7.427 has a BSD license). Place all the downloaded files in the `RAVENdir/software/mafft` directory.
  - III-a-iv. Extract all three downloaded archives. Ensure that all three `MAFFT` executable files, called `mafft.bat`, are available in the following locations:
    - `RAVENdir/software/mafft/mafft-linux64/mafft.bat`
    - `RAVENdir/software/mafft/mafft-mac/mafft.bat`
    - `RAVENdir/software/mafft/mafft-win/mafft.bat`
  - III-a-v. Remove the three downloaded archives from the `RAVENdir/software/mafft` directory. This directory must now contain only the license file and three sub-directories.
  - III-a-vi. Update the version number for `MAFFT` in `RAVENdir/software/versions.txt` with the one identified in Step III-a-i.
  - III-a-vii. Commit the changes through 5 commits: three commits (one per operating system), one commit for the license file and one commit for the version change in `RAVENdir/software/versions.txt`.
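
As a quick sanity check after Step III-a-iv, the presence of the three `mafft.bat` files can be verified from `Terminal`. This is a sketch under the directory layout given above; the function name `check_mafft` is hypothetical, and its argument stands for `RAVENdir`, the path to the local RAVEN repository.

```shell
# check_mafft RAVENDIR  (hypothetical helper)
# Reports, for each operating system, whether mafft.bat is where
# Step III-a-iv expects it; fails if any of the three is missing.
check_mafft() {
    cm_status=0
    for cm_os in linux64 mac win; do
        cm_file="$1/software/mafft/mafft-$cm_os/mafft.bat"
        if [ -f "$cm_file" ]; then
            echo "found:   $cm_file"
        else
            echo "missing: $cm_file"
            cm_status=1
        fi
    done
    return "$cm_status"
}

# Usage:
#   check_mafft /path/to/RAVEN
```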
- III-b. Update `CD-HIT` and `HMMER`:
  - III-b-i. Go to the official `CD-HIT` and `HMMER` websites and identify their latest versions.
  - III-b-ii. Remove all the files from the `RAVENdir/software/cd-hit` and `RAVENdir/software/hmmer` directories.
  - III-b-iii. In macOS, try to install `CD-HIT` and `HMMER` using `Homebrew` in `Terminal`:

        brew tap brewsci/bio
        brew install cd-hit
        brew install hmmer

    Then make sure that the installed versions are the newest ones, as identified in Step III-b-i. If they are, the NOTE below can be skipped.

    NOTE: If the installed versions are not the newest ones, they must be recompiled manually. First, the binaries can be recompiled in `Terminal` using `Homebrew` by running:

        brew reinstall cd-hit --build-from-source
        brew reinstall hmmer --build-from-source

    If these compiled versions are still not the latest ones, download the latest source code from the official websites manually and compile it. For `CD-HIT`, once the source code archive is extracted, go inside the extracted directory using `Terminal` and run the following to perform the compilation:

        make
        make install

    For `HMMER`, once the source code archive is extracted, go inside the extracted directory using `Terminal` and run the following to perform the compilation:

        ./configure
        make
        make install

    Once the required versions of `CD-HIT` and `HMMER` are fetched/compiled, locate the required binaries and copy them into RAVEN using the following commands:

        cp `which cd-hit` RAVENdir/software/cd-hit/cd-hit.mac
        cp `which hmmbuild` RAVENdir/software/hmmer/hmmbuild.mac
        cp `which hmmsearch` RAVENdir/software/hmmer/hmmsearch.mac

  - III-b-iv. In Unix/Linux, try to install `CD-HIT` and `HMMER` using `Linuxbrew` in `Terminal`:

        brew tap brewsci/bio
        brew install cd-hit
        brew install hmmer

    NOTE: It is strongly recommended to use the latest LTS version of Ubuntu. One may also use a cluster, but binaries built against its relatively old kernel may not work properly on systems with newer kernels. In addition, one may try to install `CD-HIT` and `HMMER` using `apt install`, but it is very unlikely that the latest versions will be fetched.

    Then make sure that the installed versions are the newest ones, as identified in Step III-b-i. If they are, the NOTE below can be skipped.

    NOTE: If the installed versions are not the newest ones, they must be recompiled manually. First, the binaries can be recompiled in `Terminal` using `Homebrew` by running:

        brew reinstall cd-hit --build-from-source
        brew reinstall hmmer --build-from-source

    If these compiled versions are still not the latest ones, download the latest source code from the official websites manually and compile it. For `CD-HIT`, once the source code archive is extracted, go inside the extracted directory using `Terminal` and run the following to perform the compilation:

        make
        make install

    For `HMMER`, once the source code archive is extracted, go inside the extracted directory using `Terminal` and run the following to perform the compilation:

        ./configure
        make
        make install

    Once the required versions of `CD-HIT` and `HMMER` are fetched/compiled, locate the required binaries and copy them into RAVEN using the following commands:

        cp `which cd-hit` RAVENdir/software/cd-hit/cd-hit
        cp `which hmmbuild` RAVENdir/software/hmmer/hmmbuild
        cp `which hmmsearch` RAVENdir/software/hmmer/hmmsearch

  - III-b-v. In Windows, the only available option to install `CD-HIT` and `HMMER` is to compile the binaries manually. We do it in `Cygwin` (64-bit version). Even if `Cygwin` is already installed on the computer, download the `Cygwin` installer again and select the following packages during installation:
    - gcc-g++ : GNU Compiler Collection (C++)
    - openssh : the OpenSSH server and client programs
    - make : the GNU version of the make utility
    - ncurses : terminal display utilities
    - clisp : an ANSI Common Lisp implementation
    - vim minimal : minimal vi text editor
    - zlib-dev : gzip de/compression library (development)

    NOTE: Try to install the latest possible versions, but avoid versions flagged with `-test`, etc.

    Then download the latest source code for `CD-HIT` and `HMMER` from the official websites. It is recommended to work directly in the `Downloads` directory. Extract both downloaded archives there. Then create two sub-directories in `Downloads`: `cdhit_bin` and `hmmer_bin`.

    Open `Cygwin Terminal` and navigate to the `Downloads` directory. Then go inside the `CD-HIT` source code directory and compile the code by typing `make`. After the compilation, notice that a new file called `cd-hit.exe` has been generated in the current directory. Copy this file into the `cdhit_bin` folder. Thereafter, go inside the `HMMER` source code directory and compile the code by typing `./configure` and then `make`. After the compilation, notice that the binary files `hmmbuild.exe` and `hmmsearch.exe` have been generated in the `src` directory. Copy both files into the `hmmer_bin` folder.

    Now it is time to find which `dll` files are required to run the newly created `CD-HIT` and `HMMER` binaries. To find this out, run each `exe` file repeatedly for as long as Windows reports missing `dll` files. All the missing `dll` files can be found in the `Cygwin` installation directory, e.g., `c:\cygwin64\bin`. The missing `dll` files must be copied into the same folder as the corresponding `exe` files. Once the errors about missing `dll` files are gone for all three `exe` files, the job is complete.

    NOTE: One can test the executables in `PowerShell`. Just navigate to the directory where the binaries are and type e.g., `./cd-hit`.

    Copy the compiled binaries together with the corresponding `DLLs` to `RAVENdir/software/cd-hit` and `RAVENdir/software/hmmer`.
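
Instead of discovering the missing `dll` files one error message at a time, Cygwin's `cygcheck` utility can list the libraries an executable depends on. The sketch below is an optional shortcut and not part of the original protocol: the helper name `list_cygwin_dlls` is hypothetical, and the filter assumes that Cygwin is installed in a directory whose name starts with `cygwin` (e.g., `c:\cygwin64`), so that its libraries show up under `cygwin64\bin\` in the `cygcheck` output.

```shell
# list_cygwin_dlls  (hypothetical helper)
# Reads `cygcheck` output on stdin and prints the unique DLL paths located
# in a Cygwin bin directory; Windows system DLLs are filtered out.
list_cygwin_dlls() {
    grep -i 'cygwin[0-9]*\\bin\\.*\.dll' | sed 's/^[[:space:]]*//' | sort -u
}

# Usage in Cygwin Terminal, from the Downloads directory
# (cygpath converts the Windows path to a Cygwin path for cp):
#   cygcheck ./cdhit_bin/cd-hit.exe | list_cygwin_dlls |
#       while IFS= read -r dll; do cp "$(cygpath -u "$dll")" cdhit_bin/; done
```

Running the executables afterwards, as described above, remains the final check that no `dll` file is missing.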
  - III-b-vi. Download the license files for `CD-HIT` and `HMMER` and put them in `RAVENdir/software/cd-hit` and `RAVENdir/software/hmmer`, respectively.
  - III-b-vii. Update the version numbers for `CD-HIT` and `HMMER` in `RAVENdir/software/versions.txt` with the ones identified in Step III-b-i.
  - III-b-viii. Commit the changes through 8 commits: six commits (one per tool per operating system), one commit for the license files and one commit for the version change in `RAVENdir/software/versions.txt`.
- III-c. Run `checkInstallation` and ensure that no errors are reported on any of the three systems: macOS, Unix/Linux (a cluster can be used for testing) and Windows.
- IV-a. The step where the protein sequences are re-organised from `genes.pep` into KEGG Orthology (KO)-specific multi-fasta files requires additional Java heap memory. Increase the Java heap size in MATLAB by going to Preferences and finding the General -> Java Heap Size option. It is recommended to increase the value to at least 2048 MB. A MATLAB restart may be required after this change.
NOTE: The increased requirement for Java heap memory is currently the only restriction that prevents running the KEGG database update on the cluster. The cluster may be used to update the KEGG database as soon as a way is found to increase the Java heap size for MATLAB there.
- IV-b. Make sure that the COBRA Toolbox is not in the MATLAB path, to avoid conflicts between files with identical names.
- V-a. Go to the RAVEN installation directory, navigate to the `external` and then `kegg` sub-directories, and remove the existing KEGG database from RAVEN by removing all 5 files in `mat` format.
- V-b. Move the 6 pre-processed files obtained during Step I (excluding `genes.pep` and `taxonomy`) into the `kegg` directory.
- V-c. Set `keggUpdate/euk90_keggXX` as the Current Folder in MATLAB.
- V-d. Re-generate the `mat` files from the pre-processed KEGG FTP files. This can be done by generating the draft GEM for Saccharomyces cerevisiae. Here we do not provide the S. cerevisiae proteome and instead just ask to fetch all the KEGG reactions associated with the S. cerevisiae three-letter code:

      model = getKEGGModelForOrganism('sce');

  The reconstruction should take around 1 hour to complete. It is okay that `keggPhylDist.mat` is not generated, since it will be produced in the later steps.
- V-e. Create a directory called `KEGG FTP dump files` inside `keggUpdate`. Then move the 6 pre-processed files (placed there during Step V-b) from `kegg` to this directory. To save disk space, these files can be compressed into a `zip` archive.
- VI-a. Go to the RAVEN installation directory, then navigate to `tutorial` and copy the `sce.fa` file to the `keggUpdate` directory.
- VI-b. Ensure that the Current Folder in MATLAB is `keggUpdate`.
- VI-c. Generate the new HMM set `euk90_keggXX`. Unlike Step V-d, here we provide the S. cerevisiae proteome, so the new HMM set will be generated along the way:

      euk90=getKEGGModelForOrganism('sce','sce.fa','euk90_keggXX','out_euk90_keggXX',true,true,true,true,10^-50,0.8,0.3,-1,inf,1);

  This reconstruction may take up to 5-6 hours. The new file `keggPhylDist.mat` should appear in the `kegg` directory.
- VI-d. Delete the file `genes.pep` from the `keggdb` directory. This file is no longer necessary, since the protein sequences have already been pre-processed into KO-specific multi-fasta files in the `fasta` directory.
- VI-e. Move the file `taxonomy` from `keggdb` to the `KEGG FTP dump files` directory.
- VI-f. Remove all the files from the `aligned` directory.
- VI-g. Copy the entire `fasta` folder to the `keggUpdate` directory. Then remove all the files from `keggUpdate/euk90_keggXX/fasta`.
- VI-h. Remove the `out_euk90_keggXX` directory.
- VI-i. The newly trained HMM set `euk90_keggXX` is finished. Compress it, saving the archive as `euk90_keggXX.zip`. Make sure that the archive does not contain any macOS-specific files (e.g., `.DS_Store`).
- VI-j. Since the full set of KEGG `mat` files has now been generated, commit the changes. When committing, stage only these 5 `mat` files and nothing else!
- VII-a. Create the following directory in `keggUpdate`: `prok90_keggXX`.
- VII-b. Copy the `fasta` folder to the above directory.
- VII-c. Compress the `keggUpdate/fasta` folder as `fasta_keggXX.zip` and move the archive to `KEGG FTP dump files`.
- VIII-a. Ensure that the Current Folder in MATLAB is `keggUpdate`.
- VIII-b. Generate the remaining HMM sets by running the corresponding commands:

      prok90=getKEGGModelForOrganism('eco','sce.fa','prok90_keggXX','out_prok90_keggXX',true,true,true,true,10^-50,0.8,0.3,-1,inf,0.9);

- VIII-c. In the `prok90_keggXX` directory, remove all the files from the `aligned` and `fasta` sub-directories.
- VIII-d. Remove the following directories:
  - `out_prok90_keggXX`
- VIII-e. The newly trained HMM sets are finished. Compress them to zip format. The following archive files should be created: `prok90_keggXX.zip`. Make sure that the archives do not contain any macOS-specific files and directories (e.g., `.DS_Store`, `__MACOSX`).
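
Whether a directory still holds macOS-specific entries can be checked before compressing it. The following sketch uses the standard `find` command; the function name `find_macos_debris` is hypothetical.

```shell
# find_macos_debris DIR  (hypothetical helper)
# Prints every .DS_Store file and __MACOSX directory found at or below DIR.
# Empty output means the directory is clean and ready to be compressed.
find_macos_debris() {
    find "$1" \( -name .DS_Store -o -name __MACOSX \) -print
}

# Usage, from keggUpdate:
#   find_macos_debris prok90_keggXX
# If the Info-ZIP `zip` command is used for compression, such entries can
# also be excluded while the archive is created:
#   zip -r prok90_keggXX.zip prok90_keggXX -x '*.DS_Store' -x '__MACOSX/*'
```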
- IX-a. Decide what the next RAVEN release number will be, for instance 2.8.0; the new KEGG HMM files will be attached to this release. Each new KEGG release justifies a minor version bump (e.g., from 2.7.12 to 2.8.0).
- IX-b. Open the `getKEGGModelForOrganism.m` file and locate the part where `websave` attempts to download the HMM ZIP files from GitHub. Update the version number in that link to the release number decided on in Step IX-a. This line should now read something similar to:

      websave([dataDir,'.zip'],['https://github.com/SysBioChalmers/RAVEN/releases/download/v2.8.0/',hmmOptions{hmmIndex},'.zip']);

- IX-c. Commit the changes made in `getKEGGModelForOrganism.m`.
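
To confirm that the edit in Step IX-b took effect, the release tag embedded in the download link can be extracted from the file. This is a minimal sketch; the function name `hmm_release_tag` is hypothetical, and it assumes that the link follows the `releases/download/vX.Y.Z/` pattern shown above.

```shell
# hmm_release_tag FILE  (hypothetical helper)
# Prints the release tag(s), e.g. v2.8.0, found in GitHub release
# download links inside FILE.
hmm_release_tag() {
    grep -o 'releases/download/v[0-9.]*' "$1" | sed 's|.*/||' | sort -u
}

# Usage, from the directory that contains the file:
#   hmm_release_tag getKEGGModelForOrganism.m
```

If more than one tag is printed, the file still contains a link to an older release.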
- X-a. Fix all the RAVEN functions that use `getKEGGModelForOrganism`; this includes `tutorial5`, `tutorial6` and likely other functions introduced after RAVEN v2.2.2. Then commit the changes to `fix/keggUpdate`.
- X-b. Check the Wiki in the RAVEN Toolbox repository and update the pages so that they reflect the newest KEGG update.
- XI-a. Create a Pull Request from `fix/keggUpdate` to the `devel` branch.
- XI-b. Once the above is merged, make a Pull Request from `devel` to the `main` branch.
- XII-a. Make a new RAVEN release with the version number decided in Step IX-a. In the release text, mention that the KEGG version has been updated.
- XII-b. On the same page where the details of the new release are entered, attach the two ZIP files (`prok90_keggXX.zip` and `euk90_keggXX.zip`) as binaries to the release.
- XII-c. Publish the release; the ZIP files can now be downloaded.