
KEGG Update in RAVEN

Eduard Kerkhoven edited this page Mar 8, 2023 · 12 revisions

The Developer’s Protocol for Updating KEGG and External Software

This is the step-by-step protocol for developers to update the KEGG database that comes bundled with the RAVEN Toolbox 2. Since the database is updated from KEGG FTP dump files, a KEGG FTP subscription is required. The protocol gives a detailed overview of how to generate a new KEGG database and KEGG Orthology (KO) specific Hidden Markov Model (HMM) sets from KEGG FTP content. In addition, the key KEGG FTP dump files are saved for later use, so that the full KEGG update for RAVEN can always be re-run from these files. This is helpful if major inconsistencies are later found in the KEGG mat files or HMM sets.

NOTE: Users who only aim to reconstruct a model from KEGG FTP dump files should use this guide instead.

NOTE: It is unlikely that a single user has access to all three systems (macOS, Unix/Linux, Windows), so some collaboration between multiple users may be necessary.


Step I. Download and Pre-Process KEGG FTP Files:

  • I-a. Go to https://www.genome.jp/kegg/docs/relnote.html and find the latest KEGG Release number.
  • I-b. Create the following nested structure of directories somewhere in the personal Box Sync: keggUpdate/euk90_keggXX/keggdb. In this protocol, XX indicates the release number. Use the number identified in Step I-a here. It is sufficient to include the major release number, e.g., 88.
  • I-c. Log in to KEGG FTP via ftp.bioinformatics.jp using SysBio or personal credentials. It is strongly recommended to use FTP client software like FileZilla as several files to be downloaded are large.
  • I-d. Download all the source files to keggdb directory from the following locations in KEGG FTP:
    • /kegg/ligand/reaction.tar.gz
    • /kegg/ligand/compound.tar.gz
    • /kegg/ligand/glycan.tar.gz
    • /kegg/genes/ko.tar.gz
    • /kegg/genes/fasta/eukaryotes.pep.gz
    • /kegg/genes/fasta/prokaryotes.pep.gz
    • /kegg/genes/misc/taxonomy
  • I-e. Extract all the downloaded archives. Only specific files from each archive are required, and they must be placed directly in the keggdb directory. This step, including the clean-up of temporary files, can be done in Terminal (in Windows, use Cygwin Terminal instead):
    for f in *.tar.gz; do tar xf "$f" && rm "$f"; done
    gunzip *.gz
    mv reaction/reaction reaction_raw
    mv reaction/reaction.lst reaction.lst
    mv reaction/reaction_mapformula.lst reaction_mapformula.lst
    mv compound/compound compound_raw
    mv compound/compound.inchi compound.inchi
    mv glycan/glycan glycan_raw
    mv ko/ko ko_raw
    rm -rf reaction && rm -rf compound && rm -rf glycan && rm -rf ko
    mv reaction_raw reaction && mv ko_raw ko
    
  • I-f. Concatenate compound and multi-fasta protein files:
    cat compound_raw glycan_raw > compound
    cat eukaryotes.pep prokaryotes.pep > genes.pep
    rm *_raw && rm *tes.pep
    
  • I-g. Ensure that keggdb folder contains only the following 8 files:
    • compound
    • compound.inchi
    • genes.pep
    • ko
    • reaction
    • reaction_mapformula.lst
    • reaction.lst
    • taxonomy
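The check in Step I-g can be automated. The following is a minimal sketch, assuming a POSIX-compatible shell; `check_keggdb` is a hypothetical helper name and not part of RAVEN. It reports expected files that are missing and any leftover entries (hidden files are not inspected).

```shell
# Hypothetical helper (not part of RAVEN): verify that a keggdb directory
# holds exactly the 8 files listed in Step I-g.
check_keggdb() {
  dir=$1
  expected="compound compound.inchi genes.pep ko reaction reaction.lst reaction_mapformula.lst taxonomy"
  rc=0
  # Report every expected file that is missing.
  for f in $expected; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      rc=1
    fi
  done
  # Report every non-hidden entry that is not in the expected list.
  for entry in "$dir"/*; do
    [ -e "$entry" ] || continue
    name=${entry##*/}
    case " $expected " in
      *" $name "*) ;;
      *) echo "unexpected: $name"; rc=1 ;;
    esac
  done
  return $rc
}
```

Running `check_keggdb keggUpdate/euk90_keggXX/keggdb` prints nothing and returns 0 when the directory matches the list above.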

Step II. Install the RAVEN Toolbox:

  • II-a. Install the latest RAVEN version and its dependencies using the instructions online. Download RAVEN to the local machine using GitHub Desktop, GitKraken or simply Git in Terminal.
  • II-b. Make sure that the GitHub user who is going to commit the changes and create a Pull Request has write permissions in the RAVEN repository.
  • II-c. In the local RAVEN repository, change the current branch from main to devel.
  • II-d. Create a new branch based on devel and name it e.g., fix/keggUpdate. Then change the current branch to this newly created branch.
  • II-e. Run checkInstallation to ensure that RAVEN is working fine.
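For those using plain Git in Terminal, Steps II-c and II-d could look like the sketch below. It assumes that a local devel branch already exists and that the working tree is clean; `create_update_branch` is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: switch to the base branch (Step II-c), then create
# and switch to the new working branch (Step II-d).
create_update_branch() {
  repo=$1   # path to the local RAVEN repository
  base=$2   # branch to start from, e.g. devel
  new=$3    # branch to create, e.g. fix/keggUpdate
  git -C "$repo" checkout -q "$base" &&
    git -C "$repo" checkout -q -b "$new"
}
```

For example, `create_update_branch RAVENdir devel fix/keggUpdate` leaves the repository on the new fix/keggUpdate branch.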

Step III. Compile and Update the External Binaries in RAVEN:

NOTE: This step only covers the external binaries required in KEGG-based GEM reconstruction: MAFFT, CD-HIT and HMMER. While the update is simple for MAFFT, preparing the updates for CD-HIT and HMMER takes additional effort and time, because the binaries must be prepared for all three systems (macOS, Unix/Linux and Windows).

  • III-a. Update MAFFT: As of version 7.427, the developer distributes standalone all-in-one packages for macOS, Unix and Windows systems:

    • III-a-i. Go to the official MAFFT website and identify the latest available version, which is simultaneously available as all-in-one package for macOS, Unix/Linux and Windows. Regarding Unix/Linux, do not consider rpm, deb versions, but search for Portable package instead. If the newest version is not available for all three systems, use the older version which is available for all three systems. For instance, in August 2019 the newest MAFFT version for Windows was 7.429, but since it was not available for the other two systems, the older version 7.427, which was available for all three systems, was considered.
    • III-a-ii. Remove all the files from RAVENdir/software/mafft directory.
    • III-a-iii. Download the three MAFFT all-in-one/portable packages identified in Step III-a-i. Also, download the license file (e.g., version 7.427 has a BSD license). Place all the downloaded files in RAVENdir/software/mafft directory.
    • III-a-iv. Extract all three downloaded archives. Ensure that all three MAFFT executable files, each named mafft.bat, are available in the following directories:
      • RAVENdir/software/mafft/mafft-linux64/mafft.bat
      • RAVENdir/software/mafft/mafft-mac/mafft.bat
      • RAVENdir/software/mafft/mafft-win/mafft.bat
    • III-a-v. Remove the three downloaded archives from RAVENdir/software/mafft directory. This directory must now contain only the license file and three sub-directories.
    • III-a-vi. Update the version number for MAFFT in RAVENdir/software/versions.txt with the one identified in Step III-a-i.
    • III-a-vii. Commit the changes through 5 commits: three commits (one per operating system), one commit for the license file and one commit for the version change in RAVENdir/software/versions.txt.
  • III-b. Update CD-HIT and HMMER:

    • III-b-i. Go to the official CD-HIT and HMMER websites and identify their corresponding latest versions.

    • III-b-ii. Remove all the files from RAVENdir/software/cd-hit and RAVENdir/software/hmmer directories.

    • III-b-iii. In macOS, try to install CD-HIT and HMMER using Homebrew in Terminal:

      brew tap brewsci/bio
      brew install cd-hit
      brew install hmmer
      

      Then make sure that the installed versions are the newest ones as identified in Step III-b-i. If they are, the NOTE below in grey can be skipped.

      NOTE: If the installed versions are not the newest ones, they must be recompiled manually. Firstly, the binaries can be recompiled in Terminal using Homebrew by running:

      brew reinstall cd-hit --build-from-source
      brew reinstall hmmer --build-from-source
      

      If these compiled versions are still not the latest ones, download the latest source code from the official websites and compile it manually. Regarding CD-HIT, once the source code archive is extracted, go inside the extracted directory using Terminal, and then run:

      make
      make install
      

      to perform the compilation. Regarding HMMER, once the source code archive is extracted, go inside the extracted directory using Terminal, and then run:

      ./configure
      make
      make install
      

      to perform the compilation.

      Now that the required versions of CD-HIT and HMMER are fetched/compiled, locate the required binaries and copy them into RAVEN using the following commands:

      cp `which cd-hit` RAVENdir/software/cd-hit/cd-hit.mac
      cp `which hmmbuild` RAVENdir/software/hmmer/hmmbuild.mac
      cp `which hmmsearch` RAVENdir/software/hmmer/hmmsearch.mac
      
    • III-b-iv. In Unix/Linux, try to install CD-HIT and HMMER using Linuxbrew in Terminal:

      brew tap brewsci/bio
      brew install cd-hit
      brew install hmmer
      

      NOTE: It is strongly recommended to use the latest LTS version of Ubuntu. One may also use cluster, but its relatively old kernel may compromise the binaries in systems using newer kernels. In addition, one may try to install CD-HIT and HMMER using apt install, but it is very unlikely that the latest versions are fetched. Then make sure that the installed versions are the newest ones as identified in Step III-b-i. If they are, the NOTE below in grey can be skipped.

      NOTE: If the installed versions are not the newest ones, they must be recompiled manually. Firstly, the binaries can be recompiled in Terminal using Homebrew by running:

      brew reinstall cd-hit --build-from-source
      brew reinstall hmmer --build-from-source
      

      If these compiled versions are still not the latest ones, download the latest source code from the official websites and compile it manually. Regarding CD-HIT, once the source code archive is extracted, go inside the extracted directory using Terminal, and then run:

      make
      make install
      

      to perform the compilation. Regarding HMMER, once the source code archive is extracted, go inside the extracted directory using Terminal, and then run:

      ./configure
      make
      make install
      

      to perform the compilation.

      Now that the required versions of CD-HIT and HMMER are fetched/compiled, locate the required binaries and copy them into RAVEN using the following commands:

      cp `which cd-hit` RAVENdir/software/cd-hit/cd-hit
      cp `which hmmbuild` RAVENdir/software/hmmer/hmmbuild
      cp `which hmmsearch` RAVENdir/software/hmmer/hmmsearch
      
    • III-b-v. In Windows, the only available option to install CD-HIT and HMMER is to compile the binaries manually. We do it in Cygwin (64-bit version). Even if Cygwin is already installed on the computer, download the Cygwin installer again and select the following packages during installation:

      • gcc-g++ : GNU Compiler Collection (C++)
      • openssh : the OpenSSH server and client programs
      • make : the GNU version of the make utility
      • ncurses : terminal display utilities
      • clisp : an ANSI Common Lisp implementation
      • vim minimal : minimal vi-text editor
      • zlib-dev : gzip de/compression library (development)

      NOTE: Try to install the latest possible versions, but avoid the versions flagged with -test, etc.

      Then download the latest version source codes for CD-HIT and HMMER from the official websites. It is recommended to work right in Downloads directory. Extract both downloaded archives there. Then create two sub-directories in Downloads: cdhit_bin and hmmer_bin.

      Open Cygwin Terminal and navigate to Downloads directory. Then go inside CD-HIT source code directory and compile the code by typing make. After the compilation, notice that the new file called cd-hit.exe is generated in the existing directory. Copy this file into cdhit_bin folder. Thereafter, go inside HMMER source code directory and compile the code by typing ./configure and then make. After the compilation, notice that the binary files hmmbuild.exe and hmmsearch.exe were generated in src directory. Copy both files into hmmer_bin folder.

      Now it is time to find which dll files are required to run these newly created CD-HIT and HMMER binaries. To do so, run each exe file repeatedly until Windows no longer gives errors about missing dll files. All the missing dll files can be found in the Cygwin installation directory, e.g., c:\cygwin64\bin. The missing dll files must be copied into the same folder as the corresponding exe files. Once the errors about missing dll files are gone for all three exe files, the job is completed.

      NOTE: One can test the executables in PowerShell. Just navigate to the directory where binaries are and type e.g., ./cd-hit.

      Copy the compiled binaries together with the corresponding DLLs to RAVENdir/software/cd-hit and RAVENdir/software/hmmer.

    • III-b-vi. Download the license files for CD-HIT and HMMER and put them in RAVENdir/software/cd-hit and RAVENdir/software/hmmer respectively.

    • III-b-vii. Update the version numbers for CD-HIT and HMMER in RAVENdir/software/versions.txt with the ones identified in Step III-b-i.

    • III-b-viii. Commit the changes through 8 commits: six commits (one per tool per operating system), one commit for the license files and one commit for the version change in RAVENdir/software/versions.txt.

  • III-c. Run checkInstallation and ensure that no errors are given in all three systems: macOS, Unix/Linux (cluster can be used for testing) and Windows.
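Before running checkInstallation, it may help to confirm that every binary named in Step III is in place. The sketch below assumes that the Windows executables keep the names cd-hit.exe, hmmbuild.exe and hmmsearch.exe after copying; `check_raven_binaries` is a hypothetical helper name, and the check does not cover the DLL or license files.

```shell
# Hypothetical helper: list the binaries from Step III that are missing
# under RAVENdir/software. The .exe names are an assumption based on the
# file names produced during the Cygwin compilation.
check_raven_binaries() {
  raven=$1
  rc=0
  for rel in \
    mafft/mafft-linux64/mafft.bat \
    mafft/mafft-mac/mafft.bat \
    mafft/mafft-win/mafft.bat \
    cd-hit/cd-hit cd-hit/cd-hit.mac cd-hit/cd-hit.exe \
    hmmer/hmmbuild hmmer/hmmbuild.mac hmmer/hmmbuild.exe \
    hmmer/hmmsearch hmmer/hmmsearch.mac hmmer/hmmsearch.exe
  do
    if [ ! -f "$raven/software/$rel" ]; then
      echo "missing: software/$rel"
      rc=1
    fi
  done
  return $rc
}
```

Calling `check_raven_binaries RAVENdir` returns 0 only when all twelve files are present.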

Step IV. Configure MATLAB:

  • IV-a. The step where the protein sequences are re-organised from genes.pep into KEGG Orthology (KO) specific multi-fasta files requires additional Java Heap Memory. Increase Java Heap Size in MATLAB by going to Preferences and then finding the General -> Java Heap Size option. It is recommended to increase the value to at least 2048 MB. A MATLAB restart may be required after this change.

NOTE: The increased requirement for Java Heap Memory is currently the only restriction that prevents running KEGG database update in the cluster. The cluster may be used to update the KEGG database as soon as one finds how to increase Java Heap Size for MATLAB in the cluster.

  • IV-b. Make sure that you do not have COBRA Toolbox in the MATLAB path, to avoid issues where files have identical names.

Step V. Update KEGG Database in RAVEN:

  • V-a. Go to the RAVEN installation directory, then navigate to external and kegg sub-directories and remove the existing KEGG database from RAVEN by removing all 5 files in mat format.
  • V-b. Move 6 pre-processed files obtained during Step I (excluding genes.pep and taxonomy) into kegg directory.
  • V-c. Set keggUpdate/euk90_keggXX as the Current Folder in MATLAB.
  • V-d. Re-generate mat files from the pre-processed KEGG FTP files. This can be done by generating the draft GEM for Saccharomyces cerevisiae. Here we do not provide the S. cerevisiae proteome and instead just ask to fetch all the KEGG reactions associated with the S. cerevisiae three-letter code:
    model = getKEGGModelForOrganism('sce');
    The reconstruction should take around 1 hour to complete. It is okay that keggPhylDist.mat is not generated, since it will be produced in the later steps.
  • V-e. Create a directory called KEGG FTP dump files inside keggUpdate. Then move the 6 pre-processed files (moved during Step V-b) from kegg to this directory. To save disk space, these files can be compressed into a zip archive.
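The two file moves in Steps V-b and V-e involve the same six files, so one small function can serve both. This is a sketch under the directory layout described above; `move_kegg_sources` is a hypothetical helper name.

```shell
# Hypothetical helper: move the six pre-processed KEGG files (everything
# from Step I-g except genes.pep and taxonomy) from one directory to another.
move_kegg_sources() {
  from=$1
  to=$2
  mkdir -p "$to" || return 1
  for f in compound compound.inchi ko reaction reaction.lst reaction_mapformula.lst; do
    # Stop at the first file that cannot be moved.
    mv "$from/$f" "$to/" || return 1
  done
}
```

For Step V-b this would be called as `move_kegg_sources keggUpdate/euk90_keggXX/keggdb RAVENdir/external/kegg`, and for Step V-e as `move_kegg_sources RAVENdir/external/kegg "keggUpdate/KEGG FTP dump files"`. The quotes matter in the second call because the directory name contains spaces.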

Step VI. Generate the HMM Set euk90_keggXX:

  • VI-a. Go to the RAVEN installation directory, then navigate to tutorial and copy sce.fa file to keggUpdate directory.
  • VI-b. Ensure that the Current Folder in MATLAB is keggUpdate.
  • VI-c. Generate the new HMM set euk90_keggXX. Unlike Step V-d, here we provide S. cerevisiae proteome, so the new HMM set will be generated along the way:
    euk90=getKEGGModelForOrganism('sce','sce.fa','euk90_keggXX','out_euk90_keggXX',true,true,true,true,10^-50,0.8,0.3,-1,inf,1);
    Such reconstruction may take up to 5-6 hours. The new file keggPhylDist.mat should appear in kegg directory.
  • VI-d. Delete the file genes.pep from keggdb directory. This file is no longer necessary since the protein sequences are already pre-processed into KO-specific multi-fasta files in fasta directory.
  • VI-e. Move the file taxonomy from keggdb to KEGG FTP dump files directory.
  • VI-f. Remove all the files from aligned directory.
  • VI-g. Copy the entire fasta folder to keggUpdate directory. Then remove all the files from keggUpdate/euk90_keggXX/fasta.
  • VI-h. Remove out_euk90_keggXX directory.
  • VI-i. The newly trained HMM set euk90_keggXX is finished. Compress it while saving the archive as euk90_keggXX.zip. Make sure that the archive does not contain any macOS-specific files (e.g., .DS_Store).
  • VI-j. Since the full set of KEGG mat files is now generated, commit the changes. When committing, stage only these 5 mat files and nothing else.
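The clean-up in Steps VI-d to VI-h could be scripted as in the sketch below. It assumes that the aligned, fasta and keggdb directories all sit inside keggUpdate/euk90_keggXX and that out_euk90_keggXX sits directly in keggUpdate; `finalize_hmm_set` is a hypothetical helper name. The commands delete files, so it is worth checking the paths before running it.

```shell
# Hypothetical helper: clean up an HMM set directory after training.
# Assumed layout: <root>/<set>/{keggdb,aligned,fasta} and <root>/out_<set>.
finalize_hmm_set() {
  root=$1       # e.g. keggUpdate
  set_name=$2   # e.g. euk90_keggXX
  data="$root/$set_name"
  dumps="$root/KEGG FTP dump files"
  rm -f "$data/keggdb/genes.pep" &&          # VI-d: drop genes.pep
    mkdir -p "$dumps" &&
    mv "$data/keggdb/taxonomy" "$dumps/" &&  # VI-e: keep taxonomy with the dump files
    find "$data/aligned" -type f -delete &&  # VI-f: empty aligned
    cp -R "$data/fasta" "$root/" &&          # VI-g: keep a copy of fasta in keggUpdate
    find "$data/fasta" -type f -delete &&    # VI-g: empty the fasta folder of the set
    rm -rf "$root/out_$set_name"             # VI-h: remove the output directory
}
```

A call such as `finalize_hmm_set keggUpdate euk90_keggXX` stops at the first failing command, so a missing taxonomy file leaves the later directories untouched.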

Step VII. Prepare for the Remaining HMM Set Generation:

  • VII-a. Create the following directory in keggUpdate:
    • prok90_keggXX
  • VII-b. Copy fasta folder to the above directory.
  • VII-c. Compress keggUpdate/fasta folder as fasta_keggXX.zip and move to KEGG FTP dump files.

Step VIII. Generate the Remaining HMM Sets:

  • VIII-a. Ensure that the Current Folder in MATLAB is keggUpdate.

  • VIII-b. Generate the remaining HMM sets by running the corresponding commands:

    prok90=getKEGGModelForOrganism('eco','sce.fa','prok90_keggXX','out_prok90_keggXX',true,true,true,true,10^-50,0.8,0.3,-1,inf,0.9);
  • VIII-c. In the prok90_keggXX directory, remove all the files from aligned and fasta sub-directories.

  • VIII-d. Remove the following directories:

    • out_prok90_keggXX

  • VIII-e. The newly trained HMM sets are finished. Compress them to zip format. The following archive files should be created:

    • prok90_keggXX.zip

    Make sure that the archives do not contain any macOS-specific files and directories (e.g., .DS_Store, __MACOSX).
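One way to check for such files before compressing is to search the HMM set directory for the two names mentioned above. This is a minimal sketch; `list_macos_debris` is a hypothetical helper name, and it only lists matches without deleting anything.

```shell
# Hypothetical helper: print every .DS_Store file and __MACOSX directory
# found below the given directory. Empty output means nothing was found.
list_macos_debris() {
  find "$1" \( -name '.DS_Store' -o -name '__MACOSX' \) -print
}
```

For example, `list_macos_debris prok90_keggXX` should print nothing for a directory that is ready to be zipped.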

Step IX. Upload the Newly Trained HMM Sets to Box Sync and Add the Download Links to RAVEN:

  • IX-a. Determine the next RAVEN release number, for instance 2.8.0, to which the new KEGG HMM files will be attached. Each new KEGG release justifies a minor version bump (e.g. from 2.7.12 to 2.8.0).
  • IX-b. Open getKEGGModelForOrganism.m file and locate the part where websave attempts to download the HMM ZIP files from GitHub. Update the version number in that link to the release number decided on in IX-a. This line should now read something similar to:
    websave([dataDir,'.zip'],['https://github.com/SysBioChalmers/RAVEN/releases/download/v2.8.0/',hmmOptions{hmmIndex},'.zip']);
  • IX-c. Commit the changes done in getKEGGModelForOrganism.m.
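The edit in Step IX-b can also be made from Terminal. The sketch below assumes that the download URL in getKEGGModelForOrganism.m has the form shown above; `bump_hmm_release` is a hypothetical helper name. It rewrites only the version segment that follows releases/download/v.

```shell
# Hypothetical helper: replace the release version in the HMM download URL.
bump_hmm_release() {
  file=$1   # path to getKEGGModelForOrganism.m
  new=$2    # new release number without the leading v, e.g. 2.8.0
  tmpfile=$(mktemp) || return 1
  # Write the edited copy to a temporary file, then overwrite the original
  # in place so that its permissions are preserved.
  sed "s#releases/download/v[0-9][0-9.]*/#releases/download/v${new}/#" "$file" > "$tmpfile" &&
    cat "$tmpfile" > "$file" &&
    rm -f "$tmpfile"
}
```

After running it with the release number from Step IX-a, reviewing the result with `git diff` before committing should show a single changed line.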

Step X. Fix the Remaining Parts of RAVEN and RAVEN Wiki:

  • X-a. Fix all the RAVEN functions which use getKEGGModelForOrganism, this includes tutorial5, tutorial6 and likely other functions, which were introduced after RAVEN v2.2.2. Then commit the changes to fix/keggUpdate.
  • X-b. Check the Wiki in the RAVEN Toolbox repository and update the pages reflecting the newest KEGG update.

Step XI. Create a Pull Request to devel and main:

  • XI-a. Create a Pull Request from fix/keggUpdate to devel branch.
  • XI-b. Once the above is merged, make a Pull Request from devel to main branch.

Step XII. Make a New RAVEN Release and Attach ZIPs:

  • XII-a. Make a new RAVEN release with the version number decided in IX-a. In the release text, include a mention that the KEGG version has been increased.
  • XII-b. On the same page where details of the new release are entered, attach the two ZIP files (prok90_keggXX.zip and euk90_keggXX.zip) as binaries to the release.
  • XII-c. Publish the release. The ZIP files can now be downloaded.