
Running Bioconductor in Docker using Bulker

Levi Waldron edited this page Apr 28, 2020 · 11 revisions

Preamble

Docker is great because it allows us to share a common environment for development, computation, and other tasks. It means being able to install, use, and develop software with fewer errors like "Undefined symbols for architecture x86_64: _tilde_expand_word...", "Configuration failed for package xml2", and other compilation failures that make you want to say

I just can't anymore

For example, the bioconductor/bioconductor_docker Docker images allow you to run the release and development versions of Bioconductor at any time, and to install nearly all Bioconductor packages, without having to install anything other than Docker (not even libxml2)! The same holds true for many other bioinformatics tools now available in Docker containers. They just work, without having to install them.

However, Docker's command-line options are notoriously long and complicated. For example, just launching an RStudio server from the image above involves at a minimum:

docker run \
    -e DISABLE_AUTH=true \
    -p 8787:8787 \
    bioconductor/bioconductor_docker:devel

(then you can access RStudio at http://localhost:8787). You probably also want to mount one or more disk volumes so you can access files from your home directory and keep the packages you install available even after restarting the container. Then there are other commands for R on the command line, or for running a bash shell. It's a steep enough learning curve that it can be discouraging, even though we've all heard how great the rewards should be. But the rewards really are great once you get there.
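As a rough sketch of what that longer command might look like, the function below adds two volume mounts to the minimal command. The container-side paths are the ones that appear in the bulker-generated wrapper shown later on this page; the function name `rstudio_devel`, the `DRY_RUN` switch, and the `LIB_DIR` default are assumptions made purely for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: rstudio_devel, DRY_RUN and LIB_DIR are illustrative names,
# not part of Docker, Bioconductor, or bulker.
LIB_DIR="${LIB_DIR:-${HOME}/R/bioc-devel}"   # host directory where installed packages persist

rstudio_devel() {
  # Build the command as positional parameters so each argument stays intact.
  set -- docker run --rm \
    -e DISABLE_AUTH=true \
    -p 8787:8787 \
    -v "${LIB_DIR}:/usr/local/lib/R/host-site-library" \
    -v "${HOME}:/home/rstudio" \
    bioconductor/bioconductor_docker:devel
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"               # print the command instead of running it
  else
    mkdir -p "${LIB_DIR}" && "$@"    # create the package directory, then launch
  fi
}
```

Running `DRY_RUN=1 rstudio_devel` prints the assembled command without starting a container, which is a cheap way to inspect the options before committing to them.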

Bulker

To simplify this I am now using bulker.io for everyday Docker use on my laptop and server. It allows me to decide on the Docker images, tags, and options I want to use, to configure these, and then to forget about it and go about work using the same commands I would use if I had installed the software. For example, instead of running the above command I now just run rstudio-server-dev for the development version of Bioconductor or rstudio-server for the release version, or both at the same time. It maps my home directory and user permissions into the container, so I can continue using my home directory (and other mount points) as if I hadn't moved into a Docker container. And I don't have to define a bunch of long aliases; this is what Bulker does.

Installing Bulker

These instructions assume you can run some things on the command line. Bulker has good documentation, but here is a very quick-start for Bioconductor users.

  1. Install bulker with pip (if you don't have pip, install it using Homebrew, your Linux package manager, the instructions from python.org, etc.):
pip install --user bulker
  2. Initialize bulker. You might put the export line in your ~/.bash_profile if you're going to keep working in other shells, but the initialization itself only needs to be done once. I'm assuming you're working from your home directory just to simplify things a bit.
cd
export BULKERCFG="${HOME}/bulker_config.yaml"
bulker init -c "$BULKERCFG"
  3. Edit your bulker_config.yaml file. Look at my bulker_config.yaml file, and copy the tool_args section (you can skip the curatedmetagenomics part unless you're interested in metagenomics too). You can change where you mount your host package directories by replacing ${HOME}/R/bioc-release and/or ${HOME}/R/bioc-devel.

Note: on our lab server, I made this change to put R libraries in a shared location instead of in my home directory (e.g. --volume=/usr/local/lib/R/site-library:/usr/local/lib/R/host-site-library instead of --volume=${HOME}/R/bioc-release). All that others have to do now is edit their .bash_profile like here, pointing the environment variable $BULKERCFG at my config file, to share the same Docker images and R package directories.

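  4. Mac OS X users need to run an extra script for compatibility; Linux users do not. I don't yet know about Windows. (The tip is from here.) This will modify your $BULKERCFG file (~/bulker_config.yaml by default). Note that the download must use the raw file URL, since the github.com/.../blob/... address returns an HTML page rather than the script.
wget https://raw.githubusercontent.com/databio/bulker/master/fix_mac_user.sh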
chmod 755 fix_mac_user.sh
./fix_mac_user.sh
  5. Install some crates, like my Bioconductor one:
bulker load waldronlab/bioconductor

This should work, but if you want to add to or modify what I've mapped to containers, download the source bioconductor.yaml file, edit it, then load it with bulker load -f bioconductor.yaml.

  6. Initialize the crate. This is a convenient line to add to your ~/.bash_profile so it runs in every shell you open:
bulker activate waldronlab/bioconductor

You're done. You can now see what you've set up; for example, here is what the command R invokes on my laptop:

waldronlab/levi| ~$ which R
/Users/lwaldron/bulker_crates/waldronlab/bioconductor/default/R
waldronlab/levi| ~$ cat `which R`
#!/bin/sh

docker run --rm --init \
  -it --volume=/Users/lwaldron/R/bioc-release:/usr/local/lib/R/host-site-library -e DISABLE_AUTH=true -p 8787:8787 -v /Users/lwaldron:/home/rstudio \
  --user=$(id -u):$(id -g) \
  --network="host" \
  --env "DISPLAY" \
  --volume "$HOME:$HOME" \
  --volume="/etc/group:/etc/group:ro" \
  --volume="/Users/lwaldron/templates/mac_passwd:/etc/passwd:ro" \
  --volume="/etc/shadow:/etc/shadow:ro"  \
  --volume="/etc/sudoers.d:/etc/sudoers.d:ro" \
  --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" \
  --workdir="`pwd`" \
  waldronlab/bioconductor:release R "$@"

Like I said, Docker command-line options can be brutal, and I honestly don't really know what many of those options do. But thanks to Bulker, I don't have to.

  7. Install any other crates that might be useful to you.

Repeat steps 5 and 6 for any other crates that look useful to you, including others at https://github.com/databio/hub.bulker.io/tree/master/databio or custom local "crates" you make. For example, I also use waldronlab/metagenomics and waldronlab/levi. bulker activate accepts a comma-separated list of crates, such as bulker activate waldronlab/bioconductor,waldronlab/levi,waldronlab/metagenomics.

  8. If you don't want bulker to re-initialize your shell, see https://bulker.databio.org/en/latest/tips/.
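The initialization and activation steps above each suggest adding a line to ~/.bash_profile. One way to do that without duplicating lines on repeated runs is sketched below. The helper names `append_once` and `join_crates` are hypothetical (they are not bulker commands), and the crate names in the example are just the ones mentioned on this page.

```shell
#!/usr/bin/env bash
# Sketch only: append_once and join_crates are hypothetical helpers, not bulker commands.

# Append a line to a file unless an identical line is already present.
append_once() {
  local file="$1" line="$2"
  touch "$file"
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Join crate names with commas, the form that `bulker activate` accepts.
join_crates() {
  local IFS=,
  printf '%s\n' "$*"
}

# Example usage (uncomment to modify your own profile):
# append_once "${HOME}/.bash_profile" 'export BULKERCFG="${HOME}/bulker_config.yaml"'
# append_once "${HOME}/.bash_profile" "bulker activate $(join_crates waldronlab/bioconductor waldronlab/levi)"
```

Because `append_once` matches whole lines literally, re-running a setup script leaves the profile unchanged rather than stacking up repeated export and activate lines.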

Using Singularity on an HPC

To set up on the CUNY HPC, I added the following to my .bash_profile to get access to pip, python3, and singularity at shell startup, and to define the $BULKERCFG variable:

module load python/miniconda3
module load singularity
export BULKERCFG="${HOME}/bulker_config.yaml"

Then I installed bulker:

pip install --user bulker

Then, obeying a warning I got, I also added the following to my ~/.bash_profile:

export PATH="$HOME/.local/bin:$PATH"

Then I logged out and in again to make sure the above changes took effect, and continued with the above instructions:

[levi.waldron@karle ~]$ bulker init -c $BULKERCFG
Guessing container engine is singularity.
Wrote new configuration file: /scratch/levi.waldron/bulker_config.yaml

I copied some "crates" settings into this file from my bulker_config.yaml file, and then did some bulker load commands (the -b flag pulls the singularity image right away, which gets around a current issue with auto-pulls in bulker and Singularity 3, see issue).

bulker load waldronlab/bioconductor -b
bulker load waldronlab/metagenomics -b
bulker load waldronlab/levi -b

And I added the following to my .bash_profile, as before (obviously only keep the crates you want, maybe just waldronlab/bioconductor):

bulker activate waldronlab/levi,waldronlab/metagenomics,waldronlab/bioconductor

TODO: if sharing this setup among the lab, set up permissions on my $BULKERCFG environment variable and ${HOME}/simages directory so everyone can use the same setup and images.
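If bulker init on a cluster does not behave as shown above, a first thing to check is whether the tools loaded in .bash_profile are actually on the PATH in a fresh login shell. A minimal sketch follows; `check_tools` is a hypothetical helper, and the tool list in the example simply mirrors the ones named in this section.

```shell
#!/usr/bin/env bash
# Sketch only: check_tools is a hypothetical helper, not part of bulker.

# Report whether each named command is on the PATH;
# return non-zero if any of them is missing.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok:      $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example: check_tools python3 pip singularity bulker
```

A "missing: bulker" line after a successful pip install would point at the ~/.local/bin PATH entry mentioned above.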

Conclusions

I'm using the waldronlab/bioconductor Docker containers to run R/Bioconductor, which are based on bioconductor/bioconductor_docker but add the full texlive distribution because I like being able to build most vignettes without hassle. You could change this by loading your own version of waldronlab/bioconductor.yaml above and re-initializing.

If this doesn't work for you, let me know. For many more options and flexibility, see the Bulker documentation.

Using R/Bioconductor in Docker with Bulker

From here it's pretty easy. The first time you run a command, the Docker image will be pulled.

  • R and Rscript work like the command-line versions, including things like R CMD build and R CMD check (the way I've set things up, this is the version of R corresponding to bioc-release). Install packages the regular way (BiocManager::install()); they will persist in ~/R/bioc-release (you can change this default in your bulker_config.yaml). After a major upgrade of R or gcc you'll have to clean this directory and install packages anew. Otherwise you can just keep updating the packages until you run into weird troubles about packages having been installed with the wrong R or C compiler version.
  • Rdev and Rscriptdev for the R version corresponding to bioc-devel. Installed packages go in ~/R/bioc-devel.
  • _R or _Rdev to open bash shells in the above two containers. Or in general, prepend _ to any command to open a shell in that Docker container.
  • rstudio-server launches Bioconductor release on http://localhost:8787 (again, edit the bulker_config.yaml if you don't like this choice, if you want to set a password, etc.).
  • rstudio-server-dev launches Bioconductor devel on http://localhost:8788. Note that if you run these two RStudio instances at the same time, you may have to access them in different browsers or in "private" browsing windows. You'll get some funny messages about changing R versions when you run release and devel side by side, but these seem harmless.
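The note above about cleaning the package directory after a major R or gcc upgrade can be sketched as follows. Rather than deleting anything, this version moves the old library aside and creates an empty one. `reset_r_library` is a hypothetical helper, and its default path assumes the ~/R/bioc-release location used on this page.

```shell
#!/usr/bin/env bash
# Sketch only: reset_r_library is a hypothetical helper, not part of bulker or R.

# Move an R package library aside (with a timestamp suffix) and recreate it empty.
reset_r_library() {
  local lib="${1:-${HOME}/R/bioc-release}"
  local backup
  backup="${lib}.bak.$(date +%Y%m%d%H%M%S)"
  if [ -d "$lib" ]; then
    mv -- "$lib" "$backup" || return 1
    echo "moved old library to $backup"
  fi
  mkdir -p -- "$lib"
}

# Example: reset_r_library "${HOME}/R/bioc-release"
```

Afterwards, reinstall packages with BiocManager::install() as usual; the backup directory can be deleted once everything works.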

That's all. I'm also using Bulker for other things (pdftk, emacs, and jekyll, see the waldronlab/levi crate), and it even makes piping between commands run in different Docker containers easy. But I won't get into that here. For using and developing in R and Bioconductor, the above is all I've needed so far.