-
Notifications
You must be signed in to change notification settings - Fork 0
/
Lesson-Contain.qmd
220 lines (164 loc) · 9.58 KB
/
Lesson-Contain.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
---
title: "Biostat 823 - Containerization"
author: "Hilmar Lapp"
institute: "Duke University, Department of Biostatistics & Bioinformatics"
date: "Oct 10, 2024"
format:
revealjs:
slide-number: true
editor: visual
knitr:
opts_chunk:
echo: TRUE
---
## Challenges to computational reproducibility
Reproducibility of computational research faces four major challenges^[Boettiger, C. (2015). An introduction to Docker for reproducible research. ACM SIGOPS Operating Systems Review, 49(1), 71–79. <https://doi.org/10.1145/2723872.2723882>]:
* Dependency Hell
* Imprecise documentation
* Code rot
* High barriers to adoption for solutions prior to containerization
## "Dependency Hell"
* Software dependencies have themselves dependencies, recursively
* Dependencies can be be often difficult to install (require compilation, manual "tweaks" due local OS or other differences, etc)
* Required version may conflict with that required by other software, or may not work with the local OS version, making it impossible to install.
- The likelihood of conflicts is particularly high on shared computing environments.
## Imprecise documentation
* Research grade software often lacks full documentation on how to install and run it
- The resulting barriers can be time consuming,
- or even unsurmountable, especially for those unfamiliar with the domain or software.
## Code rot
* Dependencies often continue to be developed further
- Resulting changes in behavior or input/output formats can be breaking changes due to violating expectations.
- Behavior and other breaking changes can be the result of bug fixes.
- This can happen anywhere in the dependency chain.
* Dependencies can also become unmaintained or end-of-life
- Can result in removal from package repositories.
- Python 2.x example; CRAN package removal by policy
## Virtual Machine as solution?
* Pros: Creates a full computational environment with everything pre-installed, addressing Dependency Hell, Imprecise Documentation, and Code Rot issues.
* Cons:
- "Heavyweight": huge in size, only few can run on a host
- Therefore, very limited for reuse through composing
- Effectively "black boxes" in the absence of fully automated build definition
- Technologies for automated and declarative build definitions very difficult to adopt for domain scientists
## Containers are lightweight
![Virtual Machines vs Containers (from [Docker vs Virtual Machines (VMs) : A Practical Guide to Docker Containers and VMs](https://www.weave.works/blog/a-practical-guide-to-choosing-between-docker-containers-and-vms))](images/VM vs container.jpeg)
## A brief history
- 1979: [chroot](https://en.wikipedia.org/wiki/Chroot) on Unix V7
- 2000: [jail](https://en.wikipedia.org/wiki/FreeBSD_jail) command on FreeBSD
- 2002: [Linux namespaces](https://en.wikipedia.org/wiki/Linux_namespaces)
- 2007 (v1) and 2013--16 (v2): Linux control groups ([cgroups](https://en.wikipedia.org/wiki/Cgroups))
- 2008: [Linux Containers (LXC)](https://en.wikipedia.org/wiki/LXC)
- 2013: [Docker](https://www.docker.com)
- 2015: [Singularity](https://en.wikipedia.org/wiki/Singularity_(software)) and ([since 2021](https://apptainer.org/news/community-announcement-20211130/)) [Apptainer](https://apptainer.org)
::: aside
There are many [OS-level virtualization](https://en.wikipedia.org/wiki/OS-level_virtualization) systems. LXC, and especially *Docker*, and *Apptainer/Singularity* are by far the most important ones.
:::
## Properties of containerized processes {.smaller}
* On Linux, containers use the host's kernel and CPU
- No hardware emulator, hypervisor, or guest OS
- Container engine uses Linux' native virtualization capabilities
- Processes within container are visible by host kernel
- Containers are portable within limits determined by kernel version dependencies
- Hosts can run 100s of containers
![](images/docker-for-mac.png){fig-align="right" width="30%" style="float: right"}
<br/><br/>
* On Windows and macOS, requires a Linux VM
- Part of the Docker installation (uses [WSL/WSL2](https://learn.microsoft.com/en-us/windows/wsl/about) on Windows; LinuxKit / Hypervisor Framework on macOS)
- Apptainer can [use WSL/WSL2 on Windows](https://apptainer.org/docs/admin/main/installation.html#windows), with access to GPUs; [on macOS](https://apptainer.org/docs/admin/main/installation.html#mac), requires [Lima](https://lima-vm.io) as VM host (no GPU)
::: aside
Figure modified from [Gianluca Quercini, Cloud computing -- Docker Primer](https://gquercini.github.io/courses/cloud-computing/references/docker-primer/)
:::
## Container terminology
<img src="images/docker-architecture.png" width="80%"/>
::: aside
From [ELIXIR containers nextflow: Docker](https://biocorecrg.github.io/ELIXIR_containers_nextflow/docker.html)
:::
## Apptainer / Singularity: Containers for HPC {.smaller}
* HPC systems are shared computing environments
- Docker daemon runs as root, processes within container can run as root
- Not permissible on a shared computing environment
* Apptainer does not require elevated privileges
- Launcher run by user, not a daemon run by root
- Processes inside container run as same user as outside
* Apptainer containers can be built (bootstrapped) from (many) Docker container images
- Most scientific software containers are compatible
- Issues can occur for containers that run services under a privileged user (httpd, database server, etc)
## Apptainer / Singularity vs Docker
![](images/singularity_architecture.png)
::: aside
From [ELIXIR containers nextflow: Introduction to Singularity](https://biocorecrg.github.io/ELIXIR_containers_nextflow/singularity.html)
:::
## Containerization re: Reproducibility
* Dependency Hell:
- all dependencies are pre-built into the container image
* Imprecise Documentation:
- Container definition includes full installation commands
- Simple text file, lends itself to version control
* Code rot
- Installation can dictate exact versions
- Image spec. can include version tag ("latest" is default)
- Can archive container image for perpetuity
## Low barrier to adoption
Container definition files are simple text files:
```docker
FROM ubuntu:20.04
# formerly LABEL maintainer="[email protected]"
LABEL org.opencontainers.image.authors="[email protected]"
# picard requires java
RUN apt-get update && apt-get install -y \
wget \
openjdk-8-jre-headless
# Installs fastqc from compiled java distribution into /opt/FastQC
ENV PICARD_VERSION="2.10.7"
ENV PICARD_URL https://github.com/broadinstitute/picard/releases/download/${PICARD_VERSION}/picard.jar
WORKDIR /opt/picard
RUN wget $PICARD_URL
CMD ["java", "-jar", "picard.jar"]
```
## Low barrier to reuse
* As simple text files, container definition files can be easily shared, collaboratively developed, and maintained using version control.
* For sharing ready-to-use container images, a number of registries exist, including
- [Docker Hub](https://hub.docker.com/)
- [Quay.io](https://quay.io)
- [GitHub Packages](https://ghcr.io) Repository (includes container images)
- Gitlab container registry (gitlab-registry.oit.duke.edu for Duke OIT's Gitlab installation)
## (Note) Multi-stage builds
* Build layers are read-only
- Deleting files from a preceding layer will not delete them from the image
* Multi-stage builds^[References for [Docker](https://docs.docker.com/build/building/multi-stage/) and [Singularity](https://docs.sylabs.io/guides/latest/user-guide/definition_files.html#multi-stage-builds)]
- Multiple container builds in one container definition
- Use to retain build products but not the software environment needed to create them (which can be large)
## (Note) Build docker, run apptainer
* Building Docker container images typically more flexible
- No Apptainer Desktop version for Windows or macOS (requires Linux VM instead)
- Container build instructions may cause problems with `apptainer build` in unprivileged environment (which uses `--fakeroot` by default)
* Apptainer can use (most) Docker images directly
- Can download and run in one step:
```shell
$ apptainer run docker://<docker_url> <cmd>
```
## (Note) Mounting data into container {.smaller}
Requires bind mount at container runtime (`docker run`):
* [Docker](https://docs.docker.com/engine/storage/bind-mounts/):
`--volume <local-path>:<container-path>`
or
`--mount type=bind,source=<local-path>,target=<container-path>`
Using `--mount` generates an error if target directory (or file) doesn't exist
* [Apptainer](https://apptainer.org/docs/user/main/bind_paths_and_mounts.html#user-defined-bind-paths):
`--bind <local-path>:<container-path>`
Or use `--mount` (see above).
* Can be used for directories and files
## Resources (I)
* [Dockerfile reference](https://docs.docker.com/engine/reference/builder/)
* [Docker command line reference](https://docs.docker.com/engine/reference/commandline/cli/)
* [Apptainer file reference](https://apptainer.org/docs/user/main/definition_files.html)
* [Apptainer command line reference](https://apptainer.org/docs/user/main/cli.html)
* [Open Containers Initiative (OCI) standard for annotations](https://specs.opencontainers.org/image-spec/annotations/)
## Resources (II)
* [Intro to Docker Workshop](https://imageomics.github.io/docker-workshop/) (Based on Carpentries Incubator lesson)
* [Into to Singularity Workshop](https://carpentries-incubator.github.io/singularity-introduction/) (Carpentries Incubator lesson)
* [DCC OnDemand](https://dcc-ondemand-01.oit.duke.edu/)
* [Jupyter Docker Stacks](https://jupyter-docker-stacks.readthedocs.io/)
- Customized [Biostat Jupyter Docker container](https://github.com/Duke-GCB/biostat-jupyter)