A Snakemake workflow for encrypting files before uploading them to the European Genome-Phenome Archive (EGA).
The workflow produces EGA-compliant encrypted files, along with files holding the encrypted and unencrypted MD5 checksums, for each file in a specified input directory. The output is placed in a separate, user-specified folder that mirrors the directory structure of the input folder. The actual encryption is done by the EGA-Cryptor Encryption Utility.
The workflow will output three files per input file:
- file.gpg (encrypted file)
- file.md5 (MD5 checksum of the original file)
- file.gpg.md5 (MD5 checksum of the encrypted file)
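As a rough illustration of the naming and directory mirroring described above, the sketch below prints the three output paths expected for one input file. The function name expected_outputs is hypothetical and not part of the workflow, and it assumes the suffixes are appended to the full original file name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the workflow): print the three output paths
# expected for one input file, mirroring the input directory structure.
# Arguments: <input_dir> <output_dir> <file inside input_dir>
expected_outputs() {
    local input_dir="${1%/}" output_dir="${2%/}" file="$3"
    # Path of the file relative to the input directory
    local rel="${file#"$input_dir"/}"
    # Assumption: suffixes are appended to the complete original file name
    printf '%s\n' \
        "$output_dir/$rel.gpg" \
        "$output_dir/$rel.md5" \
        "$output_dir/$rel.gpg.md5"
}
```

For example, expected_outputs data/in data/out data/in/run1/sample.bam would list paths under data/out/run1/.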
The EGACryptor v.2.0.0 is a Java-based application that enables submitters to produce EGA-compliant encrypted files, along with files holding the encrypted and unencrypted MD5 checksums, for each file to be submitted. The application generates an output folder that by default mirrors the directory structure containing the original files. This output folder can subsequently be uploaded to the EGA FTP staging area via an FTP or Aspera client.
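Before uploading, it can be useful to confirm that a file still matches its checksum file. The sketch below is one possible way to do that, not part of the workflow itself. It assumes that the .md5 file contains only the hexadecimal digest and that the md5sum command (GNU coreutils) is available; on macOS a different tool would be needed. The function name verify_md5 is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit status 0) when the MD5 of <file>
# equals the digest stored in <checksum_file>.
# Assumption: the checksum file holds only the hex digest.
verify_md5() {
    local file="$1" sum_file="$2"
    local expected actual
    # Strip whitespace and normalise to lowercase hex
    expected="$(tr -d '[:space:]' < "$sum_file" | tr 'A-F' 'a-f')"
    # md5sum prints "<digest>  <path>"; keep only the digest
    actual="$(md5sum "$file" | cut -d ' ' -f 1)"
    [ "$expected" = "$actual" ]
}
```

A call such as verify_md5 file.gpg file.gpg.md5 returns a non-zero status when the two digests differ.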
The link to the GitHub repository of Ega-Cryptor is available here, and the Ega website with information about the application is available here.
- Installation
- Linux
- Windows
- macOS
- Download Ega-Cryptor
- Snakemake Environment
- Clone the repository
- Snakemake Usage
- Run the pipeline in a HPC
- Configuration of the pipeline
- Cluster Configuration
Linux Installation
Open a Linux shell, then run these four commands to create the installation directory, quietly download the latest 64-bit Linux Miniconda 3 installer under a shorter file name, silently install it, and then delete the installer.
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
After installing, initialize your newly-installed Miniconda. The following commands initialize for bash and zsh shells:
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
You should see (base) in the command line prompt. This tells you that you’re in your base conda environment. To learn more about conda environments, see Environments.
Check for a good installation with:
conda --version
# conda 24.X.X
conda list
# outputs a list of packages installed in the current environment (base)
Windows Installation
Since most of the packages the pipeline needs are not available on Windows, we need to install the Windows Subsystem for Linux (WSL). In Windows PowerShell:
wsl --install
# This command will install the Ubuntu distribution of Linux.
If you run into an issue during the installation process, please check the installation section of the troubleshooting guide.
Once you have installed WSL, you will need to create a user account and password for your newly installed Linux distribution. See the Best practices for setting up a WSL development environment guide to learn more.
Once you have a working shell in your WSL, run these four commands to create the installation directory, quietly download the latest 64-bit Linux Miniconda 3 installer under a shorter file name, silently install it, and then delete the installer.
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
After installing, initialize your newly-installed Miniconda. The following commands initialize for bash and zsh shells:
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
You should see (base) in the command line prompt. This tells you that you’re in your base conda environment. To learn more about conda environments, see Environments.
Check for a good installation with:
conda --version
# conda 24.X.X
conda list
# outputs a list of packages installed in the current environment (base)
macOS Installation
These four commands create the installation directory, download the latest Apple Silicon (arm64) version of the macOS installer under a shorter file name, silently install it, and then delete the installer:
mkdir -p ~/miniconda3
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh -o ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
After installing, initialize your newly-installed Miniconda. The following commands initialize for bash and zsh shells:
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
You should see (base) in the command line prompt. This tells you that you’re in your base conda environment. To learn more about conda environments, see Environments.
Check for a good installation with:
conda --version
# conda 24.X.X
conda list
# outputs a list of packages installed in the current environment (base)
The Ega-Cryptor application is available to download at this link. Unzip it and save it in a directory on your machine or on the cluster.
Now, with miniconda installed on our machine, we can create a new environment with snakemake installed:
conda create -c conda-forge -c bioconda -n snakemake snakemake
If you are downloading from the Clinic network, you may have trouble with the SSL certificate. To solve this, do the following:
- Get a copy of the certificate. You can usually do this by clicking on the padlock icon in your browser when visiting any https site, then viewing the certificate and downloading it in PEM format.
- Then point conda to it on your system:
conda config --set ssl_verify <pathToYourFile>.pem
Once installed, we must activate and move into the snakemake environment with:
conda activate snakemake
snakemake --version
# 8.25.X
If at any time we want to exit the environment, we can do so with conda deactivate, and get back in with conda activate snakemake.
To see the packages currently installed in the environment, we can use conda list.
- Above the list of files, click Code.
- Copy the URL for the repository. To clone the repository using HTTPS, under "HTTPS", copy the link provided.
- Open a Terminal.
- Change the current working directory to the location where you want the cloned directory, for example cd ega_cryptor. Make sure that the directory exists before you move into it.
- Type git clone git@github.com:lymphIDIBAPS/Ega_Cryptor.git.
- Press Enter to create your local clone.
git clone git@github.com:lymphIDIBAPS/Ega_Cryptor.git
> Cloning into `ega_cryptor`...
> remote: Counting objects: 10, done.
> remote: Compressing objects: 100% (8/8), done.
> remote: Total 10 (delta 1), reused 10 (delta 1)
> Unpacking objects: 100% (10/10), done.
Once we have cloned the repository, we can proceed to configure the application. To do this, we edit the config/config.yaml file.
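As a minimal sketch of what such a file might look like, the snippet below writes an example configuration using the four option names documented in this README (path_to_ega, input, output, resources). The flat top-level layout, the placeholder paths, and the function name write_example_config are assumptions for illustration; check the config/config.yaml shipped with the repository for the real structure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write an example configuration to the given path
# (default: config.example.yaml) so the real config/config.yaml is not touched.
write_example_config() {
    local target="${1:-config.example.yaml}"
    cat > "$target" <<'EOF'
# Placeholder values only; adjust every path to your own setup.
# Assumption: all options are plain top-level keys.
path_to_ega: /path/to/ega-cryptor.jar
input: /path/to/files_to_encrypt
output: /path/to/encrypted_files
resources: medium
EOF
}
```

Writing to a separate example file first makes it easy to compare the sketch against the configuration in the repository before changing anything.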
The simple rulegraph for our pipeline at date 14/11 is the following:
In order to run the application, you must execute:
# For a test run of the pipeline
snakemake --use-conda -np
# For a real run of the pipeline
snakemake --use-conda
If we have many files to encrypt and our computer does not have enough computational power, we can run the application on a cluster. This pipeline has been prepared to run on the StarLife cluster at the BSC.
- Make a new directory named slgpfs on your computer and mount the /slgpfs/ directory from StarLife on it:
mkdir /home/user/slgpfs
sshfs -o allow_other [email protected]:/slgpfs/ /home/user/slgpfs/
This will allow you to see and work on the cluster directly from your computer.
- On your computer, navigate to the directory: /home/user/slgpfs/projects/group_folder
- Download and extract the following file to the directory, in which we have a full conda environment ready to run snakemake: Snakemake Conda Environment
- Clone this repository in the directory, following the steps from Clone the repository
- Now, connect to the cluster:
ssh [email protected] # or
ssh [email protected]
- In the cluster, navigate to the cloned repository: /slgpfs/projects/group_folder/ega_cryptor
- Now, activate the snakemake_bsc environment:
source ../snakemake_bsc/bin/activate
In your terminal, you should now see something like: (snakemake_bsc) your_username@sllogin1
- Now, you can run the pipeline from the cluster with the command:
# For a test run of the pipeline
snakemake --profile config/slurm/ --use-envmodules -np
# For a real run of the pipeline
snakemake --profile config/slurm/ --use-envmodules
The command above runs the pipeline with the pipeline configuration from the file located in /slgpfs/projects/group_folder/ega_cryptor/config/config.yaml. Be sure to check and modify the configuration file to set your desired options.
The cluster configuration file is located in /slgpfs/projects/group_folder/ega_cryptor/config/slurm/config.yaml. Below you have all the options available to customize your cluster run.
- path_to_ega: path to the Ega-Cryptor .jar file on your machine, downloaded in the Download Ega-Cryptor section of this readme.
- input: the directory where the files to encrypt are located.
- output: the directory where the encrypted files will be written.
- resources: how much of the machine the application may use; can be full, medium or low. Full uses all the threads on the machine, medium uses 75% and low uses only 50%.
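To make the resources levels concrete, the sketch below converts a level into a thread count. It is an illustration only, not the workflow's own code: the function name threads_for_level is hypothetical, and rounding down with a minimum of one thread is an assumption, since the README only states the percentages. When no thread total is passed, it falls back to nproc, which is available on Linux.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a resources level to a number of threads.
# Arguments: <level: full|medium|low> [total_threads, default: nproc]
threads_for_level() {
    local level="$1" total="${2:-$(nproc)}"
    local threads
    case "$level" in
        full)   threads=$total ;;                      # all threads
        medium) threads=$(( total * 75 / 100 )) ;;     # 75%, rounded down (assumption)
        low)    threads=$(( total * 50 / 100 )) ;;     # 50%, rounded down (assumption)
        *)      echo "unknown resources level: $level" >&2; return 1 ;;
    esac
    # Assumption: never go below one thread
    if (( threads < 1 )); then
        threads=1
    fi
    echo "$threads"
}
```

On a machine with 8 threads this gives 8 for full, 6 for medium and 4 for low.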
Remember to check the file /config/slurm/config.yaml for the cluster configuration. Review all the items, and if something is not clear, you can check on this website what each term in the configuration means.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Developed by @obaeza16, using an application developed by the European Genome-phenome Archive (EGA).
Maintained by the Lymphoid neoplasms program, IDIBAPS.