mcaitlinv/mega-non-model-wgs-snakeflow

 
 

mega-non-model-wgs-snakeflow

Quick install and run

To install the workflow and test it on a single node (deployment across multiple nodes with SLURM is covered later), clone this repository and then download the pseudo-genome used by the included test data set (in .test).

You must have Snakemake (version > 6.0) in the active environment.
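
One way to confirm the version requirement before going further is sketched below. It relies only on `snakemake --version`; the helper name `version_gt` is made up for this example, and it assumes a `sort` that supports `-V` (GNU coreutils and recent BSD/macOS versions do).

```shell
# Succeeds when version $1 is strictly greater than version $2.
# version_gt is a hypothetical helper, not part of this workflow.
version_gt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

if command -v snakemake >/dev/null 2>&1; then
  installed="$(snakemake --version)"
  if version_gt "$installed" "6.0"; then
    echo "Snakemake $installed is new enough"
  else
    echo "Snakemake $installed is too old; need a version > 6.0" >&2
  fi
else
  echo "snakemake not found in the active environment" >&2
fi
```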

In short, here are the steps to install the workflow and run it on the .test data set.

# clone the repo
git clone git@github.com:eriqande/mega-non-model-wgs-snakeflow.git

# download the tarball with the genome in it and then move that
# into resources/
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1LMK-DCkH1RKFAWTR2OKEJ_K9VOjJIZ1b' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1LMK-DCkH1RKFAWTR2OKEJ_K9VOjJIZ1b" -O non-model-wgs-example-data.tar && rm -rf /tmp/cookies.txt

# untar the tarball
tar -xvf non-model-wgs-example-data.tar

# copy the genome from the extracted tarball into mega-non-model-wgs-snakeflow/resources/
cp non-model-wgs-example-data/resources/genome.fasta mega-non-model-wgs-snakeflow/resources/
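
Before moving on, it may be worth checking that the genome actually landed where the workflow expects it. The sketch below assumes you are still in the directory where you ran the commands above; `check_fasta` is a hypothetical helper that only tests that the file is non-empty and begins with `>`, which is a rough sanity check and not a full FASTA validation.

```shell
# Returns 0 if the file exists, is non-empty, and its first byte is ">".
# check_fasta is a made-up helper for illustration.
check_fasta() {
  [ -s "$1" ] && [ "$(head -c 1 "$1")" = ">" ]
}

genome="mega-non-model-wgs-snakeflow/resources/genome.fasta"
if check_fasta "$genome"; then
  echo "genome looks OK: $genome"
else
  echo "genome missing or not FASTA: $genome" >&2
fi
```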

Once that is set up, you can do a dry run like:

conda activate snakemake
cd mega-non-model-wgs-snakeflow

# set CORES to the number of cores you have access to; it is
# used in the following command.  Here I have 12.  Set yours to
# whatever is appropriate
CORES=12
snakemake --cores $CORES --use-conda --conda-frontend mamba -np

If that gives you reasonable-looking output (165 total jobs, lots of conda environments to be installed, etc.), then take the -np off the end of the command to actually run it:

snakemake --cores $CORES --use-conda --conda-frontend mamba

Installing all the conda packages can take a while (2–30 minutes, depending on your system). Once that was done, running all the steps in the workflow on this small data set took less than 4 minutes on 12 cores of a single node of CU Boulder’s SUMMIT supercomputer.
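
If you would rather not hard-code the core count, the two commands above can be wrapped roughly as follows. This is a sketch, not part of the workflow: `detect_cores` and `dry_run_then_run` are hypothetical names, and the fallback order (SLURM's `SLURM_CPUS_PER_TASK`, then `nproc`, then `getconf`) is an assumption about what is available on your system.

```shell
# Pick a core count: prefer SLURM's allocation, then nproc, then getconf.
detect_cores() {
  if [ -n "${SLURM_CPUS_PER_TASK:-}" ]; then
    echo "$SLURM_CPUS_PER_TASK"
  elif command -v nproc >/dev/null 2>&1; then
    nproc
  else
    getconf _NPROCESSORS_ONLN
  fi
}

# Do the dry run first; only start the real run if the dry run succeeds.
dry_run_then_run() {
  cores="$1"
  snakemake --cores "$cores" --use-conda --conda-frontend mamba -np || return 1
  snakemake --cores "$cores" --use-conda --conda-frontend mamba
}

CORES="$(detect_cores)"
echo "Detected $CORES cores"
```

From inside mega-non-model-wgs-snakeflow, `dry_run_then_run "$CORES"` would then run both steps. Note that this only guards against a dry run that errors out; it does not replace looking over the dry-run output yourself.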

Condensed DAG for the workflow

Here is a DAG for the workflow on the test data in .test, condensed into an easier-to-look-at picture by the condense_dag() function in Eric’s SnakemakeDagR package.
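
If you just want a quick picture without the R package, Snakemake itself can emit a rule-level graph in Graphviz format. The sketch below is a different, simpler technique than condense_dag(): it uses Snakemake's `--rulegraph` option and assumes Graphviz's `dot` is installed. The function name `render_rulegraph` and the default output file name are made up for this example.

```shell
# Write Snakemake's rule graph to the given file as SVG (needs Graphviz's dot).
# render_rulegraph is a hypothetical helper, not part of this workflow.
render_rulegraph() {
  out="${1:-rulegraph.svg}"
  snakemake --rulegraph | dot -Tsvg > "$out"
}
```

Calling `render_rulegraph rulegraph.svg` from inside mega-non-model-wgs-snakeflow should leave an SVG with one node per rule, which is usually much smaller than the full job DAG.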

What the user must do and values to be set, etc

  • Choose an ILLUMINACLIP adapter fasta (in config)

Assumptions

  • Paired end

Things fixed or added relative to JK’s snakemake workflow

  • fastqc on both reads
  • don’t bother with single end
  • add adapters so ILLUMINACLIP can work
  • benchmark each rule
  • use GenomicsDBImport
  • allow for merging of lots of small scaffolds into GenomicsDB
