tags |
---|
ggg, ggg2024, ggg298 |
Foo!
Looking at your results; walk through of my approach.
Avoid making people think the first time: "just get it working". This is the idea behind "Quickstarts".
A skeleton of working commands is more important than detailed docs, because approximately no one reads the details (including you) until AFTER they get the thing running.
Copy/paste to/from the command line is pretty easy, and then you know the commands work! Also, you can use history.
Quickstarts and READMEs are really great ways to document things for future you, as well as other people.
(Note the implications for automation: things are easy to automate in the shell, in part because copy/paste is so much easier there than in graphical interfaces.)
Use the instructions here. It will be helpful to have an editor available, so RStudio Server is recommended.
For the srun command, use:
srun -p high2 --time=3:00:00 --nodes=1 --cpus-per-task 4 \
--mem 5GB --pty /bin/bash
which asks for 4 CPUs. We'll use these later!
Let's create a new conda environment that has fastqc, sourmash, and snakemake in it. (We might not use snakemake today but it's nice to have it available.)
module load mamba
mamba create -y -n automation \
snakemake-minimal fastqc sourmash
then activate:
mamba activate automation
and let's work in the ggg298-lab-5 subdirectory:
mkdir -p ~/ggg298-lab-5
cd ~/ggg298-lab-5
We've seen so many different things that all look kind of similar:
- current working directory (cd) - this is where your files reside, and lets you work with relative file paths. UNIX-like systems generally.
- module system (module load) - this enables pre-installed software, usually on HPCs.
- mamba environment (mamba activate) - this dictates what software is available, using the conda/mamba software ecosystem.
- Slurm (srun) - this reserves compute resources (usually on HPCs).
We will also soon see git repositories. But I think that's the end :)
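Since these four contexts are easy to mix up, here is a minimal sketch of how you might check which ones you are currently "inside". The function name show_context is made up for this example. It assumes that CONDA_DEFAULT_ENV is set by conda/mamba activation and SLURM_JOB_ID is set inside a Slurm job, which is the usual behavior but worth checking on your own cluster.

```shell
# show_context is a hypothetical helper name, not part of any tool.
show_context() {
    # current working directory
    echo "working directory: $(pwd)"
    # set by 'mamba activate' / 'conda activate' (assumed; empty if nothing is active)
    echo "conda/mamba env:   ${CONDA_DEFAULT_ENV:-none active}"
    # set by Slurm inside a job such as an srun session (assumed)
    echo "Slurm job id:      ${SLURM_JOB_ID:-not inside a Slurm job}"
    # 'module' is only defined on systems with a module system installed
    if type module >/dev/null 2>&1
    then
        module list
    else
        echo "modules:           no module system found"
    fi
}

show_context
```

Running this before and after each of cd, module load, mamba activate, and srun is one way to see which command changes which piece of your environment.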
Set things up:
cd ~/ggg298-lab-5
cp ~ctbrown/data/sulfo/* .
Then run the sourmash commands:
sourmash sketch dna a.fa.gz --name 'Sulfurihydrogenibium' -o a.sig.zip
sourmash sketch dna b.fa.gz --name 'Sulfitobacter sp. EE-36' -o b.sig.zip
sourmash sketch dna c.fa.gz --name 'Sulfitobacter sp. NAS-14.1' -o c.sig.zip
sourmash compare *.sig.zip -o sulfo.cmp
sourmash plot sulfo.cmp
and let's take a look at the output.
(Digression: What is sourmash doing? Is anyone curious? :)
First, clean up the directory so it's only got the fa.gz files in it:
rm *.sig.zip
rm sulfo.cmp*
Now, let's create a text file in the ~/ggg298-lab-5 subdirectory named run-sourmash.sh and paste the commands in there:
sourmash sketch dna a.fa.gz --name 'Sulfurihydrogenibium' -o a.sig.zip
sourmash sketch dna b.fa.gz --name 'Sulfitobacter sp. EE-36' -o b.sig.zip
sourmash sketch dna c.fa.gz --name 'Sulfitobacter sp. NAS-14.1' -o c.sig.zip
sourmash compare *.sig.zip -o sulfo.cmp
sourmash plot sulfo.cmp
Now, save it, and type at the command line:
bash run-sourmash.sh
What happens??
Welcome to your first shell script!!!
At the top of run-sourmash.sh, put:
set -e
set -x
and run it again:
bash run-sourmash.sh
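If it's not obvious what those two lines changed, here is a small sketch that isolates them. In bash, set -e makes the script stop at the first command that fails, and set -x prints each command before running it. The file name demo-flags.sh and the commands inside it are made up for illustration and have nothing to do with sourmash.

```shell
# demo-flags.sh is a hypothetical file name used only for this illustration.
cat > demo-flags.sh <<'EOF'
set -e   # exit as soon as any command fails
set -x   # print each command (prefixed with +) before running it
echo "step 1"
false    # a command that always fails
echo "step 2 (never reached)"
EOF

# The script exits with an error at 'false', so the message after || is printed.
bash demo-flags.sh || echo "script stopped with an error"
```

"step 2" never shows up, because set -e stops the script at the failing command. Try deleting the set -e line and running it again to compare.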
Clean up again:
rm *.sig.zip sulfo.cmp*
Now make a new file run-sourmash-2.sh with the following in it:
set -e
set -x
for genome in *.fa.gz
do
    # strip the .fa.gz suffix to get a short name, e.g. a.fa.gz -> a
    name=$(basename "$genome" .fa.gz)
    sourmash sketch dna "$genome" --name "$name" -o "$name.sig.zip"
done
sourmash compare *.sig.zip -o sulfo.cmp
sourmash plot sulfo.cmp
What's going on?
And how does the output differ now?
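One way to see what the loop is doing is to run it without sourmash and just print what each pass would use. This is a sketch with made-up paths (no files need to exist). It shows that basename drops both the leading directory and the given suffix, so the sketch name becomes the short file stem rather than a species name.

```shell
# The paths here are made up for illustration; no files need to exist.
for genome in data/a.fa.gz data/b.fa.gz
do
    # basename removes the directory part and the .fa.gz suffix
    name=$(basename "$genome" .fa.gz)
    echo "$genome -> $name.sig.zip"
done
# prints:
# data/a.fa.gz -> a.sig.zip
# data/b.fa.gz -> b.sig.zip
```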
What happens when you rerun the shell script? Does it do unnecessary things - things that are already done?
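If you wanted to avoid redoing finished work in plain bash, one possible approach (a sketch, not the only way) is to skip a genome when its output already exists and is not older than the input. The helper name needs_sketch is made up for this example. The -nt file test ("newer than") is a bash feature.

```shell
# needs_sketch is a hypothetical helper: it succeeds if the output file
# is missing, or if the input file is newer than the output file.
needs_sketch() {
    local input=$1 output=$2
    [ ! -e "$output" ] || [ "$input" -nt "$output" ]
}

for genome in *.fa.gz
do
    # if no files match, the pattern is left as-is; skip it
    [ -e "$genome" ] || continue
    name=$(basename "$genome" .fa.gz)
    if needs_sketch "$genome" "$name.sig.zip"
    then
        sourmash sketch dna "$genome" --name "$name" -o "$name.sig.zip"
    else
        echo "skipping $genome: $name.sig.zip is up to date"
    fi
done
```

This works, but the bookkeeping gets tedious as the number of steps grows. Workflow tools such as snakemake exist largely to handle this kind of check for you.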
What advantages are there to shell scripts?
- automated / "batch" running
- documentation, sort of! (better than documentation?)