This Snakemake workflow orchestrates the processing of taxonomic data from multiple sources, including BOLD Systems, Fauna Europaea, and expert contributions. It integrates data, performs gap analysis, and maintains updated species lists.
Copyright 2024 Naturalis Biodiversity Center
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
The workflow combines pre-processed taxonomic information from the following sources:
- BOLD
- Fauna Europaea
- Lepiforum
- WoRMS
- iNaturalist
- input from various experts
At present, the workflow expects these pre-processed files to already be in place before it runs. The plan is to make fetching and pre-processing them part of the overall workflow.
The pipeline consists of four main steps:
- Updating BOLD data through their API
- Combining data from multiple taxonomic sources
- Analyzing coverage gaps
- Generating final updated species lists
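The ordering of these steps can be sketched as a small dependency graph. The rule names below are the targets listed in this README, but the edges between them are assumptions made for illustration; the Snakefile remains the authoritative source for the real dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each rule maps to the rules it is assumed
# to depend on. The edges are illustrative, not read from the Snakefile.
ASSUMED_DEPENDENCIES = {
    "update_bold_data": set(),
    "combine_lists": set(),
    "analyze_gaps": {"combine_lists"},
    "update_final_list": {"update_bold_data", "combine_lists"},
}


def execution_order(dependencies):
    """Return one valid order in which the rules could run."""
    return list(TopologicalSorter(dependencies).static_order())
```

Snakemake resolves this kind of ordering itself from each rule's input and output files, so the sketch only mirrors what `snakemake -n` would report.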
snakemake all
Runs the complete pipeline, generating:
- Updated combined species lists
- Gap analysis reports
- Sorted taxonomic hierarchies
snakemake update_bold_data
Queries the BOLD API for current specimen data.
snakemake combine_lists
Integrates data from all taxonomic sources.
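As a rough illustration of what integrating several sources can involve, the sketch below merges per-source species lists into one table that records where each name was found. The function names, the column names, and the whitespace-only normalisation are all assumptions; the actual rule may reconcile synonyms and authorities in ways not shown here.

```python
import csv
import io
from collections import defaultdict


def combine_species_lists(sources):
    """Merge per-source species lists into {species: sorted source labels}.

    `sources` maps a source label (e.g. "BOLD") to an iterable of names.
    Names are only trimmed of surrounding whitespace in this sketch.
    """
    combined = defaultdict(set)
    for source, names in sources.items():
        for name in names:
            name = name.strip()
            if name:  # skip empty entries
                combined[name].add(source)
    return {sp: sorted(found_in) for sp, found_in in sorted(combined.items())}


def to_csv(combined):
    """Render the combined mapping as CSV text (hypothetical column names)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["species", "sources"])
    for species, found_in in combined.items():
        writer.writerow([species, ";".join(found_in)])
    return buffer.getvalue()
```

Keeping the source labels alongside each name makes it possible to trace an entry in the combined list back to the list that contributed it.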
snakemake analyze_gaps
Generates gap analysis reports.
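One plausible reading of a coverage gap is a species that appears in the combined list but has no record in BOLD. The sketch below works under that assumption; `find_gaps` and `coverage` are hypothetical helpers, not functions from this repository.

```python
def find_gaps(expected_species, barcoded_species):
    """Return expected species that have no barcode record, sorted by name."""
    return sorted(set(expected_species) - set(barcoded_species))


def coverage(expected_species, barcoded_species):
    """Return the fraction of expected species that have a barcode record."""
    expected = set(expected_species)
    if not expected:
        return 0.0  # avoid dividing by zero on an empty list
    return len(expected & set(barcoded_species)) / len(expected)
```

For example, with three expected species of which two are barcoded, `find_gaps` returns the one missing name and `coverage` returns two thirds.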
snakemake update_final_list
Merges latest BOLD data into combined lists.
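A minimal sketch of such a merge, assuming each row of the combined list is a dictionary with a `species` key and that the BOLD update can be reduced to a record count per species. Both the `species` key and the `bold_records` column are hypothetical names chosen for this example.

```python
def merge_bold_counts(combined_rows, bold_counts):
    """Return copies of the rows with a 'bold_records' count attached.

    Species missing from `bold_counts` get a count of 0, so every row in
    the combined list survives the merge.
    """
    return [
        {**row, "bold_records": bold_counts.get(row["species"], 0)}
        for row in combined_rows
    ]
```

Building new dictionaries instead of updating the rows in place leaves the input list untouched, which makes it easier to rerun the step.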
snakemake clean
Removes all generated files and logs.
snakemake generate_docs
Creates documentation for all components.
- Conda or Mamba
- Input data in Raw_Data directory
- Create the conda environment:
conda env create -f environment.yml
- Activate the environment:
conda activate BGE-gaplist
Full pipeline:
snakemake --cores all
Dry run to check execution plan:
snakemake -n
Generate workflow DAG:
snakemake --dag | dot -Tsvg > workflow.svg
- BOLD data update: Single thread, ~2GB memory
- Data combination: Single thread, memory varies with input size
- Gap analysis: Multi-thread capable, memory scales with data size
BGE-gaplist/
├── results/
│   ├── Curated_Data/           # Processed data
│   │   ├── updated_combined_lists.csv
│   │   ├── combined_species_lists.csv
│   │   └── {date}_updated_BOLD_data.csv
│   └── Gap_Lists/              # Analysis results
│       ├── Gap_list_all.csv
│       └── sorted/             # Hierarchical results
└── logs/                       # Process logs
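The `{date}` placeholder in the BOLD file name can be handled as in the sketch below. It assumes the date is written in ISO 8601 form (YYYY-MM-DD), which this README does not confirm; the helper names are hypothetical.

```python
from datetime import date
from pathlib import Path

CURATED_DIR = Path("results") / "Curated_Data"


def bold_output_path(run_date, base_dir=CURATED_DIR):
    """Build the dated BOLD file path. Assumption: {date} is YYYY-MM-DD."""
    return Path(base_dir) / f"{run_date.isoformat()}_updated_BOLD_data.csv"


def latest_bold_file(base_dir=CURATED_DIR):
    """Return the newest dated BOLD file, or None if there is none.

    ISO dates sort chronologically as plain strings, so sorting the
    file names is enough under the assumption above.
    """
    candidates = sorted(Path(base_dir).glob("*_updated_BOLD_data.csv"))
    return candidates[-1] if candidates else None
```

If the workflow uses a different date format, only the formatting call in `bold_output_path` would need to change, but the name-based sort in `latest_bold_file` would no longer be reliable.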
- All steps log to files in logs/
- Failed steps retain partial outputs for inspection
- Use --rerun-incomplete to restart failed jobs
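A script called from a rule could write its log along the lines below. This is only a sketch: the one-file-per-step naming (`logs/<step>.log`) and the message format are assumptions, and the workflow's own scripts may set up logging differently.

```python
import logging
from pathlib import Path


def get_step_logger(step_name, log_dir="logs"):
    """Return a logger that appends to <log_dir>/<step_name>.log."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(step_name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding a second handler on repeat calls
        handler = logging.FileHandler(Path(log_dir) / f"{step_name}.log")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A step would then call, for example, `get_step_logger("combine_lists").info("started")` at the points worth recording.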
Edit config/config.yaml to modify:
- File paths
- API settings
- Processing parameters
- Fork the repository
- Create a feature branch
- Submit a pull request
File issues on the project's issue tracker.
- Fabian Deister - SNSB
- Rutger Vos - Naturalis
This workflow builds on work by Fabian Deister, SNSB.