[Do not merge!] Pseudo PR for first release #8
base: TEMPLATE
Conversation
Co-authored-by: Fabian Lehmann <[email protected]>
Co-authored-by: David Frantz <[email protected]>
(e.g. nf-core github actions)
Important! Template update for nf-core/tools v2.14.1
Very nice work! Almost there 🤏🏻
I think an option to resolve symlinks and copy the actual files would be good for reproducibility.
I have a feeling the modules could rely more on the strengths of Nextflow, e.g. many have for-loops over files; these should be separate Nextflow jobs imo.
README.md
Outdated
```bash
nextflow run nf-core/rangeland \
nextflow run nf-core/rangeland/main.nf \
```
Suggested change:

```diff
-nextflow run nf-core/rangeland/main.nf \
+nextflow run nf-core/rangeland \
```
bin/merge_boa.r
Outdated
```r
args = commandArgs(trailingOnly=TRUE)

if (length(args) < 3) {
  stop("\nthis program needs at least 3 inputs\n1: output filename\n2-*: input files", call.=FALSE)
}

fout <- args[1]
finp <- args[2:length(args)]
nf <- length(finp)

require(raster)
```
Suggested change:

```diff
-args = commandArgs(trailingOnly=TRUE)
-if (length(args) < 3) {
-  stop("\nthis program needs at least 3 inputs\n1: output filename\n2-*: input files", call.=FALSE)
-}
-fout <- args[1]
-finp <- args[2:length(args)]
-nf <- length(finp)
-require(raster)
+require(raster)
+args = commandArgs(trailingOnly=TRUE)
+if (length(args) < 3) {
+  stop("\nthis program needs at least 3 inputs\n1: output filename\n2-*: input files", call.=FALSE)
+}
+fout <- args[1]
+finp <- args[2:length(args)]
+nf <- length(finp)
```
At least in genomics it is standard to load the libraries at the beginning of an R script.
bin/merge_boa.r
Outdated
```r
for (i in 1:nf){

  data <- brick(finp[i])[]

  num <- num + !is.na(data)

  data[is.na(data)] <- 0
  sum <- sum + data

}
```
How large is `nf` here usually? For larger `nf`, try to use `apply` instead of a for-loop to improve performance.
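For illustration, a minimal sketch of what such an `apply`-style rewrite could look like (not part of the PR; it reuses `finp` and the raster package from `merge_boa.r`, and reads all bricks up front, so memory use grows with `nf`):

```r
# Sketch only: computes the same num/sum as the for-loop above, using lapply/Reduce.
library(raster)

layers <- lapply(finp, function(f) brick(f)[])                             # one pixel matrix per input file
num    <- Reduce(`+`, lapply(layers, function(x) !is.na(x)))               # per-pixel count of valid observations
sum    <- Reduce(`+`, lapply(layers, function(x) { x[is.na(x)] <- 0; x })) # per-pixel NA-safe sum
```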
This highly depends on the type, size and overlap of the pipeline's input data. It may become >100 for some extreme cases, but for our currently used data its usually between 5 and 20. The merge scripts are mostly untouched from the previous (not nf-core) installation of this pipeline. I will rework them and also include the other changes you suggested.
bin/merge_qai.r
Outdated
```r
args = commandArgs(trailingOnly=TRUE)

if (length(args) < 3) {
  stop("\nthis program needs at least 3 inputs\n1: output filename\n2-*: input files", call.=FALSE)
}

fout <- args[1]
finp <- args[2:length(args)]
nf <- length(finp)

require(raster)
```
Suggested change:

```diff
-args = commandArgs(trailingOnly=TRUE)
-if (length(args) < 3) {
-  stop("\nthis program needs at least 3 inputs\n1: output filename\n2-*: input files", call.=FALSE)
-}
-fout <- args[1]
-finp <- args[2:length(args)]
-nf <- length(finp)
-require(raster)
+require(raster)
+args = commandArgs(trailingOnly=TRUE)
+if (length(args) < 3) {
+  stop("\nthis program needs at least 3 inputs\n1: output filename\n2-*: input files", call.=FALSE)
+}
+fout <- args[1]
+finp <- args[2:length(args)]
+nf <- length(finp)
```
@@ -0,0 +1,44 @@

```r
#!/usr/bin/env Rscript
```
Maybe add a short comment block at the beginning of your custom scripts explaining what they do.
docs/usage.md
Outdated
```
--resolution '[integer]'
```

The default value is 30, as most Landsat satellite natively provide this resolution.
Suggested change:

```diff
-The default value is 30, as most Landsat satellite natively provide this resolution.
+The default value is `30`, as most Landsat satellite natively provide this resolution.
```
docs/usage.md
Outdated
```
--end_date '[YYYY-MM-DD]'
```

Default values are `'1984-01-01'` for the start date and `'2006-12-31'` for the end date.
Suggested change (remove the line):

```diff
-Default values are `'1984-01-01'` for the start date and `'2006-12-31'` for the end date.
```
We show the default values on the parameters page, so it is easier to keep the docs in sync by only having them in one place (if no further explanation of the choice of default values is given).
docs/usage.md
Outdated
### Group size

The `group_size` parameters can be ignored in most cases. It defines how many satellite scenes are processed together. The parameters is used to balance the tradeoff between I/O and computational capacities on individual compute nodes. By default, `group_size` is set to 100.
Suggested change:

```diff
-The `group_size` parameters can be ignored in most cases. It defines how many satellite scenes are processed together. The parameters is used to balance the tradeoff between I/O and computational capacities on individual compute nodes. By default, `group_size` is set to 100.
+The `group_size` parameters can be ignored in most cases. It defines how many satellite scenes are processed together. The parameters is used to balance the tradeoff between I/O and computational capacities on individual compute nodes. By default, `group_size` is set to `100`.
```
docs/usage.md
Outdated
### Visualization

The workflow provides two types of results visualization and aggregation. The fine grained mosaic visualization contains all time series analyses results for all tiles in the original resolution. Pyramid visualizations present a broad overview of the same data but at a lower resolution. Both visualizations can be enabled or disabled using the parameters `mosaic_visualization` and `pyramid_visualization`. By default, both visualization methods are enabled. Note that the mosaic visualization is required to be enabled when using the `test` and `test_full` profiles to allow the pipeline to check the correctness of its results (this is the default behavior, make sure to not disable mosaic when using test profiles) .
Suggested change:

```diff
-The workflow provides two types of results visualization and aggregation. The fine grained mosaic visualization contains all time series analyses results for all tiles in the original resolution. Pyramid visualizations present a broad overview of the same data but at a lower resolution. Both visualizations can be enabled or disabled using the parameters `mosaic_visualization` and `pyramid_visualization`. By default, both visualization methods are enabled. Note that the mosaic visualization is required to be enabled when using the `test` and `test_full` profiles to allow the pipeline to check the correctness of its results (this is the default behavior, make sure to not disable mosaic when using test profiles) .
+The workflow provides two types of results visualization and aggregation. The fine grained mosaic visualization contains all time series analyses results for all tiles in the original resolution. Pyramid visualizations present a broad overview of the same data but at a lower resolution. Both visualizations can be enabled or disabled using the parameters `mosaic_visualization` and `pyramid_visualization`. By default, both visualization methods are enabled. Note that the mosaic visualization is required to be enabled when using the `test` and `test_full` profiles to allow the pipeline to check the correctness of its results.
```
docs/usage.md
Outdated
```
pyramid_visualization = '[boolean]'
```

### FORCE configuration
This section can be removed if `task.cpus` is used instead.
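For illustration, a minimal sketch of the `task.cpus` approach (the process name, tool and flag below are placeholders, not the pipeline's actual module — the point is that `task.cpus` inside the script always reflects the CPUs granted by the executor/config, so no dedicated FORCE-threads parameter is needed):

```nextflow
process SOME_FORCE_STEP {
    label 'process_high'

    input:
    path param_file

    output:
    path 'out/**', emit: results

    script:
    """
    mkdir -p out
    # hypothetical tool/flag; wire task.cpus into however the tool expects its thread count
    some-force-tool --threads ${task.cpus} --param ${param_file} --output out/
    """
}
```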
In addition to what @mashehu already said:
- Adding FORCE to bioconda would not only allow for more versatile environment definitions in the pipeline, but would also allow users to install your tool without having to compile it. If you need assistance with that, feel free to reach out to me or the #bioconda channel on slack.
- The pipeline encodes the information that we usually handle via the meta map as directory and file names. This works, but it is less extendable and harder to debug than the meta map, which allows storing an arbitrary number of named meta fields.
But looks already pretty good!
```nextflow
label 'process_single'

container "docker.io/davidfrantz/force:3.7.10"
```
Would it be possible for you to add FORCE to bioconda? This is easier than one would think; I added a module with a similar installation process recently via this PR. This way, we could have all installation modalities (conda, singularity, docker) easily available (as bioconda packages are automatically added to biocontainers).
FORCE is not bioinformatics, so it is out of scope for bioconda. We are relaxing this requirement for now for non-biology pipelines.
nextflow.config
Outdated
```nextflow
apptainer.enabled = false
docker.runOptions = '-u $(id -u):$(id -g)'
docker.enabled = true
docker.userEmulation = true
```
`docker.userEmulation` is not supported any more in the latest versions of Nextflow - I think it is already not supported in 23.04.0, which is the oldest Nextflow version that this pipeline is supposed to run on.
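A minimal sketch of the corresponding fix (assumption: the explicit `runOptions` user mapping already covers what `userEmulation` did, so the deprecated option can simply be dropped):

```nextflow
// docker profile in nextflow.config - sketch, not the pipeline's exact profile block
docker {
    enabled    = true
    runOptions = '-u $(id -u):$(id -g)'   // map the container user to the host user
    // userEmulation = true               // removed: reportedly unsupported from Nextflow 23.04.0 onwards
}
```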
modules/local/check_results.nf
Outdated
@@ -0,0 +1,41 @@

```nextflow
nextflow.enable.dsl = 2

process CHECK_RESULTS {
```
This process is missing a `tag` directive.
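For example, something along these lines (the tag value is an assumption — anything that identifies the unit of work in the execution log works):

```nextflow
process CHECK_RESULTS {
    tag "trend-check"                 // assumption: pick whatever best identifies the work item
    label 'process_low'

    container 'docker.io/rocker/geospatial:4.3.1'

    // ... inputs, outputs and script unchanged
}
```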
modules/local/check_results.nf
Outdated
```nextflow
label 'process_low'

container 'docker.io/rocker/geospatial:4.3.1'
```
As far as I can see, the only package used from the `geospatial` image is `terra`. The corresponding R package is already on conda-forge, so I guess adding it to Bioconda would be redundant. But we can create images using Seqera Containers, which gave the following:

- Docker: `community.wave.seqera.io/library/r-terra:1.7-71--57cecb7a052577e0`
- Singularity: `oras://community.wave.seqera.io/library/r-terra:1.7-71--bbada5308a9d09c7`
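For reference, a sketch of how those two URIs could be plugged into the module's `container` directive (simplified engine check; nf-core modules usually use a slightly more elaborate guard):

```nextflow
container "${ workflow.containerEngine == 'singularity' ?
    'oras://community.wave.seqera.io/library/r-terra:1.7-71--bbada5308a9d09c7' :
    'community.wave.seqera.io/library/r-terra:1.7-71--57cecb7a052577e0' }"
```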
```nextflow
script:
"""
force-tile-extent $aoi tmp/ tile_allow.txt
```
It will be created by `path 'tmp/datacube-definition.prj'` (line 11).
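In other words, the relevant part of the module looks roughly like this (sketch, surrounding process definition omitted) — staging the cube definition under `tmp/` is what creates the directory before `force-tile-extent` runs:

```nextflow
input:
path aoi
path 'tmp/datacube-definition.prj'   // staged into tmp/, so the directory exists in the work dir

script:
"""
force-tile-extent $aoi tmp/ tile_allow.txt
"""
```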
```nextflow
ch_versions = ch_versions.mix(FORCE_PREPROCESS.out.versions.first())

// Group by tile, date and sensor
boa_tiles = FORCE_PREPROCESS.out.boa_tiles.flatten().map{ [ "${extractDirectory(it)}_${it.simpleName}", it ] }.groupTuple()
```
In nf-core we usually don't encode information as directory/file names, but instead use a meta map.
I'm aware of that. The reason we decided to maintain the name-encoded information is that it is the common approach in remote sensing and somewhat expected by FORCE. I will look into switching to meta maps.
I think it's fine to encode it in file names if it's common/the standard in the field - I would say it's not a blocker for this release.
But if that's the case, I think it would be important to add validation checks to ensure that the file name structure is exactly as expected for the pipeline.
But of course it wouldn't hurt to copy information like that into a meta map to accompany the files through the pipeline.
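For illustration, a minimal sketch of what copying that information into a meta map could look like (field names are assumptions; `extractDirectory` and the grouping logic are taken from the snippet discussed above):

```nextflow
// Sketch only: keep the file-name-encoded information, but also expose it as named meta fields.
boa_tiles = FORCE_PREPROCESS.out.boa_tiles
    .flatten()
    .map { f ->
        def meta = [ tile: extractDirectory(f), id: f.simpleName ]   // assumed field names
        [ meta, f ]
    }
    .groupTuple()   // maps with equal contents group together, like the string key before
```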
…essing results, some minor adjustments in the docs.
Overall really good! You've done a great job at sticking with nf-core structure/guidelines despite the different fields!
A few things I also noticed:

- Try to stick to nf-core guidelines for things such as module structure, even when they are local.
- I would highly recommend adding more validation checks to your input nextflow schema: `patterns:` in the nextflow_schema to get better user validation (e.g., file suffix checks, or strings with delimiters - regex is your friend), and `exists` for all required files (see the schema sketch after this comment).
- Missing a CHANGELOG update, even if it just says 'first release'.
- For the modules with loops inside, I strongly recommend, as @mashehu pointed out, considering parallelising these where you can using Nextflow (or at least with bash), otherwise it's not maximising the benefits of the language.

P.S. I vaguely remember commenting about removing MultiQC from somewhere, please ignore it - I just remembered we need it for software version reporting :)
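For illustration, a hypothetical `nextflow_schema.json` fragment showing the kind of checks meant in the validation point above (the parameter name and suffix pattern are made up):

```json
{
  "some_input_file": {
    "type": "string",
    "format": "file-path",
    "exists": true,
    "pattern": "^\\S+\\.(tif|tiff)$",
    "description": "Example file parameter with an existence check and a suffix pattern"
  }
}
```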
1. Read satellite imagery, digital elevation model, endmember definition, water vapor database and area of interest definition
2. Generate allow list and analysis mask to determine which pixels from the satellite data can be used
3. Preprocess data to obtain atmospherically corrected images alongside quality assurance information
4. Classify pixels by applying linear spectral unmixing
5. Time series analyses to obtain trends in vegetation dynamics
6. Create mosaic and pyramid visualizations of the results
Not a requirement, but a diagram would be nice here :) (it also helps non-expert reviewers to follow what they are meant to be assessing :)
@@ -0,0 +1,44 @@

```r
#!/usr/bin/env Rscript
```
Please add a license and author to all of these scripts.
If you have no preference, just put MIT as the license and point to the pipeline's repo for the file itself.
Example (although style ripped off from @edmundmiller): https://github.com/nf-core/funcscan/blob/2a5c32a4dd72ed18b53a797af7bd8e11694af9e1/bin/ampcombi_download.py#L3-L10
simpler example: https://github.com/nf-core/mag/blob/master/bin/combine_tables.py#L3-L4
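Something like the following at the top of each script would do (description, author and license lines are placeholders to adapt):

```r
#!/usr/bin/env Rscript

# merge_boa.r
#
# <one-line description of what the script does>
#
# Author: <author name / the nf-core/rangeland contributors>
# License: MIT - see the nf-core/rangeland repository for the full license text
```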
conf/modules.config
Outdated
```nextflow
errorStrategy = 'retry'
maxRetries = 5
```
Generally stuff like process execution information goes in `base.config`.
@mashehu (as the other reviewer), what do you think here?
The reason why I say this is that `modules.config` can be more easily overwritten due to config loading order (and this is OK because file naming/locations are more often customisable by a user), whereas stuff like retry or `maxRetries` defaults you probably want secured as the 'fall back' behaviour.
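A minimal sketch of what that could look like in `conf/base.config` (the process selector is an assumption — scope it to whichever processes actually need the retries):

```nextflow
// conf/base.config - keep retry behaviour here as the secure fall-back,
// leaving conf/modules.config for publishing/naming options users may override.
process {
    withName: 'FORCE_PREPROCESS' {    // assumed selector
        errorStrategy = 'retry'
        maxRetries    = 5
    }
}
```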
```nextflow
publishDir = [
    [
        path: { "${params.outdir}/trend/pyramid/" },
        saveAs: { "${it.substring(12,it.indexOf("."))}/trend/${it.substring(0,11)}/$it" },
```
This seems like it has the potential to be a bit brittle (coming from bioinformatics, where file naming can be a wild west). What is the full `it` string? Is the output file name hardcoded, or does it have the potential to be dynamic?
The file names are standardized by FORCE up to the exact digit positions (e.g. https://force-eo.readthedocs.io/en/latest/components/lower-level/level2/format.html#naming-convention). The name won't dynamically change. This is another example of path/name-encoded information in earth observation.
conf/test.config
Outdated
```nextflow
sensors_level1 = 'LT04,LT05'
sensors_level2 = 'LND04 LND05'
```
Is it correct that these have a different delimiter?
Thanks for mentioning that. We actually don't need the first parameter any more. That's a remnant of a prior version of the workflow, where `sensors_level1` was used in a CLI command to download some input data, hence the different delimiter. I will remove the first parameter.
```nextflow
.toSortedList{ a,b -> a[1][0].simpleName <=> b[1][0].simpleName }
.flatMap{it}
.groupTuple( remainder : true, size : params.group_size ).map{ [ it[0], it[1] .flatten() ] }
qai_tiles_to_merge = qai_tiles.filter{ x -> x[1].size() > 1 }
```
Suggested change:

```diff
-qai_tiles_to_merge = qai_tiles.filter{ x -> x[1].size() > 1 }
+qai_tiles_to_merge = qai_tiles.filter{ x -> x[1].size() > 1 }
```
workflows/rangeland.nf
Outdated
```nextflow
*/

// check wether provided input is within provided time range
```
Suggested change:

```diff
-// check wether provided input is within provided time range
+// check whether provided input is within provided time range
```
workflows/rangeland.nf
Outdated
```nextflow
cube_file = file( "$params.data_cube" )
aoi_file = file( "$params.aoi" )
endmember_file = file( "$params.endmember" )
```
Suggested change:

```diff
-cube_file = file( "$params.data_cube" )
-aoi_file = file( "$params.aoi" )
-endmember_file = file( "$params.endmember" )
+cube_file = file( params.data_cube )
+aoi_file = file( params.aoi )
+endmember_file = file( params.endmember )
```
workflows/rangeland.nf
Outdated
```nextflow
data = base_path.map(it -> file("$it/*/*", type: 'dir')).flatten()
data = data.flatten().filter{ inRegion(it) }
```
Is `flatten` necessary on both lines?
No, I'll remove the redundancy.
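i.e. something along the lines of (sketch based on the two lines above):

```nextflow
// one flatten is enough: map to the per-directory glob, flatten once, then filter
data = base_path
    .map { it -> file("$it/*/*", type: 'dir') }
    .flatten()
    .filter { inRegion(it) }
```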
workflows/rangeland.nf
Outdated
```nextflow
if (params.config_profile_name == 'Test profile') {
    woody_change_ref = file("$params.woody_change_ref")
    woody_yoc_ref = file("$params.woody_yoc_ref")
    herbaceous_change_ref = file("$params.herbaceous_change_ref")
    herbaceous_yoc_ref = file("$params.herbaceous_yoc_ref")
    peak_change_ref = file("$params.peak_change_ref")
    peak_yoc_ref = file("$params.peak_yoc_ref")

    CHECK_RESULTS(grouped_trend_data, woody_change_ref, woody_yoc_ref, herbaceous_change_ref, herbaceous_yoc_ref, peak_change_ref, peak_yoc_ref)
    ch_versions = ch_versions.mix(CHECK_RESULTS.out.versions)
}

if (params.config_profile_name == 'Full test profile') {
    UNTAR_REF([[:], params.reference])
    ref_path = UNTAR_REF.out.untar.map(it -> it[1])
    tar_versions.mix(UNTAR_REF.out.versions)

    CHECK_RESULTS_FULL(grouped_trend_data, ref_path)
    ch_versions = ch_versions.mix(CHECK_RESULTS_FULL.out.versions)
}
```
You should not embed test-specific code within the pipeline itself (it's not particularly realistic); for this you should add `nf-test` to the pipeline and use that for a more structured/standardised approach.
A few pipelines now have this (ampliseq, rnaseq, etc.) but if you need pointers let me know.
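For illustration, a minimal sketch of what a pipeline-level nf-test could look like (file name and assertion are assumptions; the test profile is supplied via `nf-test.config` or on the command line):

```groovy
// tests/main.nf.test - sketch only
nextflow_pipeline {

    name "nf-core/rangeland: minimal pipeline-level test"
    script "../main.nf"

    test("runs to completion with test data") {

        when {
            params {
                outdir = "$outputDir"   // outputDir is provided by nf-test
            }
        }

        then {
            assert workflow.success
        }
    }
}
```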
Important! Template update for nf-core/tools v3.0.2
Co-authored-by: Matthias Hörtenhuber <[email protected]>
Addressing review suggestions for Version 1.0.0
Do not merge! This is a PR of dev compared to the first release for whole-pipeline reviewing purposes. Changes should be made to `dev`, and this PR should not be merged into `first-commit-for-pseudo-pr`!