
[Do not merge!] Pseudo PR for first release #8

Open · wants to merge 65 commits into base: TEMPLATE
Conversation


@mashehu mashehu commented Aug 5, 2024

Do not merge! This is a PR of dev compared to first release for whole-pipeline reviewing purposes. Changes should be made to dev and this PR should not be merged into first-commit-for-pseudo-pr!


github-actions bot commented Aug 5, 2024

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit caee766

✅ 202 tests passed
❔   2 tests were ignored
❗   4 tests had warnings

❗ Test warnings:

  • nextflow_config - Config manifest.version should end in dev: 1.0.0
  • readme - README contains the placeholder zenodo.XXXXXXX. This should be replaced with the zenodo doi (after the first release).
  • pipeline_todos - TODO string in README.md: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file.
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline

❔ Tests ignored:

  • files_exist - File is ignored: conf/igenomes.config
  • files_exist - File is ignored: conf/igenomes_ignored.config

✅ Tests passed:

Run details

  • nf-core/tools version 3.0.2
  • Run at 2024-11-26 15:12:05

@mashehu (Author) left a comment:

Very nice work! Almost there 🤏🏻

I think an option to resolve symlinks and copy the actual files would be good for reproducibility.
I have a feeling the modules could rely more on the strengths of Nextflow; e.g., many have for loops over files, and these should be separate Nextflow jobs imo (see the sketch below).
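A minimal sketch of that fan-out pattern, with hypothetical process and script names (this is not the pipeline's actual code):

```groovy
// Instead of looping over N files inside one task, emit one task per file
// and merge the per-file results in a single reduce step at the end.
workflow FAN_OUT_EXAMPLE {
    files_ch = Channel.fromPath(params.inputs)   // hypothetical input glob

    PROCESS_ONE(files_ch)                        // one Nextflow job per file
    MERGE(PROCESS_ONE.out.collect())             // single merge job
}

process PROCESS_ONE {
    input:
    path f

    output:
    path "${f.baseName}.partial"

    script:
    """
    process_one_file.sh ${f} > ${f.baseName}.partial   # placeholder command
    """
}

process MERGE {
    input:
    path partials

    output:
    path "merged.out"

    script:
    """
    cat ${partials} > merged.out
    """
}
```

This lets the scheduler run the per-file tasks in parallel and retry them independently, which a shell loop inside a single task cannot do.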

README.md (Outdated)

```bash
nextflow run nf-core/rangeland/main.nf \
```

@mashehu (Author), suggested change:

```diff
- nextflow run nf-core/rangeland/main.nf \
+ nextflow run nf-core/rangeland \
```

bin/merge_boa.r (Outdated), comment on lines 2 to 14

```r
args = commandArgs(trailingOnly=TRUE)

if (length(args) < 3) {
  stop("\nthis program needs at least 3 inputs\n1: output filename\n2-*: input files", call.=FALSE)
}

fout <- args[1]
finp <- args[2:length(args)]
nf <- length(finp)

require(raster)
```

@mashehu (Author), suggested change (load the library first):

```r
require(raster)

args = commandArgs(trailingOnly=TRUE)

if (length(args) < 3) {
  stop("\nthis program needs at least 3 inputs\n1: output filename\n2-*: input files", call.=FALSE)
}

fout <- args[1]
finp <- args[2:length(args)]
nf <- length(finp)
```

At least in genomics it is standard to load the libraries at the beginning of an R script.

bin/merge_boa.r (Outdated), comment on lines 25 to 34

```r
for (i in 1:nf){

  data <- brick(finp[i])[]

  num <- num + !is.na(data)

  data[is.na(data)] <- 0
  sum <- sum + data

}
```

@mashehu (Author):

How large is `nf` here usually? For larger `nf`, try to use `apply` instead of a for-loop to improve performance.

Collaborator:

This highly depends on the type, size, and overlap of the pipeline's input data. It may become >100 in some extreme cases, but for our currently used data it's usually between 5 and 20. The merge scripts are mostly untouched from the previous (non-nf-core) installation of this pipeline. I will rework them and also include the other changes you suggested.

bin/merge_qai.r (Outdated), comment on lines 2 to 14

```r
args = commandArgs(trailingOnly=TRUE)

if (length(args) < 3) {
  stop("\nthis program needs at least 3 inputs\n1: output filename\n2-*: input files", call.=FALSE)
}

fout <- args[1]
finp <- args[2:length(args)]
nf <- length(finp)

require(raster)
```

@mashehu (Author), suggested change (as in `bin/merge_boa.r`, load the library first):

```r
require(raster)

args = commandArgs(trailingOnly=TRUE)

if (length(args) < 3) {
  stop("\nthis program needs at least 3 inputs\n1: output filename\n2-*: input files", call.=FALSE)
}

fout <- args[1]
finp <- args[2:length(args)]
nf <- length(finp)
```

@@ -0,0 +1,44 @@
#!/usr/bin/env Rscript

@mashehu (Author):

Maybe add a short comment block at the beginning of your custom scripts explaining what they do.

docs/usage.md (Outdated)

```bash
--resolution '[integer]'
```

The default value is 30, as most Landsat satellites natively provide this resolution.

@mashehu (Author), suggested change:

```diff
- The default value is 30, as most Landsat satellites natively provide this resolution.
+ The default value is `30`, as most Landsat satellites natively provide this resolution.
```

docs/usage.md (Outdated)

```bash
--end_date '[YYYY-MM-DD]'
```

Default values are `'1984-01-01'` for the start date and `'2006-12-31'` for the end date.

@mashehu (Author), suggested change (remove the line):

```diff
- Default values are `'1984-01-01'` for the start date and `'2006-12-31'` for the end date.
```

We show the default values on the parameters page, so it is easier to keep the docs in sync by having them in only one place (if no further explanation of the choice of default values is given).

docs/usage.md (Outdated)

### Group size

The `group_size` parameter can be ignored in most cases. It defines how many satellite scenes are processed together and is used to balance the tradeoff between I/O and computational capacities on individual compute nodes. By default, `group_size` is set to 100.

@mashehu (Author), suggested change:

```diff
- By default, `group_size` is set to 100.
+ By default, `group_size` is set to `100`.
```

docs/usage.md (Outdated)

### Visualization

The workflow provides two types of results visualization and aggregation. The fine-grained mosaic visualization contains all time series analysis results for all tiles at the original resolution. Pyramid visualizations present a broad overview of the same data at a lower resolution. Both visualizations can be enabled or disabled using the parameters `mosaic_visualization` and `pyramid_visualization`. By default, both visualization methods are enabled. Note that the mosaic visualization must be enabled when using the `test` and `test_full` profiles to allow the pipeline to check the correctness of its results (this is the default behavior; make sure not to disable mosaic when using test profiles).

@mashehu (Author), suggested change (drop the redundant parenthetical):

```diff
- ... check the correctness of its results (this is the default behavior, make sure to not disable mosaic when using test profiles) .
+ ... check the correctness of its results.
```

docs/usage.md (Outdated)

```bash
pyramid_visualization = '[boolean]'
```

### FORCE configuration

@mashehu (Author):

This section can be removed if `task.cpus` is used instead (sketch below).
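A hedged sketch of the `task.cpus` approach; the process name and the FORCE parameter-file key are illustrative, so check the FORCE docs for the exact keys:

```groovy
process FORCE_HIGHER_LEVEL {
    // hypothetical module: thread count comes from Nextflow's resource config
    label 'process_high'

    input:
    path param_file

    script:
    """
    # Assumption: the FORCE parameter file exposes a thread-count key.
    # Wiring it to task.cpus removes the need for a user-facing CPU parameter.
    sed "s/^NTHREAD_COMPUTE .*/NTHREAD_COMPUTE ${task.cpus}/" ${param_file} > local.prm
    force-higher-level local.prm
    """
}
```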

@nictru left a comment:

In addition to what @mashehu already said:

  • Adding FORCE to bioconda would not only allow for more versatile environment definitions in the pipeline, but would also let users install your tool without having to compile it. If you need assistance with that, feel free to reach out to me or the #bioconda channel on Slack.
  • The pipeline encodes the information that we usually handle via the meta map as directory and file names. This works, but it is less extensible and harder to debug than the meta map, which can store an arbitrary number of named meta fields.

But looks already pretty good!


```groovy
label 'process_single'

container "docker.io/davidfrantz/force:3.7.10"
```
@nictru:
Would it be possible for you to add FORCE to bioconda? This is easier than one would think; I added a module with a similar installation process recently via this PR. This way, we could have all installation modalities (Conda, Singularity, Docker) easily available, as bioconda packages are automatically added to BioContainers.

Member:
FORCE is not bioinformatics, so it is out of scope for bioconda. We are relaxing this requirement for now for non-biology pipelines.

nextflow.config (Outdated)

```groovy
apptainer.enabled = false
docker.runOptions = '-u $(id -u):$(id -g)'
docker.enabled = true
docker.userEmulation = true
```
`docker.userEmulation` is not supported anymore in the latest versions of Nextflow; I think it is already unsupported in 23.04.0, which is the oldest Nextflow version that this pipeline is supposed to run on.
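A minimal sketch of the fix, assuming the explicit UID mapping shown above is kept and the unsupported option is simply dropped:

```groovy
// nextflow.config sketch: explicit runOptions replace docker.userEmulation
docker.enabled    = true
docker.runOptions = '-u $(id -u):$(id -g)'
// docker.userEmulation removed: no longer supported by recent Nextflow releases
```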

@@ -0,0 +1,41 @@
```groovy
nextflow.enable.dsl = 2

process CHECK_RESULTS {
```
This process is missing a `tag` directive.
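For illustration, a hedged sketch of the missing directive; the tag value is a placeholder, so use whatever best identifies a task instance:

```groovy
process CHECK_RESULTS {
    tag "${woody_change_ref.baseName}"   // placeholder tag value
    label 'process_low'
    // ...
}
```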


```groovy
label 'process_low'

container 'docker.io/rocker/geospatial:4.3.1'
```
As far as I can see, the only package used from the geospatial image is `terra`. The corresponding R package is already on conda-forge, so I guess adding it to bioconda would be redundant. But we can create images using Seqera Containers, which gave the following (a wiring sketch follows the list):

  • Docker: community.wave.seqera.io/library/r-terra:1.7-71--57cecb7a052577e0
  • Singularity: oras://community.wave.seqera.io/library/r-terra:1.7-71--bbada5308a9d09c7
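For reference, a sketch of how those images could be wired in, following the usual nf-core container-selection pattern (the module body is elided):

```groovy
process CHECK_RESULTS {
    label 'process_low'

    // pick the Seqera-built image that matches the active container engine
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'oras://community.wave.seqera.io/library/r-terra:1.7-71--bbada5308a9d09c7' :
        'community.wave.seqera.io/library/r-terra:1.7-71--57cecb7a052577e0' }"

    // ...
}
```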


```groovy
script:
"""
force-tile-extent $aoi tmp/ tile_allow.txt
```
It will be created by `path 'tmp/datacube-definition.prj'` (line 11).

```groovy
ch_versions = ch_versions.mix(FORCE_PREPROCESS.out.versions.first())

// Group by tile, date and sensor
boa_tiles = FORCE_PREPROCESS.out.boa_tiles.flatten().map{ [ "${extractDirectory(it)}_${it.simpleName}", it ] }.groupTuple()
```
In nf-core we usually don't encode information in directory/file names, but instead use a meta map (sketch below).
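A hedged sketch of that convention applied to the channel shown above; the meta field names are hypothetical:

```groovy
// carry tile and scene identifiers as named meta fields instead of a
// composite string key encoded in the file name
boa_tiles = FORCE_PREPROCESS.out.boa_tiles
    .flatten()
    .map { f ->
        def meta = [ tile: extractDirectory(f), id: f.simpleName ]   // hypothetical fields
        [ meta, f ]
    }
    .groupTuple()
```

`groupTuple` groups on the whole meta map, so tuples with identical `tile` and `id` still end up together, and downstream processes can read `meta.tile` instead of parsing paths.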

Collaborator:
I'm aware of that. The reason we decided to keep the name-encoded information is that it is the common approach in remote sensing and is somewhat expected by FORCE. I will look into switching to meta maps.

Member:
I think it's fine to encode it in file names if it's common/the standard in the field - I would say it's not a blocker for this release.

But if that's the case, I think it would be important to add validation checks to ensure that the file name structure is exactly as expected for the pipeline.

But of course it wouldn't hurt to copy information like that into a meta map to accompany the files through the pipeline.

@jfy133 (Member) left a comment:
Overall really good! You've done a great job of sticking to nf-core structure/guidelines despite coming from a different field!

A few things I also noticed:

  • Try to stick to nf-core guidelines for things such as module structure, even when the modules are local.

  • I would highly recommend adding more validation checks to your input Nextflow schema:

    • `pattern` entries in the `nextflow_schema.json` for better user validation (e.g., file-suffix checks, or strings with delimiters; regex is your friend)
    • `exists` for all required files
  • Missing a CHANGELOG update, even if it just says 'first release'.

  • For the modules with loops inside, I strongly recommend, as @mashehu pointed out, parallelising these where you can using Nextflow (or at least bash); otherwise the pipeline is not maximising the benefits of the language.

    P.S. I vaguely remember commenting about removing MultiQC somewhere; please ignore it, I just remembered we need it for software version reporting :)

Comment on lines +24 to +29
1. Read satellite imagery, digital elevation model, endmember definition, water vapor database and area of interest definition
2. Generate allow list and analysis mask to determine which pixels from the satellite data can be used
3. Preprocess data to obtain atmospherically corrected images alongside quality assurance information
4. Classify pixels by applying linear spectral unmixing
5. Time series analyses to obtain trends in vegetation dynamics
6. Create mosaic and pyramid visualizations of the results
Member:
Not a requirement, but a diagram would be nice here :) (it also helps non-expert reviewers follow what they are meant to be assessing :)

@@ -0,0 +1,44 @@
#!/usr/bin/env Rscript
Member:
Please add a license and author to all of these scripts.

If you have no preference, just put MIT as the license and point to the pipeline's repo for the file itself.

Example (although style ripped off from @edmundmiller): https://github.com/nf-core/funcscan/blob/2a5c32a4dd72ed18b53a797af7bd8e11694af9e1/bin/ampcombi_download.py#L3-L10

simpler example: https://github.com/nf-core/mag/blob/master/bin/combine_tables.py#L3-L4

Comment on lines 38 to 39

```groovy
errorStrategy = 'retry'
maxRetries = 5
```
Member:
Generally, stuff like process execution information goes in `base.config`. @mashehu (as the other reviewer), what do you think here?

Member:
The reason I say this is that `modules.config` can be more easily overwritten due to config loading order (and this is OK, because file naming/locations are more often customised by a user), whereas defaults like retrying or `maxRetries` you probably want secured as the fall-back behaviour (see the sketch below).
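A minimal sketch of that placement, reusing the values from the diff above:

```groovy
// conf/base.config sketch: retry behaviour as the pipeline-wide fall-back
process {
    errorStrategy = 'retry'
    maxRetries    = 5
}
```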

```groovy
publishDir = [
    [
        path: { "${params.outdir}/trend/pyramid/" },
        saveAs: { "${it.substring(12,it.indexOf("."))}/trend/${it.substring(0,11)}/$it" },
```
Member:
This seems like it has the potential to be a bit brittle (coming from bioinformatics, where file naming can be a wild west). What is the full `it` string? Is the output file name hardcoded, or could it be dynamic?

Collaborator:
The file names are standardized by FORCE up to the exact digit positions (e.g. https://force-eo.readthedocs.io/en/latest/components/lower-level/level2/format.html#naming-convention). The name won't dynamically change. This is another example of path/name-encoded information in earth observation.

conf/test.config (Outdated), comment on lines 39 to 40

```groovy
sensors_level1 = 'LT04,LT05'
sensors_level2 = 'LND04 LND05'
```
Member:

Is it correct that these have different delimiters?

Collaborator:
Thanks for mentioning that. We actually don't need the first parameter anymore; it's a remnant of a prior version of the workflow where `sensors_level1` was used in a CLI command to download some input data, hence the different delimiter. I will remove the first parameter.

```groovy
    .toSortedList{ a,b -> a[1][0].simpleName <=> b[1][0].simpleName }
    .flatMap{it}
    .groupTuple( remainder : true, size : params.group_size ).map{ [ it[0], it[1].flatten() ] }
qai_tiles_to_merge = qai_tiles.filter{ x -> x[1].size() > 1 }
```

Member, suggested change (apparently an indentation-only change):

```diff
- qai_tiles_to_merge = qai_tiles.filter{ x -> x[1].size() > 1 }
+ qai_tiles_to_merge = qai_tiles.filter{ x -> x[1].size() > 1 }
```

```groovy
*/

// check wether provided input is within provided time range
```

Member, suggested change:

```diff
- // check wether provided input is within provided time range
+ // check whether provided input is within provided time range
```

Comment on lines 78 to 80

```groovy
cube_file = file( "$params.data_cube" )
aoi_file = file( "$params.aoi" )
endmember_file = file( "$params.endmember" )
```

Member, suggested change (the string interpolation is unnecessary):

```diff
- cube_file = file( "$params.data_cube" )
- aoi_file = file( "$params.aoi" )
- endmember_file = file( "$params.endmember" )
+ cube_file = file( params.data_cube )
+ aoi_file = file( params.aoi )
+ endmember_file = file( params.endmember )
```

Comment on lines 90 to 91

```groovy
data = base_path.map(it -> file("$it/*/*", type: 'dir')).flatten()
data = data.flatten().filter{ inRegion(it) }
```

Member:

Is `flatten` necessary on both lines?

Collaborator:

No, I'll remove the redundancy.

Comment on lines 135 to 154

```groovy
if (params.config_profile_name == 'Test profile') {
    woody_change_ref = file("$params.woody_change_ref")
    woody_yoc_ref = file("$params.woody_yoc_ref")
    herbaceous_change_ref = file("$params.herbaceous_change_ref")
    herbaceous_yoc_ref = file("$params.herbaceous_yoc_ref")
    peak_change_ref = file("$params.peak_change_ref")
    peak_yoc_ref = file("$params.peak_yoc_ref")

    CHECK_RESULTS(grouped_trend_data, woody_change_ref, woody_yoc_ref, herbaceous_change_ref, herbaceous_yoc_ref, peak_change_ref, peak_yoc_ref)
    ch_versions = ch_versions.mix(CHECK_RESULTS.out.versions)
}

if (params.config_profile_name == 'Full test profile') {
    UNTAR_REF([[:], params.reference])
    ref_path = UNTAR_REF.out.untar.map(it -> it[1])
    tar_versions.mix(UNTAR_REF.out.versions)

    CHECK_RESULTS_FULL(grouped_trend_data, ref_path)
    ch_versions = ch_versions.mix(CHECK_RESULTS_FULL.out.versions)
}
```
Member:
You should not embed test-specific code within the pipeline itself (it's not particularly realistic); instead, add nf-test to the pipeline and use it for a more structured/standardised approach (sketch below).

A few pipelines already have this (ampliseq, rnaseq, etc.); if you need pointers, let me know.
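A hedged nf-test sketch of what replacing the embedded checks could look like; the file layout, params, and assertions are illustrative:

```groovy
// tests/main.nf.test (hypothetical): pipeline-level test instead of in-pipeline checks
nextflow_pipeline {

    name "rangeland small-data test"
    script "../main.nf"

    test("test profile produces expected trend outputs") {

        when {
            params {
                outdir = "$outputDir"
            }
        }

        then {
            assert workflow.success
            // compare produced mosaics against the reference data here,
            // e.g. with md5-based assertions on files under "$outputDir/trend"
        }
    }
}
```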

Felix-Kummer and others added 30 commits September 18, 2024 16:25
Important! Template update for nf-core/tools v3.0.2
Co-authored-by: Matthias Hörtenhuber <[email protected]>
Adressing review suggestions for Version 1.0.0