Combine steps with pipes to reduce disk footprint and make the pipeline faster #78

Closed
muffato opened this issue Jul 2, 2023 · 3 comments · Fixed by #91
Assignees
Labels
enhancement Improvement of the existing features
Milestone

Comments

@muffato
Member

muffato commented Jul 2, 2023

Description of feature

The pipeline has a massive disk footprint (approx. 1.5 TB per Gbp of sequence) that is mostly caused by each step storing its output file, e.g. in the samtools sub-workflows.

We can often combine consecutive steps by piping the output of one step into the next one, thus removing the need to write an (often large) file to disk. This not only reduces the disk footprint, it can also make the pipeline much faster, especially when the filesystem is being heavily used.
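To illustrate the idea outside the pipeline itself, here is a minimal Python sketch of chaining commands so that each one's stdout feeds the next one's stdin without an intermediate file. The helper name `run_piped` is hypothetical and not part of the pipeline. The samtools chain in the comment follows the duplicate-marking workflow documented on htslib.org, with placeholder file names; it is an example of what could be combined, not what the pipeline actually runs.

```python
import subprocess
import sys


def run_piped(commands):
    """Run commands as a pipe: stdout of each feeds stdin of the next.

    Returns the bytes written to stdout by the last command.
    No intermediate file is written to disk.
    """
    procs = []
    upstream = None
    for argv in commands:
        proc = subprocess.Popen(argv, stdin=upstream, stdout=subprocess.PIPE)
        if upstream is not None:
            # Close our copy so the earlier process sees a broken pipe
            # if the later one exits early.
            upstream.close()
        upstream = proc.stdout
        procs.append(proc)

    # Read the final output, then collect every exit code.
    output, _ = procs[-1].communicate()
    codes = [proc.wait() for proc in procs]
    for argv, code in zip(commands, codes):
        if code != 0:
            raise subprocess.CalledProcessError(code, argv)
    return output


# Portable demo using the Python interpreter as both stages:
# stage 1 prints three lines, stage 2 sorts and de-duplicates them.
demo = run_piped([
    [sys.executable, "-c", "print('b\\na\\nb')"],
    [sys.executable, "-c",
     "import sys; sys.stdout.writelines(sorted(set(sys.stdin)))"],
])

# With samtools installed, the same helper could chain (file names are
# placeholders; the last step writes its own output file):
# run_piped([
#     ["samtools", "collate", "-O", "-u", "example.bam"],
#     ["samtools", "fixmate", "-m", "-u", "-", "-"],
#     ["samtools", "sort", "-u", "-"],
#     ["samtools", "markdup", "-", "markdup.bam"],
# ])
```

The sketch buffers the last command's stdout in memory, which is fine for a demo but not for large alignments; in the samtools case the final step writes the file itself, so nothing large passes through Python.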

@muffato muffato added the "enhancement" (Improvement of the existing features) and "maintain" (Tasks to keep pipelines up to date) labels Jul 2, 2023
@priyanka-surana
Contributor

It would involve going to something like nf-core/modules#3310. In fact this would be my step 1.

@muffato
Member Author

muffato commented Jul 18, 2023

I know you know the command already, but I just found it explained on the samtools website too: https://www.htslib.org/algorithms/duplicate.html#workflow

@muffato muffato self-assigned this Nov 6, 2023
@muffato muffato moved this from Todo to In Progress in Genome After Party Nov 6, 2023
@muffato muffato mentioned this issue Nov 6, 2023
9 tasks
@muffato
Member Author

muffato commented Dec 8, 2023

"samtools sormadup" is being implemented in #82 but to close this ticket we should review the whole pipeline.

@muffato muffato added this to the 1.3.0 milestone Feb 12, 2024
@muffato muffato removed the "enhancement" (Improvement of the existing features) label Jun 1, 2024
@muffato muffato mentioned this issue Jun 6, 2024
9 tasks
@muffato muffato linked a pull request Jun 6, 2024 that will close this issue
9 tasks
@muffato muffato added the "enhancement" (Improvement of the existing features) label and removed the "maintain" (Tasks to keep pipelines up to date) label Jun 17, 2024
@muffato muffato closed this as completed Jun 24, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Genome After Party Jun 24, 2024
Projects
Status: Done
2 participants