Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update bcftools pluginsplit #7013

Merged
merged 20 commits into from
Nov 21, 2024

Conversation

LouisLeNezet
Copy link
Contributor

@LouisLeNezet LouisLeNezet commented Nov 18, 2024

PR checklist

Here is a PR to update the BCFTOOLS_PLUGINSPLIT process.

The issue solved here was:

  • The process didn't work in the case of a single sample present in the file with an equivalent output name -> solved by letting the file in the $prefix folder

  • No possibility to add a suffix to the output files generated -> ext.suffix argument added

  • This comment contains a description of changes (with reason).

  • If you've fixed a bug or added code that should be tested, add tests!

  • If you've added a new tool - have you followed the module conventions in the contribution docs

  • If necessary, include test data in your PR.

  • Remove all TODO statements.

  • Emit the versions.yml file.

  • Follow the naming conventions.

  • Follow the parameters requirements.

  • Follow the input/output options guidelines.

  • Add a resource label

  • Use BioConda and BioContainers if possible to fulfil software requirements.

  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:

    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

Copy link
Contributor

@fellen31 fellen31 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand the first issue, do you have an input file with the same name as one of the samples in the file?

Why do you need to have suffixes?

@LouisLeNezet
Copy link
Contributor Author

I don't fully understand the first issue, do you have an input file with the same name as one of the samples in the file?

This issue happens in our phaseimpute pipeline. In some case the user give a single vcf, that might be named as sample_names.vcf.gz therefore when this vcf is splitted the process as it was would crash as the input name was equivalent to the new one.

Why do you need to have suffixes?

The suffix is used to correctly name the file for multiQC in our pipeline. It also helps to filter out and get back the meta information when you transpose the channel after the split operation.

@fellen31
Copy link
Contributor

Okay, thanks for the clarification.

Could you work with prefix instead of suffix? It seems easier to me and less error prone to add something to the beginning of the name instead of the middle, and prefix is not really used in this module anyway.

@LouisLeNezet
Copy link
Contributor Author

Could you work with prefix instead of suffix? It seems easier to me and less error prone to add something to the beginning of the name instead of the middle, and prefix is not really used in this module anyway.

This might work in my use case, I will try.

@LouisLeNezet
Copy link
Contributor Author

LouisLeNezet commented Nov 18, 2024

Could you work with prefix instead of suffix? It seems easier to me and less error prone to add something to the beginning of the name instead of the middle, and prefix is not really used in this module anyway.

This might work in my use case, I will try.

Unfortunately it doesn't work in my use case, as I need the sample names to be first, then a suffix and the extension last...
Would the following works for you ?

if [ -n "${suffix}" ]; then
    for file in ${prefix}/*; do
        # Extract the basename
        base_name=\$(basename "\$file")
        # Extract the extension
        extension=""
        # Remove the extension if it exists
        if [[ "\$base_name" =~ \\.(vcf|bcf)(\\.gz)?(\\.tbi|\\.csi)?\$ ]]; then
            extension="\${BASH_REMATCH[0]}"
            base_name="\${base_name%\$extension}"
        fi
        # Construct the new name
        new_name="\${base_name}${suffix}\${extension}"
        mv "\$file" "${prefix}/\$new_name"
    done
fi

@fellen31
Copy link
Contributor

This seems a lot more complicated than

for file in output_dir/*; do
    mv \$file $prefix\$FILE
done

Would it not more clean to adapt the logic in phaseimpute? Perhaps the passing the suffix in the meta? Or adding a rename module, e.g. https://github.com/nf-core/raredisease/blob/master/modules/local/rename_align_files.nf?

@LouisLeNezet
Copy link
Contributor Author

Renaming the file would add a new step, but it would be cleaner for the nf-core repository.
I will do so.

@LouisLeNezet
Copy link
Contributor Author

@fellen31 thanks a lot for the help.
The module should now be cleaner and I will adapt the pipeline to it.
Is it good for you now ?

Copy link
Contributor

@fellen31 fellen31 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, while renaming would be an extra step, I think perhaps it's worth it to keep this module a bit simpler.

modules/nf-core/bcftools/pluginsplit/main.nf Outdated Show resolved Hide resolved
modules/nf-core/bcftools/pluginsplit/main.nf Outdated Show resolved Hide resolved
modules/nf-core/bcftools/pluginsplit/main.nf Outdated Show resolved Hide resolved
@LouisLeNezet
Copy link
Contributor Author

@fellen31 is it better now ?

Copy link
Contributor

@SPPearce SPPearce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. You are now making the files both inside an output directory and you are renaming them with a prefix?

To summarise, the module takes one vcf file and makes multiple, but sometimes the output file might have the same name as the input? At which point, why don't you stage the input file into a folder, then make all the output files in the work directory?

@LouisLeNezet
Copy link
Contributor Author

I don't understand. You are now making the files both inside an output directory and you are renaming them with a prefix?

No they were staying in the outputDir.

To summarise, the module takes one vcf file and makes multiple, but sometimes the output file might have the same name as the input? At which point, why don't you stage the input file into a folder, then make all the output files in the work directory?

This is an excellent suggestion ! I didn't though of it...

@fellen31
Copy link
Contributor

Do you still need to have a prefix/suffix to in the output files in this module?

The prefix is usually meta.id which often corresponds to the sample ID, but sometimes you want to change that. But in this module the file names will already have the sample names fixed in the output file names, so I don't really see why you would need a prefix if you can rename the files within your pipeline?

@LouisLeNezet
Copy link
Contributor Author

Do you still need to have a prefix/suffix to in the output files in this module?

I don't think if we keep both input and output in separate folder.

Copy link
Contributor

@SPPearce SPPearce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, looks good to me then!

@LouisLeNezet LouisLeNezet added this pull request to the merge queue Nov 21, 2024
@LouisLeNezet
Copy link
Contributor Author

Thanks to both of you @SPPearce and @fellen31 !!

Merged via the queue into nf-core:master with commit 14c910a Nov 21, 2024
23 checks passed
@LouisLeNezet LouisLeNezet deleted the bcftools_pluginsplit branch November 21, 2024 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants