feature request: fail if any unknown files exist #182

Open
sstrong99 opened this issue Aug 4, 2023 · 9 comments

Comments

@sstrong99

As far as I'm aware, it is currently only possible to check that specified output files match certain criteria, and to specify that a certain file must not exist. It would also be useful for me to specify that NO files other than the specified ones exist in a given directory.

@rhpvorderman
Member

Can you elaborate on this use case a bit? You have a workflow that generates files with completely random filenames (i.e. timestamped) when an error occurs?

How do you propose this would be implemented in an elegant fashion? Wouldn't it be easier to simply use the custom test interface for this problem?

@sstrong99
Author

I am working with a very large and complex workflow. When someone adds a new process, they sometimes don't intend for its outputs to be published to the results directory, because it is an intermediate process; if they misconfigure things, those files can get published anyway. It would be nice if those mistakes could be caught in an automated way using this package.

It may be easier to do this with the custom test interface. I'll look into that.

I'm not familiar with the code itself, so I'm not sure of a good implementation, but the interface could look something like this:

- name: test1
  command: ...
  files:
    - path: output/ # the trailing slash indicates this is a directory, not a file
      should_exist: false # indicates that nothing in this directory should exist unless explicitly allowed
    - path: "output/file.txt"

@rhpvorderman
Member

rhpvorderman commented Aug 8, 2023

I do think that is one of the best possible interfaces, but I don't like how the meaning of should_exist would suddenly depend on context. That would make the YAML harder to read.

What about a custom test?

import pathlib
from typing import Optional, Set

import pytest


def check_allowed_paths(
        directory: pathlib.Path,
        allowed_paths: Set[str],
        root: Optional[pathlib.Path] = None,
):
    """Recursively check that every file under directory is listed in allowed_paths."""
    if root is None:
        root = directory
    for path in directory.iterdir():
        if path.is_dir():
            check_allowed_paths(path, allowed_paths, root)
        elif path.is_file() or path.is_symlink():
            relpath = path.relative_to(root)
            if str(relpath) not in allowed_paths:
                raise FileExistsError(f"{relpath} not allowed!")


ALL_WORKFLOW_FILES = [
    "bam/test1.bam",
    "vcf/test1.vcf",
]

MULTISAMPLE_FILES = [
    "bam/test2.bam",
    "vcf/test2.vcf",
]

ANNOTATED_FILES = [
    "vep/test1.vep.vcf",
]


@pytest.mark.workflow("annotated")
def test_annotated_workflow(workflow_dir):
    check_allowed_paths(
        workflow_dir, set(ALL_WORKFLOW_FILES + ANNOTATED_FILES))


@pytest.mark.workflow("multisample")
def test_multisample_workflow(workflow_dir):
    check_allowed_paths(
        workflow_dir, set(ALL_WORKFLOW_FILES + MULTISAMPLE_FILES))

EDIT: Even better: you could use the yaml library and __file__ to locate the test YAML file (it should be possible to state its location relative to this file) and parse it. Then you can take the paths directly from the YAML to define the allowed files, eliminating code redundancy. I leave that as an exercise to the reader though; I am not going to code that in a GitHub comment window ;-).
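A minimal sketch of that idea, reusing check_allowed_paths from the snippet above and assuming the pytest-workflow YAML sits next to the test module under the hypothetical name test_workflow.yml with the usual files:/path: layout:

import pathlib
from typing import Set

import pytest
import yaml

# Hypothetical location; adjust to wherever the pytest-workflow YAML actually lives.
TEST_YAML = pathlib.Path(__file__).parent / "test_workflow.yml"


def allowed_paths_for(workflow_name: str) -> Set[str]:
    """Collect the declared `path` entries for one workflow from the test YAML."""
    with TEST_YAML.open() as handle:
        tests = yaml.safe_load(handle)
    for test in tests:
        if test["name"] == workflow_name:
            return {
                entry["path"]
                for entry in test.get("files", [])
                if entry.get("should_exist", True)  # skip files declared absent
            }
    raise KeyError(f"No workflow named {workflow_name!r} in {TEST_YAML}")


@pytest.mark.workflow("annotated")
def test_annotated_no_unknown_files(workflow_dir):
    # check_allowed_paths as defined in the earlier snippet
    check_allowed_paths(workflow_dir, allowed_paths_for("annotated"))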

@sstrong99
Author

sstrong99 commented Aug 8, 2023

Thanks! I started looking through the code to see if there was a way to automatically append this custom test to every workflow, without having to copy/paste the custom test for every new workflow. In the process I put together a rough example implementation of this in the code: #183

The interface is a bit different from what I mentioned above. In order to check for unknown files you'd add the fail_if_unknown key to a workflow test in the YAML, e.g.

- name: test1
  command: ...
  fail_if_unknown:
    - output/
    - some_other_dir/
  files:
    - path: "output/file.txt"

@rhpvorderman
Member

Sorry, I think this feature is too specific to your needs to warrant a change to the test YAML. Why not write a custom test as mentioned above?
You could automate that as well with pytest.mark.parametrize, parsing the YAML file yourself.
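A rough sketch of that approach, assuming pytest-workflow still resolves the workflow marker when it is attached per case via pytest.param, and using an illustrative YAML filename:

import pathlib

import pytest
import yaml

# Hypothetical location; adjust to the real pytest-workflow YAML.
TEST_YAML = pathlib.Path(__file__).parent / "test_workflow.yml"

with TEST_YAML.open() as handle:
    _WORKFLOW_TESTS = yaml.safe_load(handle)

# One parametrized case per workflow in the YAML; attaching the workflow marker
# per case should make the matching workflow_dir fixture available to each test.
_CASES = [
    pytest.param(
        {entry["path"] for entry in test.get("files", [])},
        id=test["name"],
        marks=pytest.mark.workflow(test["name"]),
    )
    for test in _WORKFLOW_TESTS
]


@pytest.mark.parametrize("allowed_paths", _CASES)
def test_no_unknown_files(workflow_dir, allowed_paths):
    # check_allowed_paths as defined in the earlier comment
    check_allowed_paths(workflow_dir, allowed_paths)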

@sstrong99
Author

Makes sense. The only reason I thought a custom test wasn't quite as good is that you have to duplicate the test collection. Do you know if there's a way to hook into pytest's test collection to make sure that the custom test applies to all the relevant YAML files?

If not, another idea would be to extend pytest.mark.workflow() to accept no workflow name, in which case it would apply to all workflows.

@rhpvorderman
Member

rhpvorderman commented Aug 9, 2023

Makes sense. The only reason I thought a custom test wasn't quite as good is that you have to duplicate the test collection. Do you know if there's a way to hook into pytest's test collection to make sure that the custom test applies to all the relevant YAML files?

You can parse the YAML yourself. That should give you all the workflow names and all the relevant files. In that case there will be no duplication.

@sstrong99
Author

There is duplication in finding the YAML files: pytest's test collection handles that, and any difference between the custom code and pytest's test collection will result in missed tests.

@rhpvorderman
Member

Sorry for my late reply. I have been very busy lately and haven't kept up.

There is duplication in finding the YAML files: pytest's test collection handles that, and any difference between the custom code and pytest's test collection will result in missed tests.

Yes, there will be duplication. Sometimes a little duplication is better than adding extra complexity. It is possible to take DRY too far. I would take simplicity and a little duplication over complexity every single time. But you might have noticed that from the design of pytest-workflow already ;-).
