Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve reference metadata handling for EventSource #2648

Merged
merged 4 commits into from
Nov 18, 2024

Conversation

maxnoe
Copy link
Member

@maxnoe maxnoe commented Nov 12, 2024

The recent addition of attaching the reference metadata of input files to the provenance information implemented only that the reference metadata is read directly by ctapipe from the input file. This made it necessary to either support all possible input file types
or impossible for plugin event sources to provide this metadata on their own. It also makes the assumption 1 EventSource = 1 input file. This is not true for some event sources.

This issue is solved by:

  • Adding the possibility to directly provide the reference metadata to add_input_file
  • Move the responsibility of calling add_input_file from the EventSource baseclass to the implementation

This is an alternative implementation to #2644

The recent addition of attaching the reference metadata
of input files to the provenance information implemented
only that the reference metadata is read directly by
ctapipe from the input file. This made it necessary
to either support all possible input file types
or impossible for plugin event sources to provide this
metadata on their own. It also makes the assumption
1 EventSource = 1 input file. This is not true for some
event sources.

This issue is solved by:
* Adding the possibility to directly provide the
  reference metadata to `add_input_file`
* Move the responsibility of calling `add_input_file` from
  the EventSource baseclass to the implementation

This comment has been minimized.

This comment has been minimized.

@kosack
Copy link
Contributor

kosack commented Nov 13, 2024

Looks like a good idea. Probably need some tests, particularly to ensure that anyone implementing an EventSource plugin doesn't forget to add the input file to the provenance.

@maxnoe
Copy link
Member Author

maxnoe commented Nov 13, 2024

anyone implementing an EventSource plugin doesn't forget to add the input file to the provenance.

MMh. The only way we could enforce this I think is to add an abstract method to the EventSource that subclasses then have to implement.

However, I want to support event sources that have multiple input files, as I know those exist (and we'll have e.g. parallel zfits streams also for the ACADA data).

So the API should be something like this:

@abstractmethod
def get_input_reference_meta(self) -> dict[str, ReferenceMeta]:
    pass 

But the issue here is that I know that some event source implementations only open files one-by-one.

So I think the solution above is not right. It's really the event sources that have to internally call Provenance().add_input_file(path, reference_meta=reference_meta) for each file they open.

I don't think we can have a unit test here in ctapipe to enforce this.

@maxnoe
Copy link
Member Author

maxnoe commented Nov 13, 2024

Probably need some tests, particularly to ensure that anyone implementing an EventSource plugin doesn't forget to add the input file to the provenance.

Added explicit tests now for the two EventSource implementations we have here in ctapipe

@maxnoe maxnoe marked this pull request as ready for review November 13, 2024 12:26
Copy link

Passed

Analysis Details

3 Issues

  • Bug 0 Bugs
  • Vulnerability 0 Vulnerabilities
  • Code Smell 3 Code Smells

Coverage and Duplications

  • Coverage 98.80% Coverage (94.30% Estimated after merge)
  • Duplications 0.00% Duplicated Code (0.70% Estimated after merge)

Project ID: cta-observatory_ctapipe_AY52EYhuvuGcMFidNyUs

View in SonarQube

@maxnoe maxnoe added this to the 0.23.0 milestone Nov 13, 2024
@kosack
Copy link
Contributor

kosack commented Nov 14, 2024

I want to support event sources that have multiple input files, as I know those exist (and we'll have e.g. parallel zfits streams also for the ACADA data).

The reference metadata was designed specifically to be an output, not an input, so this may not be fully necessary to go that deep here. All we need to get is the info that we will eventually need to write out, so e.g. the full list of input product_ids is not necessary. The back-links to that info is what would be done in a provenance database derived from the provenance logs.

So I think the issue with multiple inputs in the first element in the processing chain is: how to assign a product_id to the set of products? For current data from e.g. LST et al, this is not solved, however for CTAO since we will in any case have to define a "observation dataset package" for a single observation (containing all event, monitoring, service data). That package will have a single product_id as well as obs_id that can be traced to its contents externally. For raw DL0 data that should never change, it could be that we define product_id == obs_id (in the case of DL0 observation datasets) and be done with it.

@maxnoe
Copy link
Member Author

maxnoe commented Nov 14, 2024

I am surprised to hear this... in the original issue #2571 (comment) and the implementation #2598 you supported the idea of getting the product ids for inputs

@kosack
Copy link
Contributor

kosack commented Nov 14, 2024

I am surprised to hear this... in the original issue #2571 (comment) and the implementation #2598 you supported the idea of getting the product ids for inputs

I think I didn't explain well: I do support this (product_ids are better than just a list of input filename, as they provide a real link with no possibility of collisions). This is a special case however: we are talking about the first thing in the processing chain (a "source") where we don't have control over how the inputs were generated. So my suggestion is we could support the case of inputs where we know there cannot be multiple reprocessed copies by just using something like the obs_id in that case. It's not perfect, but solves the issue of not having a UUID generated by things outside of our control. It only works for DL0 data.

The other issue is the difference between "Local Provenance" (inputs and outputs of an Activity) and"ReferenceMetedata (which relates only to a single Data Product). The former is for full provenance linking, and the latter only stores the forward-part of the provenance, and doesn't include the list of input files. The use cases are different: provenance is for looking at the full graph, while the ReferenceMetadata is for retrieving a file by common attributes. So I think we need the input product_ids in the local provanance, but not in the Reference metadata.

TLDR: we do need to read ReferencemMetadata of the inputs, as it is necessary for provenance. It is not propegated to the output ReferenceMetadata, however. To solve places where we have no product_id, we could just suggest that the EventSources put in the obs_id or something like that (I don't really like mixing them though, so maybe we need to think a bit on the model). In any case, this PR is good.

@maxnoe
Copy link
Member Author

maxnoe commented Nov 14, 2024

So I think we need the input product_ids in the local provanance, but not in the Reference metadata.

That's how this works here. It is attaching the reference metadata of input files to the provenance, not to the output reference meta.

Tobychev
Tobychev previously approved these changes Nov 18, 2024
@maxnoe maxnoe requested a review from Tobychev November 18, 2024 15:40
@maxnoe maxnoe merged commit c1e7163 into main Nov 18, 2024
12 checks passed
@maxnoe maxnoe deleted the reference_meta_event_source branch November 18, 2024 16:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants