That would be a bit of an undertaking, given that the current architecture is built around the many-to-many nature of search result lists (`(query_id, doc_id) -> score` mappings) and qrels (`(query_id, doc_id) -> relevance` mappings). ROUGE and similar measures instead operate over `query_id -> text` mappings and `query_id -> [possible text answers]` mappings.
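Concretely, the difference in shape is something like this (illustrative Python with made-up query/doc ids):

```python
# Current qrel-oriented inputs: many-to-many over (query_id, doc_id) pairs.
run = {("q1", "d1"): 0.9, ("q1", "d2"): 0.4}   # (query_id, doc_id) -> score
qrels = {("q1", "d1"): 1, ("q1", "d2"): 0}     # (query_id, doc_id) -> relevance

# Text-oriented inputs for ROUGE and similar: keyed on query_id alone.
predictions = {"q1": "generated answer text"}                      # query_id -> text
references = {"q1": ["a gold answer", "another accepted answer"]}  # query_id -> [texts]
```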
But I could see how it could work. The structure already allows for various input data formats, so this new type of mapping would just be another one. If you requested a qrel-oriented measure but provided text mappings instead (or vice versa), it would just throw an error.
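That check could be as simple as the following sketch, assuming a hypothetical `kind` tag on both measures and provided inputs (neither exists today):

```python
from dataclasses import dataclass

@dataclass
class Measure:
    name: str
    kind: str  # "qrel" or "text" -- hypothetical attribute, not an existing API

def validate_inputs(measure: Measure, data_kind: str) -> None:
    # Fail fast on mismatched combinations instead of erroring mid-computation.
    if measure.kind != data_kind:
        raise TypeError(
            f"{measure.name} expects {measure.kind}-oriented mappings, "
            f"but {data_kind}-oriented mappings were provided"
        )
```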
I'd need to familiarise myself with the landscape of these measures too. IIRC there's a ton of fragmentation there as well.
It's also worth considering limiting the scope of this tool to qrel-oriented measures only.
I think for longer QA pipelines that involve retrieval and other NLP techniques (e.g. conversational QA), there could be something interesting in supporting this as part of pt.Experiment().
If QA were a stage of the pipeline, how could we measure ROUGE or similar metrics at the end of a PyTerrier pipeline?
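Something along these lines, perhaps? This is only a sketch of what it might look like if pt.Experiment grew support for text-oriented measures; the "ROUGE-L" metric string, the gold_answers mapping, and the bm25/qa_reader stages are all hypothetical placeholders, not existing PyTerrier APIs:

```python
import pyterrier as pt

# bm25 and qa_reader stand in for a retrieval stage and an answer-generating
# QA stage; topics is the usual topics dataframe. All are placeholders.
qa_pipeline = bm25 >> qa_reader

pt.Experiment(
    [qa_pipeline],
    topics,
    gold_answers,              # query_id -> [reference answers], in place of qrels
    eval_metrics=["ROUGE-L"],  # hypothetical text-oriented measure
)
```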