guide to adding a new measure? #27

Open
cmacdonald opened this issue Nov 26, 2021 · 9 comments

@cmacdonald
Collaborator

Can I make a measure given a simple function like a lambda?

@seanmacavaney
Collaborator

seanmacavaney commented Nov 26, 2021

I'm happy to add this feature if it's a common enough situation. What's the case you have in mind? Is it an experimental measure you want to build yourself? Or interfacing with some other library (perhaps sklearn's precision/recall/F1 implementation or something similar)?

I don't think a lambda definition would be useful for most of the current providers. For efficiency purposes, it's super beneficial to be able to perform operations in batch. E.g., trec_eval builds its own structure in memory for efficiently looking up query/doc pairs.

A potential interface would be something like:

ir_measures.define(lambda runs, qrels: xxx)
# where:
#   - runs: a list of all document scores, perhaps as a dataframe or dictionary?
#   - qrels: similar to runs
#   - xxx would provide Metric values for every query

ir_measures.define_byquery(lambda run, qrels: xxx)
# where:
#  - run: document scores for a single query, perhaps as a dataframe or dictionary?
#  - qrels: similar to run (for matching query)
#  - xxx would return a single Metric value; assumes mean aggregation of the scores? (Or this could be an optional argument?)

I think the latter one would probably be useful in more situations.
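
For concreteness, a minimal sketch of what a byquery implementation function could look like under this proposal -- P@10 as a stand-in, assuming run and qrels arrive as pandas dataframes with the usual query_id/doc_id/score and query_id/doc_id/relevance columns:

import pandas as pd

def p10(run: pd.DataFrame, qrels: pd.DataFrame) -> float:
    # top 10 documents by score for this single query
    top10 = run.sort_values('score', ascending=False).head(10)
    # doc_ids judged relevant for this query
    relevant = set(qrels.loc[qrels['relevance'] > 0, 'doc_id'])
    return top10['doc_id'].isin(relevant).sum() / 10

# P10 = ir_measures.define_byquery(p10)  # the proposed (not-yet-existing) API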

@cmacdonald
Collaborator Author

This was primarily for things like ROUGE in #28.
In your API, presumably a measure name would have to be defined also?

My thinking was that in PyTerrier sometimes we would want to define an additional kind of measure to report in a pt.Experiment() (e.g. fairness ;-)

@seanmacavaney
Collaborator

I imagined that the above methods would return a measure. E.g.,

MyAwesomeFairnessMeasure = ir_measures.define_byquery(lambda run, qrels: xxx)

which then could be used alongside other measures.
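
E.g., a sketch of how that might look once implemented -- using a stand-in per-query computation (score spread) where a real fairness score would go, and the qrels-first argument order that appears later in this thread; calc_aggregate is the existing entry point:

import pandas as pd
import ir_measures
from ir_measures import nDCG

qrels = pd.DataFrame([{'query_id': 'q1', 'doc_id': 'd1', 'relevance': 1}])
run = pd.DataFrame([
    {'query_id': 'q1', 'doc_id': 'd1', 'score': 1.5},
    {'query_id': 'q1', 'doc_id': 'd2', 'score': 0.3},
])

# stand-in per-query computation where a real fairness score would go
ScoreSpread = ir_measures.define_byquery(
    lambda qrels, run: run['score'].max() - run['score'].min())

# the runtime-defined measure mixes with built-in measures
print(ir_measures.calc_aggregate([nDCG@10, ScoreSpread], qrels, run))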

But what are the situations where you'd want to define a measure but not write an optimized & shareable version of it?

@cmacdonald
Collaborator Author

> But what are the situations where you'd want to define a measure but not write an optimized & shareable version of it?

Same reason as https://pyterrier.readthedocs.io/en/latest/terrier-retrieval.html#custom-weighting-models
To allow trying something out...

@cmacdonald
Collaborator Author

> But what are the situations where you'd want to define a measure but not write an optimized & shareable version of it?

Another example - I have a column in the results dataframe that I would like to summarise and report as part of the measurements.

@seanmacavaney
Collaborator

Makes sense. Would the define and define_byquery proposal above meet those needs?

@seanmacavaney
Collaborator

@cmacdonald I have a prototype of "runtime-defined" measures in the runtime branch. See an example usage of them in the test here.

Does it look reasonable? You mention using lambdas above, and while this is supported, I struggle to find anything very meaningful to do in just an inline function (though I'm far from a pandas ninja).

I'm open to alternative names for this feature as well. A similar feature, which I'm calling local datasets, is currently WIP in ir-datasets -- but unlike this feature, those are persisted to disk. Runtime-defined measures only last as long as the Python interpreter is running, and I don't really see a way around that.

What these don't (yet?) support:

  • Parameters, e.g., MyMeasure(rel=2)@5 wouldn't be supported. This would probably need to be a third argument passed to the implementation function/lambda.
  • Alternative aggregators -- only mean is supported for now. (This could be an additional argument to define and define_byquery?)
  • Alternative input formats -- only pandas dataframes are currently supported.
  • When using define_byquery, every query that appears in the run must produce exactly one score. define is more flexible and can return multiple metrics, perform whatever filtering it likes, etc., but it's a little trickier to use.
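
As a rough sketch of the byquery form under these constraints (column names assumed to follow ir_measures' usual query_id/doc_id/score/relevance conventions; Judged10 is just an illustration):

import pandas as pd
import ir_measures

# returns one float for each query in the run; these get mean-aggregated
def judged10(qrels: pd.DataFrame, run: pd.DataFrame) -> float:
    top10 = run.sort_values('score', ascending=False).head(10)
    # fraction of the top 10 that have any judgment in the qrels
    return top10['doc_id'].isin(qrels['doc_id']).mean()

Judged10 = ir_measures.define_byquery(judged10)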

@cmacdonald
Collaborator Author

My use case involves averaging row values (e.g. doc length) while conducting an experiment. I'll try it out in a Colab. A rank cutoff would be useful, e.g. avg doc len@5, avg doc len@10.

@seanmacavaney
Collaborator

Ah, sure, so if you don't need to do any merging with the qrels, things can be easier. E.g.,

AvgDoclen = ir_measures.define_byquery("AvgDoclen", lambda qrels, run: run['doc_id'].apply(index.get_doclen).mean())

Since the rank cutoff is probably common and well-defined, this could be something easy to switch on as an additional (optional) argument:

AvgDoclen = ir_measures.define_byquery("AvgDoclen", lambda qrels, run: run['doc_id'].apply(index.get_doclen).mean(), rank_cutoff=True)
AvgDoclen@5 # would automatically filter down the result list to the top 5
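
An experiment could then request several depths at once, e.g. (a sketch, with qrels/run being the usual inputs):

results = ir_measures.calc_aggregate([AvgDoclen@5, AvgDoclen@10, AvgDoclen], qrels, run)
# -> one aggregate value per (measure, cutoff) combination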

seanmacavaney added a commit that referenced this issue Mar 4, 2022

* runtime-defined measures (see #27)

seanmacavaney added a commit that referenced this issue Mar 4, 2022

* reworking runtime-defined measures based on feedback from @cmacdonald
 - Support "cutoff" parameter (default on)
 - Name optional (defaults to repr of impl)
 - Runtime-defined measures don't get registered