We need to figure out which metrics we can and should use to evaluate our models, and what data is needed to compute them.
We will probably want to distinguish between evaluations during prototyping (which need to be largely automated) and metrics used later and evaluated by users via e.g. A/B testing. For now, let's consider only the former: metrics we can apply now with some data.
@zhao1402072392 will look into which metrics we should use for evaluating retrieval results. These will most likely come from classic information retrieval rather than newer LLM-based evaluation schemes, since we prefer evaluating against ground truths (see the sketch below). While our labels may still be somewhat subjective, at least they will be consistent and (hopefully) more accurate.
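For reference, here is a minimal sketch of the kind of classic IR metrics this could cover (precision@k, recall@k, MRR), assuming each query yields a ranked list of document IDs and we have binary ground-truth relevance labels. The function names and example data are hypothetical, not part of any agreed design:

```python
# Sketch of standard IR metrics over ranked results and binary relevance labels.
# All names and data here are illustrative placeholders.

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / len(relevant)

def mean_reciprocal_rank(all_retrieved, all_relevant):
    """Average over queries of 1/rank of the first relevant document."""
    reciprocal_ranks = []
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        rank = next((i + 1 for i, doc in enumerate(retrieved) if doc in relevant), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Example: two queries with hand-labeled relevant documents.
retrieved = [["d3", "d1", "d7"], ["d2", "d9", "d4"]]
relevant = [{"d1", "d5"}, {"d9"}]
print(precision_at_k(retrieved[0], relevant[0], k=3))  # 1 of top 3 is relevant -> 0.333...
print(recall_at_k(retrieved[0], relevant[0], k=3))     # 1 of 2 relevant found -> 0.5
print(mean_reciprocal_rank(retrieved, relevant))       # (1/2 + 1/2) / 2 = 0.5
```

These only require the kind of consistent ground-truth labels mentioned above, so they fit the automated prototyping-stage evaluation we're targeting first.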