Have small prompts that do one thing, and only one thing, well. E.g. instead of a single catch-all prompt, split it into separate prompts that are simple, focused, and easy to understand, so each prompt can be evaluated separately.
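A minimal sketch of what such a split could look like; the `llm()` helper and the prompt wording are placeholders, not anything the project already has:

```python
def llm(prompt: str) -> str: raise NotImplementedError  # placeholder: wire up the real model client

# One catch-all prompt is hard to evaluate...
CATCH_ALL = "Answer the user's question, summarize the docs, and suggest next steps: {query}"

# ...so split it into focused prompts that can each get their own eval set.
ANSWER_PROMPT = "Answer the following OpenROAD question concisely:\n{query}"
NEXT_STEPS_PROMPT = "Given this answer, suggest one concrete next step for the user:\n{answer}"

def answer_question(query: str) -> str:
    answer = llm(ANSWER_PROMPT.format(query=query))
    follow_up = llm(NEXT_STEPS_PROMPT.format(answer=answer))
    return f"{answer}\n\nNext step: {follow_up}"
```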
RAG information density: if two documents are equally relevant, we should prefer the one that is more concise and has fewer erroneous details.
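One way this could be wired into retrieval re-ranking; the `RetrievedDoc` shape and the tolerance value are assumptions, not an existing API:

```python
from dataclasses import dataclass

@dataclass
class RetrievedDoc:
    text: str
    relevance: float  # similarity score from the retriever

def rerank(docs: list[RetrievedDoc], relevance_tolerance: float = 0.02) -> list[RetrievedDoc]:
    """Sort by relevance, but break near-ties in favour of shorter (denser) documents."""
    # Bucket relevance so scores within the tolerance compare as equal,
    # then prefer fewer tokens (approximated here by word count).
    return sorted(
        docs,
        key=lambda d: (-round(d.relevance / relevance_tolerance), len(d.text.split())),
    )
```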
Multistep workflows: include reflection/CoT prompting for small tasks.
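A rough sketch of a draft → critique → revise loop, reusing a placeholder `llm()` call; none of these prompts are final:

```python
def llm(prompt: str) -> str: raise NotImplementedError  # placeholder model client

def answer_with_reflection(question: str) -> str:
    draft = llm(f"Think step by step, then answer:\n{question}")
    critique = llm(
        "Review the draft answer below for factual or logical errors. "
        f"List any problems you find.\n\nQuestion: {question}\nDraft: {draft}"
    )
    return llm(
        "Rewrite the draft answer, fixing the problems listed in the critique.\n"
        f"Question: {question}\nDraft: {draft}\nCritique: {critique}"
    )
```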
Increase output diversity beyond just raising temperature. E.g. when the user asks for a solution to problem XX, keep a list of recent responses and tell the LLM: "do not suggest any responses from the following:"
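Sketch of keeping a small window of recent responses and excluding them in the prompt; the window size of 5 is arbitrary and `llm()` is again a placeholder:

```python
from collections import deque

def llm(prompt: str) -> str: raise NotImplementedError  # placeholder model client

recent_responses: deque[str] = deque(maxlen=5)  # keep only the last few suggestions

def diverse_suggestion(query: str) -> str:
    avoid = "\n".join(f"- {r}" for r in recent_responses)
    prompt = (
        f"Suggest a solution to the following problem:\n{query}\n\n"
        f"Do not suggest any responses from the following:\n{avoid or '- (none yet)'}"
    )
    response = llm(prompt)
    recent_responses.append(response)
    return response
```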
Prompt caching: e.g. for common questions/functions. Use features like autocomplete, spelling correction, and suggested queries to normalize user input and increase the cache hit rate.
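A possible normalization-plus-cache layer (illustrative only; a real system would likely add spelling correction and map queries to canonical suggested queries before caching):

```python
import functools

def llm(prompt: str) -> str: raise NotImplementedError  # placeholder model client

def normalize(query: str) -> str:
    """Cheap normalization so near-identical queries hit the same cache entry."""
    return " ".join(query.lower().strip().split()).rstrip("?")

@functools.lru_cache(maxsize=1024)
def cached_answer(normalized_query: str) -> str:
    return llm(normalized_query)

def answer(query: str) -> str:
    return cached_answer(normalize(query))
```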
Simple assertion-based unit tests, e.g. for questions like the following (see the pytest sketch after this list):
- What does each tool abbreviation in OR mean?
- What are the supported public PDKs?
- What are the supported OSes?
- What are the social media links?
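A pytest sketch of what such assertion tests might look like; the `chatbot.ask` entry point and the expected substrings are hypothetical and should be adjusted to the real interface and documentation:

```python
# test_faq.py -- run with `pytest`
from chatbot import ask  # hypothetical module exposing the chatbot entry point

def test_tool_abbreviations():
    answer = ask("What does each tool abbreviation in OR mean?")
    for abbrev in ("STA", "CTS", "RCX"):   # illustrative expected abbreviations
        assert abbrev in answer

def test_supported_pdks():
    answer = ask("What are the supported public PDKs?")
    assert "sky130" in answer.lower()

def test_supported_oses():
    answer = ask("What are the supported OSes?")
    assert "ubuntu" in answer.lower()
```

These are cheap to run in CI and catch regressions without needing an LLM judge.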
Pairwise evaluation. How is this different from normal (pointwise) scoring? Sometimes a few responses (from different LLMs) are rated the same score on G-Eval; use pairwise evaluation to force a winner, e.g. with a judge prompt like the one sketched below.
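A possible judge prompt for forcing a winner; the wording and the position-bias handling are illustrative, not a fixed recipe:

```python
def llm(prompt: str) -> str: raise NotImplementedError  # placeholder judge model call

PAIRWISE_PROMPT = """You are comparing two answers to the same question.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Which answer is more accurate and helpful? Reply with exactly "A" or "B"."""

def pairwise_winner(question: str, answer_a: str, answer_b: str) -> str:
    verdict = llm(PAIRWISE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )).strip()
    # To control for position bias, real setups usually run the comparison
    # again with A and B swapped and keep only consistent verdicts.
    return answer_a if verdict.upper().startswith("A") else answer_b
```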
Needle-in-a-haystack (NIAH) evals: insert a known fact into a long context and check that the model can retrieve it (sketch below).
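Sketch of a NIAH check, again using the placeholder `llm()` call; where the needle is inserted and how the answer is matched are arbitrary choices here:

```python
def llm(prompt: str) -> str: raise NotImplementedError  # placeholder model client

def niah_eval(haystack_paragraphs: list[str], needle: str, question: str, expected: str) -> bool:
    """Insert a known fact (the needle) into a long context and check retrieval."""
    middle = len(haystack_paragraphs) // 2
    context = "\n\n".join(
        haystack_paragraphs[:middle] + [needle] + haystack_paragraphs[middle:]
    )
    answer = llm(f"Context:\n{context}\n\nQuestion: {question}")
    return expected.lower() in answer.lower()
```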
If evals are reference-free, you can also use them as a guardrail (don't show the output if it scores too low). E.g. summarization evals, where all you need is the input prompt and there is no need for a gold summarization "reference".
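Sketch of a reference-free summarization eval doubling as a guardrail; the 1-to-5 rubric and the threshold are arbitrary choices, and `llm()` is the same placeholder as above:

```python
def llm(prompt: str) -> str: raise NotImplementedError  # placeholder model client

SUMMARY_EVAL_PROMPT = """Rate the following summary of the source text from 1 (unusable)
to 5 (faithful and complete). Reply with a single digit.
Source:
{source}
Summary:
{summary}"""

def guarded_summary(source: str, min_score: int = 4) -> str | None:
    summary = llm(f"Summarize:\n{source}")
    score = int(llm(SUMMARY_EVAL_PROMPT.format(source=source, summary=summary)).strip()[0])
    # Reference-free: the judge only needs the source and the candidate summary,
    # so the same check can run online as a guardrail.
    return summary if score >= min_score else None  # None -> retry or fall back
```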
Guardrails
Use Gemini guardrails to identify harmful/offensive output and PII.
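A hedged sketch using the `google-generativeai` Python SDK's safety settings (exact enum names can differ between SDK versions), plus a naive regex email check standing in for a real PII detector:

```python
import os
import re
import google.generativeai as genai
from google.generativeai.types import HarmBlockThreshold, HarmCategory

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel(
    "gemini-1.5-flash",
    safety_settings={
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
        HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    },
)

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # illustrative PII check only

def guarded_generate(prompt: str) -> str | None:
    response = model.generate_content(prompt)
    # If the response was blocked by safety settings, .text raises;
    # production code should inspect response.prompt_feedback and handle it.
    text = response.text
    if EMAIL_RE.search(text):
        return None  # redact or refuse instead of returning PII
    return text
```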
Development-prod skew: measure skew between LLM input/output pairs, e.g. length of inputs/outputs and adherence to specific formatting requirements. For more advanced drift detection, consider clustering embeddings to detect semantic drift (i.e. users discussing topics not seen before).
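One possible implementation of both signals; the cluster count is a guess and `scikit-learn` is assumed to be available:

```python
import numpy as np
from sklearn.cluster import KMeans

def length_skew(dev_inputs: list[str], prod_inputs: list[str]) -> float:
    """Crude skew signal: ratio of mean prod input length to mean dev input length."""
    mean_len = lambda xs: sum(len(x.split()) for x in xs) / len(xs)
    return mean_len(prod_inputs) / mean_len(dev_inputs)

def topic_drift(dev_embeddings: np.ndarray, prod_embeddings: np.ndarray, k: int = 8) -> float:
    """Fit clusters on dev-time traffic; a large average distance of prod points to their
    nearest dev centroid suggests users are discussing topics not seen before."""
    km = KMeans(n_clusters=k, n_init=10).fit(dev_embeddings)
    distances = km.transform(prod_embeddings).min(axis=1)
    return float(distances.mean())
```

Whatever metric is used, the alert threshold needs to be calibrated against a period of known-good traffic.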
Hold-out datasets for evals must be reflective of real user interactions.
Always log LLM outputs; store them in a separate DB.
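A minimal SQLite logging sketch; the table schema is made up and any separate store would do:

```python
import sqlite3
import time

conn = sqlite3.connect("llm_logs.db")  # separate DB, not the application store
conn.execute(
    """CREATE TABLE IF NOT EXISTS llm_calls (
           ts REAL, user_query TEXT, model_output TEXT, model TEXT
       )"""
)

def log_call(user_query: str, model_output: str, model: str) -> None:
    conn.execute(
        "INSERT INTO llm_calls VALUES (?, ?, ?, ?)",
        (time.time(), user_query, model_output, model),
    )
    conn.commit()
```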
Data flywheel
This links back to the automated feedback loop: bad examples can be used to train hallucination classifiers, and relevance annotations can be used to train a relevance reward model (https://arxiv.org/abs/2009.01325).