Have small prompts that do one thing, and only one thing, well. E.g. instead of a single catch-all prompt, split it into separate prompts that are simple, focused, and easy to understand, so each prompt can be evaluated separately.
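A minimal sketch of what such a split could look like; the `llm()` helper and the prompt wording are placeholders, not anything the project already has:

```python
def llm(prompt: str) -> str: raise NotImplementedError  # placeholder: wire up the real model client

# One catch-all prompt is hard to evaluate...
CATCH_ALL = "Answer the user's question, summarize the docs, and suggest next steps: {query}"

# ...so split it into focused prompts that can each get their own eval set.
ANSWER_PROMPT = "Answer the following OpenROAD question concisely:\n{query}"
NEXT_STEPS_PROMPT = "Given this answer, suggest one concrete next step for the user:\n{answer}"

def answer_question(query: str) -> str:
    answer = llm(ANSWER_PROMPT.format(query=query))
    follow_up = llm(NEXT_STEPS_PROMPT.format(answer=answer))
    return f"{answer}\n\nNext step: {follow_up}"
```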
RAG information density: if two documents are equally relevant, we should prefer the one that is more concise and has fewer erroneous details.
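One way this could be wired into retrieval re-ranking; the `RetrievedDoc` shape and the tolerance value are assumptions, not an existing API:

```python
from dataclasses import dataclass

@dataclass
class RetrievedDoc:
    text: str
    relevance: float  # similarity score from the retriever

def rerank(docs: list[RetrievedDoc], relevance_tolerance: float = 0.02) -> list[RetrievedDoc]:
    """Sort by relevance, but break near-ties in favour of shorter (denser) documents."""
    # Bucket relevance so scores within the tolerance compare as equal,
    # then prefer fewer tokens (approximated here by word count).
    return sorted(
        docs,
        key=lambda d: (-round(d.relevance / relevance_tolerance), len(d.text.split())),
    )
```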
Multistep workflows: include reflection/CoT prompting for small tasks.
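A rough sketch of a draft → critique → revise loop, reusing a placeholder `llm()` call; none of these prompts are final:

```python
def llm(prompt: str) -> str: raise NotImplementedError  # placeholder model client

def answer_with_reflection(question: str) -> str:
    draft = llm(f"Think step by step, then answer:\n{question}")
    critique = llm(
        "Review the draft answer below for factual or logical errors. "
        f"List any problems you find.\n\nQuestion: {question}\nDraft: {draft}"
    )
    return llm(
        "Rewrite the draft answer, fixing the problems listed in the critique.\n"
        f"Question: {question}\nDraft: {draft}\nCritique: {critique}"
    )
```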
Increase output diversity beyond just raising temperature. E.g. when the user asks for a solution to problem XX, keep a list of recent responses and tell the LLM: "do not suggest any responses from the following:"
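Sketch of keeping a small window of recent responses and excluding them in the prompt; the window size of 5 is arbitrary and `llm()` is again a placeholder:

```python
from collections import deque

def llm(prompt: str) -> str: raise NotImplementedError  # placeholder model client

recent_responses: deque[str] = deque(maxlen=5)  # keep only the last few suggestions

def diverse_suggestion(query: str) -> str:
    avoid = "\n".join(f"- {r}" for r in recent_responses)
    prompt = (
        f"Suggest a solution to the following problem:\n{query}\n\n"
        f"Do not suggest any responses from the following:\n{avoid or '- (none yet)'}"
    )
    response = llm(prompt)
    recent_responses.append(response)
    return response
```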
Prompt caching: e.g. for common questions/functions. Use features like autocomplete, spelling correction, and suggested queries to normalize user input and increase the cache hit rate.
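A possible normalization-plus-cache layer (illustrative only; a real system would likely add spelling correction and map queries to canonical suggested queries before caching):

```python
import functools

def llm(prompt: str) -> str: raise NotImplementedError  # placeholder model client

def normalize(query: str) -> str:
    """Cheap normalization so near-identical queries hit the same cache entry."""
    return " ".join(query.lower().strip().split()).rstrip("?")

@functools.lru_cache(maxsize=1024)
def cached_answer(normalized_query: str) -> str:
    return llm(normalized_query)

def answer(query: str) -> str:
    return cached_answer(normalize(query))
```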
Simple assertion-based unit tests, e.g. for questions like the following (see the pytest sketch after this list):
- What does each tool abbreviation in OR mean?
- What are the supported public PDKs?
- What are the supported OSes?
- What are the social media links?
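A pytest sketch of what such assertion tests might look like; the `chatbot.ask` entry point and the expected substrings are hypothetical and should be adjusted to the real interface and documentation:

```python
# test_faq.py -- run with `pytest`
from chatbot import ask  # hypothetical module exposing the chatbot entry point

def test_tool_abbreviations():
    answer = ask("What does each tool abbreviation in OR mean?")
    for abbrev in ("STA", "CTS", "RCX"):   # illustrative expected abbreviations
        assert abbrev in answer

def test_supported_pdks():
    answer = ask("What are the supported public PDKs?")
    assert "sky130" in answer.lower()

def test_supported_oses():
    answer = ask("What are the supported OSes?")
    assert "ubuntu" in answer.lower()
```

These are cheap to run in CI and catch regressions without needing an LLM judge.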
Pairwise evaluation. How is this different from normal (pointwise) scoring? Sometimes a few responses (from different LLMs) are rated the same score on G-Eval; use pairwise evaluation to force a winner, e.g. with a judge prompt like the one sketched below.
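A possible judge prompt for forcing a winner; the wording and the position-bias handling are illustrative, not a fixed recipe:

```python
def llm(prompt: str) -> str: raise NotImplementedError  # placeholder judge model call

PAIRWISE_PROMPT = """You are comparing two answers to the same question.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Which answer is more accurate and helpful? Reply with exactly "A" or "B"."""

def pairwise_winner(question: str, answer_a: str, answer_b: str) -> str:
    verdict = llm(PAIRWISE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )).strip()
    # To control for position bias, real setups usually run the comparison
    # again with A and B swapped and keep only consistent verdicts.
    return answer_a if verdict.upper().startswith("A") else answer_b
```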
Needle-in-a-haystack (NIAH) evals: insert a known fact into a long context and check that the model can retrieve it (sketch below).
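Sketch of a NIAH check, again using the placeholder `llm()` call; where the needle is inserted and how the answer is matched are arbitrary choices here:

```python
def llm(prompt: str) -> str: raise NotImplementedError  # placeholder model client

def niah_eval(haystack_paragraphs: list[str], needle: str, question: str, expected: str) -> bool:
    """Insert a known fact (the needle) into a long context and check retrieval."""
    middle = len(haystack_paragraphs) // 2
    context = "\n\n".join(
        haystack_paragraphs[:middle] + [needle] + haystack_paragraphs[middle:]
    )
    answer = llm(f"Context:\n{context}\n\nQuestion: {question}")
    return expected.lower() in answer.lower()
```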
If evals are reference-free, you can also use them as a guardrail (don't show the output if it scores too low). E.g. summarization evals, where all you need is the input prompt and there is no need for a gold summarization "reference".
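Sketch of a reference-free summarization eval doubling as a guardrail; the 1-to-5 rubric and the threshold are arbitrary choices, and `llm()` is the same placeholder as above:

```python
def llm(prompt: str) -> str: raise NotImplementedError  # placeholder model client

SUMMARY_EVAL_PROMPT = """Rate the following summary of the source text from 1 (unusable)
to 5 (faithful and complete). Reply with a single digit.
Source:
{source}
Summary:
{summary}"""

def guarded_summary(source: str, min_score: int = 4) -> str | None:
    summary = llm(f"Summarize:\n{source}")
    score = int(llm(SUMMARY_EVAL_PROMPT.format(source=source, summary=summary)).strip()[0])
    # Reference-free: the judge only needs the source and the candidate summary,
    # so the same check can run online as a guardrail.
    return summary if score >= min_score else None  # None -> retry or fall back
```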
Guardrails
Use Gemini guardrails to identify harmful/offensive output and PII.
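A hedged sketch using the `google-generativeai` Python SDK's safety settings (exact enum names can differ between SDK versions), plus a naive regex email check standing in for a real PII detector:

```python
import os
import re
import google.generativeai as genai
from google.generativeai.types import HarmBlockThreshold, HarmCategory

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel(
    "gemini-1.5-flash",
    safety_settings={
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
        HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    },
)

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # illustrative PII check only

def guarded_generate(prompt: str) -> str | None:
    response = model.generate_content(prompt)
    # If the response was blocked by safety settings, .text raises;
    # production code should inspect response.prompt_feedback and handle it.
    text = response.text
    if EMAIL_RE.search(text):
        return None  # redact or refuse instead of returning PII
    return text
```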
Development-prod skew: measure skew between LLM input/output pairs, e.g. length of inputs/outputs and adherence to specific formatting requirements. For more advanced drift detection, consider clustering embeddings to detect semantic drift (i.e. users discussing topics not seen before).
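One possible implementation of both signals; the cluster count is a guess and `scikit-learn` is assumed to be available:

```python
import numpy as np
from sklearn.cluster import KMeans

def length_skew(dev_inputs: list[str], prod_inputs: list[str]) -> float:
    """Crude skew signal: ratio of mean prod input length to mean dev input length."""
    mean_len = lambda xs: sum(len(x.split()) for x in xs) / len(xs)
    return mean_len(prod_inputs) / mean_len(dev_inputs)

def topic_drift(dev_embeddings: np.ndarray, prod_embeddings: np.ndarray, k: int = 8) -> float:
    """Fit clusters on dev-time traffic; a large average distance of prod points to their
    nearest dev centroid suggests users are discussing topics not seen before."""
    km = KMeans(n_clusters=k, n_init=10).fit(dev_embeddings)
    distances = km.transform(prod_embeddings).min(axis=1)
    return float(distances.mean())
```

Whatever metric is used, the alert threshold needs to be calibrated against a period of known-good traffic.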
Hold-out datasets for evals must be reflective of real user interactions.
Always log LLM outputs; store them in a separate DB.
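A minimal SQLite logging sketch; the table schema is made up and any separate store would do:

```python
import sqlite3
import time

conn = sqlite3.connect("llm_logs.db")  # separate DB, not the application store
conn.execute(
    """CREATE TABLE IF NOT EXISTS llm_calls (
           ts REAL, user_query TEXT, model_output TEXT, model TEXT
       )"""
)

def log_call(user_query: str, model_output: str, model: str) -> None:
    conn.execute(
        "INSERT INTO llm_calls VALUES (?, ?, ?, ?)",
        (time.time(), user_query, model_output, model),
    )
    conn.commit()
```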
Data flywheel
This links back to the automated feedback loop: bad examples can be used to train hallucination classifiers, and relevance annotations can be used to train a relevance reward model (https://arxiv.org/abs/2009.01325).