Building Automated Evaluators
I'm working through Hamel Husain and Shreya Shankar's AI Evals For Engineers & PMs course at the moment. I wanted to process and share my current understanding of how to evaluate the AI part when you're building an AI feature.
First, instrument your application
Instead of manually checking every AI response, you need a way to review thousands of responses quickly. There are tools for this (I've used langfuse and braintrust so far), or you can build your own dashboard (which is useful if you use AI in a complex way that standard tools will struggle with).
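To make the idea concrete, here's a minimal sketch of what rolling your own instrumentation can look like: log every prompt/response pair with enough metadata to review it later. The `log_llm_call` helper and the JSONL file are invented for illustration; in practice the langfuse or braintrust SDKs handle this for you.

```python
import json
import time
import uuid

LOG_PATH = "llm_responses.jsonl"  # hypothetical local log; a real setup would use langfuse/braintrust

def log_llm_call(prompt: str, response: str, metadata: dict | None = None) -> None:
    """Append one prompt/response pair to a JSONL file for later review."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "prompt": prompt,
        "response": response,
        "metadata": metadata or {},
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example usage inside your feature code:
# response = call_your_model(prompt)  # however you call the LLM
# log_llm_call(prompt, response, {"user_id": user_id, "feature": "summariser"})
```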
Then establish your failure modes
Once you have data to work with, the next step is to look at your responses. Go through about 100 responses or session logs and grade them one by one. The goal here is to develop a list of all the different types of problems that exist: these are your failure modes.
It is tempting to try to automate this step, but Hamel and Shreya stress the importance of doing it manually. If you don't have a clear understanding of the problems, the quality of your data will be contaminated from this point onwards. If you define your failure modes upfront and then review the data to find examples, confirmation bias will inflate the problems you expect to see and obscure the ones you weren't expecting.
Now figure out how prevalent each error type is
There are two ways to do this. The first is with code-based evaluators. This means writing conventional, deterministic code to detect an error (regex, string length checks, the presence of certain tool invocations, etc.).
When you can detect an error using deterministic code, you should. It's fast, reliable and free.
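As a rough illustration, here's what a few code-based evaluators might look like. The failure modes (leaked raw JSON, an over-long response, a missing tool call) and the `search_docs` tool name are hypothetical; substitute checks that match your own failure modes.

```python
import re

def check_no_raw_json(response: str) -> bool:
    """Pass if the response doesn't start with raw JSON leaked from a tool call."""
    return not re.search(r"^\s*[{\[]", response)

def check_length(response: str, max_chars: int = 1200) -> bool:
    """Pass if the response stays within the length budget."""
    return len(response) <= max_chars

def check_used_search_tool(tool_calls: list[str]) -> bool:
    """Pass if the expected tool was invoked at least once."""
    return "search_docs" in tool_calls

def run_code_evals(response: str, tool_calls: list[str]) -> dict[str, bool]:
    """Run all deterministic checks against one logged response."""
    return {
        "no_raw_json": check_no_raw_json(response),
        "within_length": check_length(response),
        "used_search_tool": check_used_search_tool(tool_calls),
    }
```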
LLM-as-Judge Evaluators
When you need to make subjective judgments like "Is this response helpful?" or "Is the tone appropriate here?", conventional code won't cut it.
Setting up a good LLM Judge means building a mini AI feature in its own right, with its own prompt (separate from your application's prompt) and its own data set (about 100 responses you've manually marked "good" or "bad").
Then you split your data into three groups (see the sketch after this list):
- Training set (20% of the 100 marked responses): Used as examples in your prompt
- Development set (40%): To test and improve your judge
- Test set (40%): For the final scoring
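The split itself is just a shuffle and a slice. A minimal sketch, assuming each example is a dict holding a response and its human label:

```python
import random

def split_labelled_data(examples: list[dict], seed: int = 42):
    """Split ~100 labelled responses into 20% train / 40% dev / 40% test."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.2 * n)
    n_dev = int(0.4 * n)
    train = shuffled[:n_train]               # few-shot examples for the judge prompt
    dev = shuffled[n_train:n_train + n_dev]  # iterate on the judge against these
    test = shuffled[n_train + n_dev:]        # touch only once, for final scoring
    return train, dev, test
```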
Now you have to iterate on your judge's prompt until it agrees with your labels. The goal is >90% for both the True Positive Rate (TPR) and the True Negative Rate (TNR); there's a sketch of the calculation after the definitions below.
- TPR - How often the LLM correctly marks your passing responses as passes.
- TNR - How often the LLM correctly marks your failing responses as failures.
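Measuring agreement is a straightforward comparison between the judge's verdicts and your human labels. A minimal sketch, assuming both are recorded as "pass"/"fail" strings:

```python
def judge_agreement(human_labels: list[str], judge_labels: list[str]) -> dict[str, float]:
    """Compare judge verdicts against human labels ("pass"/"fail") and return TPR/TNR."""
    pairs = list(zip(human_labels, judge_labels))
    passes = [j for h, j in pairs if h == "pass"]
    fails = [j for h, j in pairs if h == "fail"]
    tpr = sum(j == "pass" for j in passes) / len(passes)  # agreement on human-labelled passes
    tnr = sum(j == "fail" for j in fails) / len(fails)    # agreement on human-labelled fails
    return {"TPR": tpr, "TNR": tnr}

# e.g. judge_agreement(["pass", "fail", "pass"], ["pass", "fail", "fail"])
# -> {"TPR": 0.5, "TNR": 1.0}
```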
A good judge prompt will evolve as you iterate on it, but here are some fundamentals you will need to cover (there's a sketch of a full judge prompt after this list):
- A Clear task description: Exactly what you want evaluated
- Precise pass/fail definitions: What counts as good vs bad
- Few-shot examples: Show the judge 2-3 examples of good and bad responses
- Structured output: Ask for reasoning plus a final judgment
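Here's a sketch of a judge prompt covering those four elements for one hypothetical failure mode (unsupported claims from a documentation assistant). The wording, examples, and failure mode are all invented for illustration.

```python
# Fill the placeholders with JUDGE_PROMPT.format(context=..., response=...)
JUDGE_PROMPT = """You are evaluating responses from a documentation assistant.

Task: decide whether the response makes claims that are not supported by the
provided context.

Definitions:
- PASS: every factual claim in the response is supported by the context.
- FAIL: the response contains at least one claim not supported by the context.

Examples:
Context: "The free plan includes 3 projects."
Response: "You can create up to 3 projects on the free plan."
Verdict: PASS

Context: "The free plan includes 3 projects."
Response: "The free plan includes unlimited projects."
Verdict: FAIL

Now evaluate the following. First explain your reasoning in one or two
sentences, then give a final verdict of PASS or FAIL on its own line.

Context: {context}
Response: {response}
"""
```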
Some other gotchas to watch out for...
- Focus on one specific failure mode at a time - don't ask the judge to check 15 things at once. When in doubt, split the process into multiple judges that each evaluate one specific thing.
- Make the criteria binary - pass or fail, not a 1-10 score.
- Never leak test data - don't accidentally include test examples in your judge prompt.
- Building judges for vague criteria like "helpfulness" - the real work here is building out a precise rubric for what "helpful" means for your application in this specific context.