Pitfalls to avoid when building your first set of evals
The whole point of building an evaluation suite is so that you can systematically improve your AI product. Without evaluations, your only option is to respond to problems as they arise. Users complain about your app doing something unexpected, so you rush to fix it, and your only way to check the fix is to try it out on a handful of examples. You never really know if it's fixed or not; you just do your best. Then you realise you've broken something else that used to work, so you rush to fix that too. The whole process starts to feel like playing whack-a-mole, and it gets exhausting.
Quality is subjective, so it's hard to pin improvements down to a single measure. In practice, most applications use five to ten different evaluation metrics (correctness, personalisation, latency, etc.).
The way you set these metrics up is to start with your raw data: the actual conversations or queries that your users submit in each session. You must understand what people are doing with your product and how the AI component in your system responds. This sounds obvious, but you'd be surprised at how many people aren't aware of the different types of queries the products they build have to process. If you're blind to this on-the-ground reality, you can't build an effective product.
The easiest mistake to make when building your first set of evals is not looking at the actual conversations flowing through your product.
The big problem here is that there are usually way too many conversations to manually review. If your app runs thousands of sessions a day, it's impossible to read through every one. Capturing the diversity of queries your system deals with involves building tools and processes to surface and review the highest-priority data so that you can build the best mental model possible.
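One lightweight way to start is to sample a small, mixed batch of sessions for manual review rather than trying to read everything. Here's a minimal sketch, assuming your sessions are a list of dicts with hypothetical `flagged` and `channel` fields (swap in whatever metadata your own logging records):

```python
import random

def sample_for_review(sessions, per_channel=20, seed=42):
    """Pick a small, mixed batch of sessions to read manually.

    Assumes each session is a dict with hypothetical 'flagged' and 'channel'
    fields; adapt the grouping to the metadata you actually have.
    """
    rng = random.Random(seed)
    # Always include sessions that users flagged or rated poorly.
    sampled = [s for s in sessions if s.get("flagged")]
    rest = [s for s in sessions if not s.get("flagged")]
    # Then add a random slice from each channel so no query type gets missed.
    by_channel = {}
    for s in rest:
        by_channel.setdefault(s.get("channel", "unknown"), []).append(s)
    for group in by_channel.values():
        sampled.extend(rng.sample(group, min(per_channel, len(group))))
    return sampled
```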
If you are looking at your data, the next mistake is not looking at enough of it.
The goal here is to avoid reading a handful of examples and then forming your entire quality hypothesis off the back of five conversations. There are no hard numbers for how much data you should look at; the aim is to keep going until you stop surfacing new types of errors.
The tendency here is to outsource the manual process of reviewing conversations, either to an engineer or (even worse at this stage) to an LLM. Evaluation is not purely an engineering task; a real human specialist in your domain must weigh in on what good and bad conversations look like at some point. The manual work of reading conversations, analysing them and annotating the types of errors you find is the most important part of the eval-building process, because it's where you learn the most.
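There's no standard format for these annotations; a simple record per conversation is enough. A sketch, with failure labels invented purely for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """One reviewer's notes on a single conversation."""
    conversation_id: str
    failure_modes: list[str] = field(default_factory=list)  # e.g. ["wrong_tone"]
    notes: str = ""  # free-text observations; this is where new failure types emerge

# Example records; the labels here are made up for illustration.
annotations = [
    Annotation("conv_001", ["wrong_tone", "hallucinated_fact"],
               "made up the candidate's current employer"),
    Annotation("conv_002", [], "fine, user accepted the draft as-is"),
]
```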
Most people try to skip this step, partly because we're all lazy, but also because there isn't much industry guidance on how to do it well. If you do your best to analyse and label errors in your conversations, it sets you up for success in every other downstream phase of the eval-building process.
The reason analysis is so difficult, and why it's so important to do, is that when you're reading through a conversation, you don't yet know what you're looking for. To illustrate this, a great example I learnt from Shreya and Hamel's Eval course was based around an email assistant: an AI assistant that helps recruiters reach out to lots of people. One of the metrics they used in their evals was 'edit distance': if the AI produces something and the human edits it less, that seems better. But when they looked closely at the raw data, they found that lots of non-native English speakers were editing the results to be grammatically incorrect for this specific task, which made 'edit distance' a terrible metric. Before jumping to metrics, you have to look at your data to get a sense of what's coming in and what's going out, so that you can figure out what's actually going on. Only then will you be able to articulate an issue specifically enough to develop an evaluation for it.
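For context, an edit-distance-style metric is trivial to compute, which is exactly why it's tempting. A minimal sketch using Python's standard library (a difflib similarity ratio rather than a true Levenshtein distance):

```python
import difflib

def edit_similarity(ai_draft: str, final_sent: str) -> float:
    """Similarity between the AI's draft and what the human actually sent.

    1.0 means the draft was sent untouched; lower means heavier edits.
    As the email-assistant example shows, a low score is not necessarily bad.
    """
    return difflib.SequenceMatcher(None, ai_draft, final_sent).ratio()

print(edit_similarity("Hi John, I came across your profile...",
                      "Hi John, I come across your profile..."))
```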
If the first step in the process is looking at your data and figuring out what types of failures your app encounters, the second step is to quantify how prevalent each type of failure is. Since you can't label every single conversation, you have to build automated evaluators that translate the qualitative insights from error analysis into quantitative measurements for each type of error in your system.
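Once each conversation has labels, whether from human reviewers or automated evaluators, the prevalence question is just counting. A sketch, reusing the hypothetical annotation records from above:

```python
from collections import Counter

def failure_prevalence(annotations):
    """Return the fraction of conversations that exhibit each failure mode."""
    counts = Counter()
    for ann in annotations:
        for mode in set(ann.failure_modes):
            counts[mode] += 1
    total = len(annotations)
    return {mode: n / total for mode, n in counts.most_common()}

# On the two example records above: {"wrong_tone": 0.5, "hallucinated_fact": 0.5}
```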
There are two pitfalls to avoid here.
Over-relying on LLM judges.
People love to use LLM judges. Often, the judge prompts make no sense or are misaligned with developer expectations. If you have a bad LLM judge, you’ll get unreliable data about how prevalent failure modes are. To avoid this, you always want to rely on a code-based evaluator when you can. Code-based evaluators are also faster and cheaper.
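Plenty of failure modes don't need a judge at all: output format, length limits, required fields, banned phrases. A sketch of a code-based evaluator, assuming a hypothetical requirement that responses are valid JSON with a 'subject' and a 'body':

```python
import json

def evaluate_format(response_text: str) -> dict:
    """Deterministic checks: no LLM involved, so results are cheap and repeatable."""
    result = {"valid_json": False, "has_required_fields": False, "within_length": False}
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return result
    result["valid_json"] = isinstance(payload, dict)
    if result["valid_json"]:
        result["has_required_fields"] = {"subject", "body"} <= payload.keys()
        result["within_length"] = len(payload.get("body", "")) <= 1200  # arbitrary cap
    return result
```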
Don't leak few-shot/training examples into your test set.
If you do need to resort to an LLM judge, the next mistake you can make is including good and bad examples in your judge prompt and then testing on those same examples. You get perfect agreement on the training examples and garbage on production data. This is called data leakage.
Setting up a good LLM judge is basically building a whole separate mini AI feature, with its own prompt (you have to write a dedicated prompt for the judge) and its own dataset (around 100 responses you've manually marked as "good" or "bad").
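A minimal sketch of what that separate prompt and judge call might look like, assuming an OpenAI-style chat completions client; the grading criteria, placeholder fields and function name are invented for illustration:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading the output of a recruiting email assistant.
Mark the response GOOD if it is factually grounded in the candidate's profile
and matches the requested tone; otherwise mark it BAD.
Answer with a single word: GOOD or BAD.

Candidate profile:
{profile}

Assistant's draft:
{draft}
"""

def judge(profile: str, draft: str) -> str:
    """Return the judge's verdict ('GOOD' or 'BAD') for one draft."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # or whichever judge model you prefer
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(profile=profile, draft=draft)}],
    )
    return completion.choices[0].message.content.strip()
```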
Then you split your data into three groups:
- Training set (20% of the 100 marked responses): Used as few-shot examples in your judge prompt
- Development set (40%): Used to test and improve your judge
- Test set (40%): Held out for the final scoring; never used as prompt examples or for iteration
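A sketch of that split, assuming the ~100 labelled responses are in a list (the 20/40/40 proportions come straight from above):

```python
import random

def split_labelled_data(labelled, seed=7):
    """Shuffle once, then carve out train/dev/test so no example appears twice."""
    rng = random.Random(seed)
    shuffled = labelled[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.2 * n)
    n_dev = int(0.4 * n)
    return {
        "train": shuffled[:n_train],               # few-shot examples for the judge prompt
        "dev": shuffled[n_train:n_train + n_dev],  # iterate on the judge prompt against these
        "test": shuffled[n_train + n_dev:],        # touch only for the final agreement score
    }
```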
Once you know what kinds of failures your system exhibits and how often each one occurs, the last step is to start fixing things. The main trap here is jumping to complex architectures or automated solutions before doing the simple stuff. Start with a good prompt, run it on your data and do error analysis; once you have a baseline, then think about how to improve things. Improving things becomes much easier when you can measure them. This is where you can pull out your toolbox: prompt engineering, changing models, fine-tuning, RAG, agents, whatever you want to throw at the problem.
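Measurement is what turns "I think this prompt is better" into a decision. A sketch of a before/after comparison, assuming you can run both prompt versions over the same inputs and score each output with evaluators like the ones above (the names are placeholders):

```python
def failure_rate(outputs, evaluator):
    """Fraction of outputs an evaluator flags as failures."""
    failures = sum(1 for out in outputs if not evaluator(out))
    return failures / len(outputs)

def compare(baseline_outputs, candidate_outputs, evaluators):
    """Per-metric failure rates for the baseline vs a proposed change.

    `evaluators` is a dict of name -> callable returning True when the output
    passes; both the names and the callables stand in for your own metrics.
    """
    return {
        name: {
            "baseline": failure_rate(baseline_outputs, ev),
            "candidate": failure_rate(candidate_outputs, ev),
        }
        for name, ev in evaluators.items()
    }
```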
Just start with the simple stuff first.