Josh Pitzalis

Building Automated Evaluators

I'm working through Hamel Husain and Shreya Shankar's AI Evals For Engineers & PMs course at the moment. I wanted to process and share my current understanding of how to evaluate the AI bit when you're building an AI feature.

First instrument your application

Instead of manually checking every AI response, you need a way to review thousands of responses quickly. There are tools for this (I've used langfuse and braintrust so far) or you can build your own custom dashboard (which is useful if you use AI in a complex way that standard tools will struggle with).
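
If you do roll your own instrumentation, the core of it can be as simple as appending every model interaction to a log you can query later. Here is a minimal sketch, assuming a flat JSONL file and a hypothetical `log_interaction` helper; tools like langfuse and braintrust add tracing, sessions, and a review UI on top of this same basic idea.

```python
import json
import uuid
from datetime import datetime, timezone

LOG_PATH = "llm_traces.jsonl"  # hypothetical path, used only in this sketch

def log_interaction(user_input: str, model_response: str, metadata: dict | None = None) -> str:
    """Append one model interaction to a JSONL log for later review."""
    trace_id = str(uuid.uuid4())
    record = {
        "trace_id": trace_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input": user_input,
        "output": model_response,
        "metadata": metadata or {},  # e.g. model name, feature flag, user segment
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
    return trace_id
```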

Then establish your failure modes

Once you have data to work with, the next step is to look at your responses. Go through about 100 responses or session logs and grade them one by one. The goal here is to develop a list of all the different types of problems that exist: these are your failure modes.

It is tempting to try and automate this step, but Hamel and Shreya stress the importance of doing it manually. If you don't have a clear understanding of the problems, the quality of your data will be contaminated from this point onwards. And if you define your failure modes upfront and then look through the data for examples, confirmation bias will inflate the problems you expect to see and obscure the ones you weren't expecting.
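
The manual pass doesn't need tooling either. Here is a rough sketch of a hand-grading loop over the logged responses, assuming the JSONL log from the earlier sketch; it just shows each response and records a pass/fail label plus a free-text note about what went wrong.

```python
import csv
import json

def grade_responses(log_path: str = "llm_traces.jsonl",
                    out_path: str = "manual_review.csv",
                    limit: int = 100) -> None:
    """Step through logged responses one by one and record a label plus a note."""
    with open(log_path) as f:
        records = [json.loads(line) for line in f][:limit]

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["trace_id", "label", "note"])
        for rec in records:
            print("\n--- INPUT ---\n", rec["input"])
            print("--- OUTPUT ---\n", rec["output"])
            label = input("pass/fail? ").strip().lower()
            note = input("what went wrong (free text)? ").strip()
            writer.writerow([rec["trace_id"], label, note])
```

The free-text notes are the raw material for your failure mode list: once you have 100 of them, you can cluster them into named categories.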

Now figure out how prevalent each error type is

There are two ways to do this. The first is with code-based evaluators. This means writing conventional, deterministic code to detect an error (regex, string length checks, the presence of certain tool invocations, etc.).

When you can detect an error using deterministic code, you should. It's fast, reliable and free.
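
As a sketch of what code-based evaluators can look like, here are a few illustrative checks. The specific rules (the apology pattern, the length budget, the `search_docs` tool name) are made-up examples, not recommendations; yours should come directly from the failure modes you found.

```python
import re

def check_no_apology_opener(response: str) -> bool:
    """Fail if the response opens with a boilerplate apology (hypothetical failure mode)."""
    return not re.match(r"^\s*(i'?m sorry|i apologi[sz]e)", response, re.IGNORECASE)

def check_length(response: str, max_chars: int = 1500) -> bool:
    """Fail if the response exceeds an arbitrary length budget."""
    return len(response) <= max_chars

def check_tool_called(tool_calls: list[str], required_tool: str = "search_docs") -> bool:
    """Fail if an expected tool invocation is missing from the trace."""
    return required_tool in tool_calls
```

Run checks like these over every logged response and you get a per-failure-mode error rate for free, on every deploy.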

LLM-as-Judge Evaluators

When you need to make subjective judgments like "Is this response helpful?" or "Is the tone appropriate here?", conventional code won't cut it.

Setting up a good LLM judge involves building a mini AI feature with its own prompt (you need to write a separate prompt for your LLM judge) and its own dataset (about 100 responses manually labelled "good" or "bad").

Then you split your labelled data into three groups:

  1. A small training set, used as few-shot examples inside the judge prompt
  2. A dev set, used to iterate on the judge prompt
  3. A held-out test set, used only to measure how well the judge agrees with your labels

Now you have to iterate and improve your judge's prompt until it agrees with your labels. The goal is a True Positive Rate (TPR) and a True Negative Rate (TNR) above 90%.
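
Measuring that agreement is straightforward once you have both sets of labels. A minimal sketch, assuming the human and judge verdicts for your held-out examples are stored as parallel lists of "pass"/"fail" strings:

```python
def judge_agreement(human_labels: list[str], judge_labels: list[str]) -> dict:
    """Compare judge verdicts against human labels ("pass"/"fail") on held-out examples."""
    tp = sum(h == "pass" and j == "pass" for h, j in zip(human_labels, judge_labels))
    tn = sum(h == "fail" and j == "fail" for h, j in zip(human_labels, judge_labels))
    fn = sum(h == "pass" and j == "fail" for h, j in zip(human_labels, judge_labels))
    fp = sum(h == "fail" and j == "pass" for h, j in zip(human_labels, judge_labels))
    return {
        "TPR": tp / (tp + fn) if (tp + fn) else 0.0,  # share of human "pass" the judge also passes
        "TNR": tn / (tn + fp) if (tn + fp) else 0.0,  # share of human "fail" the judge also fails
    }
```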

A good Judge Prompt will evolve as you iterate over it, but here are some fundamentals you will need to cover:

  1. A clear task description: Exactly what you want evaluated
  2. Precise pass/fail definitions: What counts as good vs bad
  3. Few-shot examples: Show the judge 2-3 examples of good and bad responses
  4. Structured output: Ask for reasoning plus a final judgment
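
Putting those four pieces together, here is a rough sketch of what a judge can look like. It assumes the OpenAI Python SDK, and the task description, pass/fail definitions, few-shot examples, and model name are all placeholders to adapt to your own feature.

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK; swap in your own provider

JUDGE_PROMPT = """You are evaluating responses from a customer-support assistant.

Task: decide whether the response fully answers the user's question.

Pass: the response directly addresses the question and gives actionable next steps.
Fail: the response is off-topic, evasive, or offers no concrete help.

Example (pass):
Q: "How do I reset my password?"
A: "Click 'Forgot password' on the login page, then follow the emailed link."

Example (fail):
Q: "How do I reset my password?"
A: "Passwords are important for account security."

Return JSON: {{"reasoning": "<one or two sentences>", "verdict": "pass" | "fail"}}

Question: {question}
Response: {response}
"""

client = OpenAI()

def judge(question: str, response: str) -> dict:
    """Ask the judge model for reasoning plus a final pass/fail verdict."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, response=response),
        }],
    )
    # For a real pipeline you would want stricter output handling than a bare json.loads.
    return json.loads(completion.choices[0].message.content)
```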

Some other gotchas to watch out for...