Josh Pitzalis

An Introduction to Evals at the Application Layer

To visualize this, we have a basketball court.

Screenshot 2025-09-01 at 8

Blue represents shots made, and red represents shots missed.

The first property to consider is that the farther away your shot is from the basket, the harder it is to make.

Another property is that the court has boundaries. So this blue dot, although the shot goes in, is outside the court, which means it doesn't count in the game.

Let's say you built an app that tells you how many times a given letter appears in the name of a fruit. If you were to ask the app how many Rs are in 'Strawberry', it should say '3'.
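To make the example concrete, here is a minimal sketch of what an eval for this app could look like. Everything in it is an assumption for illustration: `count_letter` is a deterministic stand-in for the real app (which would presumably call a model), and `EVAL_CASES` and `run_evals` are hypothetical names, not part of any library.

```python
def count_letter(word: str, letter: str) -> int:
    """Stand-in for the app: count case-insensitive occurrences of a letter."""
    return word.lower().count(letter.lower())


# Hypothetical eval cases: (fruit, letter, expected count).
EVAL_CASES = [
    ("Strawberry", "r", 3),
    ("Banana", "a", 3),
    ("Kiwi", "z", 0),
]


def run_evals(app, cases):
    """Run each case through the app and record whether the shot went in."""
    results = []
    for fruit, letter, expected in cases:
        passed = app(fruit, letter) == expected
        results.append(((fruit, letter, expected), passed))
    return results


if __name__ == "__main__":
    for case, passed in run_evals(count_letter, EVAL_CASES):
        # "blue" is a made shot, "red" is a miss, matching the court picture.
        print(case, "blue" if passed else "red")
```

Each case is one shot on the court: a query, the expected answer, and whether the app made it.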

In the example above:

Screenshot 2025-09-01 at 8

The first trap to watch out for is wasting time on out-of-bounds queries. It's easy to spend time feeling productive while making evals for things your users don't care about. You will have enough problems with the queries your users do care about.

Screenshot 2025-09-01 at 8

The next trap to watch out for is a concentrated set of queries. Once you understand your court, you will know where the boundaries are, and you want to make sure you test across the entire court.
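One rough way to check for both traps is to tag each eval query with the part of the court it belongs to and look at the distribution. The sketch below assumes you have already assigned such tags by hand; the zone names, the `coverage_report` function, and the 50% `max_share` threshold are all hypothetical choices, not a standard.

```python
from collections import Counter


def coverage_report(eval_zones, in_bounds_zones, max_share=0.5):
    """Summarise where eval queries land on the court.

    eval_zones: one zone label per eval query.
    in_bounds_zones: the set of zones your users actually care about.
    max_share: flag any zone holding more than this fraction of all evals.
    """
    counts = Counter(eval_zones)
    total = len(eval_zones)
    return {
        # Trap 1: evals written for queries outside the court.
        "out_of_bounds": sorted(z for z in counts if z not in in_bounds_zones),
        # Trap 2: parts of the court with no evals at all...
        "untested": sorted(z for z in in_bounds_zones if counts[z] == 0),
        # ...and zones where the evals are bunched together.
        "concentrated": sorted(
            z for z, n in counts.items() if total and n / total > max_share
        ),
    }


if __name__ == "__main__":
    zones = ["close"] * 6 + ["mid"] * 2 + ["parking_lot"] * 2
    print(coverage_report(zones, {"close", "mid", "long"}))
```

The labels themselves still have to come from looking at real user queries; the code only makes the gaps visible.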

When you’re making evals, the most important step is understanding your "court".

This means collecting as much data as possible:

There is no shortcut. You have to do the work to understand what your court looks like.

Screenshot 2025-09-01 at 9

Here is an example of what your court will look like if you are doing a good job of collecting data. You should know where the boundaries are. You should be testing inside your boundaries, and you should understand where your system is blue (making its shots) and where it is red (missing them).

With an understanding like this, it's relatively easy to say that next week the team should prioritise work on that bottom-right corner. That group of queries represents something a lot of users are struggling with, and we can focus on flipping those tiles from red to blue.
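Picking that corner can be as simple as counting misses per region. This is a sketch under assumptions: the region labels, the `(region, passed)` result format, and the choice to rank by raw failure count (rather than failure rate) are all illustrative.

```python
from collections import defaultdict


def failures_by_region(results):
    """Tally total and failed evals per region.

    results: iterable of (region, passed) pairs.
    """
    stats = defaultdict(lambda: {"total": 0, "failed": 0})
    for region, passed in results:
        stats[region]["total"] += 1
        if not passed:
            stats[region]["failed"] += 1
    return dict(stats)


def worst_region(results):
    """Return the region with the most red tiles, or None if there are no results."""
    stats = failures_by_region(results)
    if not stats:
        return None
    return max(stats, key=lambda region: stats[region]["failed"])


if __name__ == "__main__":
    results = (
        [("bottom_right", False)] * 3
        + [("bottom_right", True)]
        + [("top_left", True)] * 4
        + [("center", False)]
    )
    print(worst_region(results))
    print(failures_by_region(results))
```

Ranking by raw count favours regions that many users hit; ranking by failure rate would instead surface small regions where the system almost always misses. Which one matters more depends on your court.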