I’m distilling some of those lessons here: the blind spots, the surprising insights, and the things that turned out to matter more than I expected.
If you’re working on AI and trying to improve it, you’ll probably find something useful here.
When you set up evaluation, you’re not just adding a score. You’re building a system that decides what “good” looks like. Most teams don’t treat it that seriously. They build evals by convenience. Automate too early. Use generic metrics. Plug in LLM-as-a-Judge with a vague prompt and hope it works. It looks scientific, but in practice it’s just noise. Eventually they end up with an AI that does great on their internal benchmark but that none of their users actually stick with. ...
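To make the contrast concrete, here is what a judge with an explicit rubric can look like, as opposed to “rate this answer from 1 to 10.” This is a minimal sketch using the OpenAI Python client; the model name, the rubric criteria, and the scoring scale are illustrative assumptions, not a recommendation.

```python
# Minimal sketch of an LLM-as-a-Judge call with an explicit rubric.
# The model name, criteria, and scale are assumptions for illustration.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading one answer from a support chatbot.

Retrieved context:
{context}

Question: {question}
Answer: {answer}

Score each criterion from 1 (bad) to 5 (good):
- "grounded": every claim is supported by the retrieved context
- "complete": the question is fully addressed
- "tone": the answer matches a helpful support voice

Reply with JSON only, like:
{{"grounded": 3, "complete": 4, "tone": 5, "reason": "one sentence"}}
"""

def judge(question: str, answer: str, context: str) -> dict:
    """Grade one answer against a fixed rubric; returns the parsed scores."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; use whatever you evaluate with
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(
                       context=context, question=question, answer=answer)}],
        temperature=0,  # keep grading as repeatable as possible
    )
    # Assumes the model returns bare JSON; add parsing guards in production.
    return json.loads(response.choices[0].message.content)
```

The point isn’t this exact rubric. It’s that each criterion names one thing you can disagree with, inspect, and fix, which a vague “is this good?” prompt never gives you.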
You’ve built a chatbot. Maybe even launched it. You’ve made changes: prompt tweaks, retrieval tricks, some reranking logic. It’s probably better than it was. But is it really getting better? Can you tell? Most teams can’t. Not because they’re bad at their jobs. Because they don’t have the right visibility. That’s what this post is for. If you’re stuck, walk through these six questions. They’ll show you what’s missing.

1. Do you have logs? Can you open up a list of recent conversations and look at what users said, what the system answered, and how the flow went? ...
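If the answer to that first question is no, the fix can be very small. Here is a minimal sketch of per-turn logging to a JSONL file; the field names and log path are my assumptions, and you’d adapt them to your stack.

```python
# Minimal sketch of per-turn conversation logging, enough to answer
# "do you have logs?". Field names and the log path are assumptions.
import json
import time
import uuid
from pathlib import Path

LOG_PATH = Path("logs/conversations.jsonl")  # hypothetical location
LOG_PATH.parent.mkdir(parents=True, exist_ok=True)

def log_turn(conversation_id: str, user_message: str, assistant_reply: str,
             metadata: dict | None = None) -> None:
    """Append one user/assistant exchange so it can be reviewed later."""
    record = {
        "id": str(uuid.uuid4()),
        "conversation_id": conversation_id,  # groups turns into one flow
        "timestamp": time.time(),
        "user": user_message,
        "assistant": assistant_reply,
        "metadata": metadata or {},  # e.g. retrieved doc ids, model version
    }
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Even this much lets you open one file and replay recent conversations end to end, which is the visibility the rest of the questions build on.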
Many teams I’ve worked with start out making rapid progress with their AI, usually a RAG system. At first, iteration is fast. Each tweak makes a visible difference. But then they hit a plateau. Changes don’t seem to make a clear impact anymore. Some changes even backfire. The AI isn’t broken, but it’s no longer clear what to fix, or even how to tell if it’s improving at all.

When Metrics Stop Making Sense

That’s when most teams turn to metrics. And that’s where things usually go wrong. ...
A few years ago, I worked on my first Generative AI project: a customer-facing AI assistant. The company had unique customer data and believed AI could transform it into a highly personalized, valuable chatbot. We built a prototype fast. Our users were excited about the first demo.

Our First Big Mistake

But there was a catch: we had almost no access to real users. With so few user tests, we had to rely on ourselves. We became our own test users. ...