We Built an Impressive AI – But It Was Useless
A few years ago, I worked on my first Generative AI project: a customer-facing AI assistant. The company had unique customer data and believed AI could transform it into a highly personalized, valuable chatbot.
We built a prototype fast. Our users were excited about the first demo.
We Were Not Our Users
But there was a catch: we had almost no access to real users. With so few user tests, we had to rely on ourselves—we became our own test users.
We tried to think, behave, and react like our users. At first, it seemed to work well, at least for us. We spotted issues, fixed them, and each iteration felt like progress.
Then we put the chatbot in front of real users. They barely noticed the difference.
We had solved real issues. But were they the right ones? Not really.
We Kept Iterating—But Nothing Changed
Over the following months, we started seeing patterns in user feedback, but every iteration still felt slow and uncertain.
During the long gaps between user tests, we had no choice but to guess whether our changes would make a real difference.
We still didn’t know how to properly evaluate it. We kept wondering: is it really working, or only on our pre-selected examples?
Eventually, our AI delivered that initial “wow” effect. Users tried it—but they didn’t come back.
Technically, the AI performed well. But none of that mattered, because it wasn’t aligned with what customers actually needed.
We thought we were listening carefully to our users. Not enough.
We were gathering feedback, but too slowly and without a clear way to turn it into meaningful changes. That’s what made the difference.
Success = Great Feedback Loop
Shorten the Loop!
Our feedback loop was painfully slow. At the time, we told ourselves it was out of our control, but looking back, that’s exactly where we should have focused.
Later, I realized something crucial: even if direct user feedback isn’t always available, you can create a proxy if you choose the right metrics.
Learn What Truly Matters
This sounds great on paper, but it’s much harder in practice. We tried to set up metrics too, but had no idea how to do it properly.
Let’s dig into it!
The hardest part isn’t just building metrics; it’s figuring out which ones actually matter. A metric is only useful if it tightly correlates with actual outcomes.
The only reliable way to learn this is by talking to users and looking at your data. Skipping this step almost guarantees you’ll optimize for the wrong goals.
Don’t assume you know the right metrics upfront. Uncover them through iteration, so they stay grounded in reality.
Your job isn’t to assume the problem—it’s to uncover it. Unless you are your own ideal customer (ICP: Ideal Customer Profile), you don’t truly feel their pain. Your goal is to learn what actually matters, then design the right solution around it.
This isn’t a one-shot process where you define metrics once and move on. Your understanding of success will evolve over time, so you should treat this as an ongoing, iterative process.
Early in a project, you’ll spend more time refining what matters, and then less over time as your understanding deepens.
Set Up the Right Metrics
Alongside gathering direct user feedback, you should identify strong proxies for success.
The most reliable metrics are those that can be computed deterministically—meaning they provide clear, objective measurements.
For example, suppose user testing reveals that the ideal summary length for your use case is between 300 and 500 words. While this alone doesn’t guarantee quality, it’s an easy deterministic check that helps filter out summaries that are either too short or too long.
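In code, a check like this is only a few lines. Here’s a minimal sketch in Python, using the 300 to 500 word range from the example above; the whitespace-based word count is a simplification you’d adapt to your own pipeline:

```python
import re

# Thresholds taken from the example above: summaries shorter than 300 words
# or longer than 500 fail the check. Splitting on whitespace is a
# simplification; swap in your own tokenizer if you need something stricter.
MIN_WORDS, MAX_WORDS = 300, 500

def summary_length_ok(summary: str) -> bool:
    """Deterministic check: is this summary inside the target word range?"""
    word_count = len(re.findall(r"\S+", summary))
    return MIN_WORDS <= word_count <= MAX_WORDS

def length_pass_rate(summaries: list[str]) -> float:
    """Share of generated summaries that pass the length check."""
    return sum(summary_length_ok(s) for s in summaries) / len(summaries)
```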
Now consider a less tangible requirement: “The output should be formal and corporate-friendly.” Unlike length, there’s no direct way to measure this, so how do you approximate it?
The best approach is to start with human evaluation: manually reviewing outputs, identifying recurring patterns, and gradually building an approximation based on real-world examples. It’s a big and complex topic, so I’ll dedicate a separate article to it.
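As a rough illustration (the marker list and names below are hypothetical, not taken from any real project), you could encode the patterns reviewers flag most often as a simple heuristic, then measure how often it agrees with their verdicts before trusting it as a proxy:

```python
# Hypothetical sketch: approximate "formal and corporate-friendly" with a
# heuristic distilled from recurring reviewer comments, then check how often
# it agrees with human labels before relying on it.
INFORMAL_MARKERS = {"lol", "gonna", "wanna", "btw", "hey there", "!!"}

def looks_formal(text: str) -> bool:
    """True if none of the informal markers appear in the output."""
    lowered = text.lower()
    return not any(marker in lowered for marker in INFORMAL_MARKERS)

def agreement_with_humans(outputs: list[str], human_labels: list[bool]) -> float:
    """Fraction of outputs where the heuristic matches the human verdict."""
    matches = sum(looks_formal(o) == label for o, label in zip(outputs, human_labels))
    return matches / len(outputs)
```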
Once you have a clear idea of what matters, and a reliable way to approximate it, you can iterate much faster and with far more confidence.
Always remember: Metrics are only useful if they are tightly aligned with real-world success. At any stage, you may be optimizing the wrong thing, or measuring it the wrong way. Metrics are a means to an end, not the end itself, so continuously cross-check them against real-world impact.
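One lightweight way to do that cross-check, sketched here with placeholder numbers, is to periodically compare your proxy metric against a signal users actually generate, such as ratings or return visits, and treat a weak correlation as a warning that you’re measuring the wrong thing:

```python
from statistics import correlation  # Python 3.10+

# Placeholder numbers for illustration only: metric_scores from your automated
# checks, outcome_scores from a real-world signal such as user ratings.
metric_scores = [0.9, 0.4, 0.7, 0.2, 0.8]
outcome_scores = [4.5, 2.0, 3.8, 1.5, 4.1]

# A weak or negative correlation suggests the metric is drifting away from
# real-world success and needs to be revisited.
print(f"metric vs. outcome correlation: {correlation(metric_scores, outcome_scores):.2f}")
```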
Your Competitive Advantage Is Not Your Architecture or Prompt
The real challenge for AI engineers isn’t optimizing models—it’s defining and approximating success. Once you know what success looks like and can measure it reliably, improving your AI becomes surprisingly easy.
This is the silent trap that catches many teams. They think they’re improving AI when, in reality, they’re just making it more complex.
Building a complex solution is easy. Building the right system that lets you scale AI quality effectively is not.
Your model, prompt, and architecture should merely be seen as byproducts of your deep understanding of users and your ability to measure what truly matters.
Hand the same model to a competitor, and they won’t be able to iterate effectively, because they don’t have the deep understanding you built.
If you are truly looking for a competitive edge, don’t invest in the byproduct; invest in the factory.