Your AI Works in the Demo. But Is It Ready for Production?
Many AI projects never make it past the prototype stage, but a structured approach to AI quality and observability can make the difference between your project stalling and shipping. In this article, we share how we helped a client bridge that gap, built around three principles:
-
Observability must be a foundational part of your AI architecture: You cannot improve what you cannot measure.
-
Use structured test datasets and AI-driven quality scoring. This removes guess-work when making changes and enables rapid iteration.
-
Apply the same techniques to live data. Create continuous improvement loops that build confidence, control cost, and reduce risk over time.
The Context
We recently helped a client take an AI agent from prototype to production-ready. The prototype was functional. It answered questions, it looked impressive in the demo. But when we started stress-testing it for real-world use, the cracks appeared quickly.
The AI gave advice on legally sensitive topics without guardrails. It could be manipulated into talking like a pirate. It felt like a bot rather than an assistant, reactive rather than proactive, with no awareness of previous interactions. In short: it worked, but it wasn't good enough to put in front of customers.
This is not an unusual story. Industry research paints a consistent picture: fewer than half of AI projects make it from prototype to production, and the average journey takes around eight months. Gartner predicted that at least 30% of generative AI projects would be abandoned after proof of concept by the end of 2025, citing poor data quality, inadequate risk controls, escalating costs, and unclear business value as the primary culprits.
“After last year's hype, executives are impatient to see returns on GenAI investments, yet organizations are struggling to prove and realize value.”
The gap between "it works in the demo" and "it works in our business" has become one of the most expensive problems in enterprise technology. And in our experience, a significant part of that gap comes down to one thing: the absence of a structured approach to AI quality and observability.
The Problem with Non-Deterministic Systems
Traditional software is deterministic. Given the same input, you get the same output. You can write unit tests, you can define expected behaviour, and you can be confident that if it passes QA, it will work the same way in production.
AI systems built on large language models are fundamentally different. The same question asked twice may produce two different answers, both of which could be correct, partially correct, or completely wrong. The model might handle a thousand queries flawlessly and then produce something entirely unexpected on the thousand-and-first.
This non-deterministic nature makes traditional testing approaches insufficient. You cannot simply write a test that checks for the "right" answer, because there often isn't a single right answer. What you can do is systematically measure quality across multiple dimensions and track how that quality changes over time.
That is what AI observability is about: Making the invisible visible, so you can manage what you cannot fully predict.
Our Approach: Observability as a Foundation, Not an Afterthought
When we started working with this client, our first act was not to tweak prompts or swap models. It was to introduce an AI quality and observability framework into the architecture. Before you can improve something, you need to be able to measure it.
We designed test datasets that represented the full spectrum of the client's concerns. These were not generic benchmarks but scenarios drawn directly from the real-world risks and requirements of their specific use case:
-
Basic quality covered whether the AI provided accurate, helpful answers to straightforward, on-topic questions. This is the baseline: does it actually do what it is supposed to do?
-
Advanced quality tested more nuanced behaviour. Does the AI remember context from previous interactions? Can it be proactive rather than simply reactive? Does it feel like an assistant rather than a search engine?
-
Compliance addressed the legally sensitive territory the client operated in. The AI needed to know what it should and should not advise on, and handle those boundaries gracefully rather than blundering into areas that could create regulatory exposure.
-
Malicious usage examined how the system responded to deliberate attempts to circumvent its guardrails. Prompt injection, manipulation, attempts to get the AI to behave outside its intended scope. The "talking like a pirate" problem, and its more dangerous cousins.
-
Multi-language ensured quality held up across the languages the client's customers use. Quality that only exists in English is not quality in a multi-market organisation.
Measuring What Matters: LLM-as-a-Judge
With test datasets in place, we needed a way to score quality at scale. Manual review of every response is neither practical nor sustainable, particularly when you want to iterate rapidly.
This is where the LLM-as-a-judge approach comes in. The concept is straightforward: you use a large language model to evaluate the outputs of your AI system, much like a human reviewer would, but at scale and with consistency. You define evaluation criteria (accuracy, helpfulness, safety, tone) and the judge LLM scores each response against those criteria.
This is not a perfect substitute for human judgment. LLM judges have their own biases and limitations. But calibrated against a set of human-reviewed examples, they provide a reliable, repeatable quality signal that enables something crucial: the ability to test hypotheses quickly.
The Power of Fast Iteration
This is where the real value of the framework becomes apparent. Once you have quality benchmarks based on structured test datasets, every change you make to the system becomes a measurable experiment rather than a guess.
If you tweak a prompt, does quality improve? Not just for the specific scenario you had in mind, but across the board? One of the more surprising findings from this project was how frequently a small prompt change that improved one dimension of quality would quietly degrade another. Without a comprehensive quality framework, those regressions would go unnoticed until a customer experienced them.
The same framework also made it straightforward to compare different foundation models. We could evaluate not just quality but also cost and latency side by side. In this case, we discovered that significantly cheaper models delivered almost equivalent quality, though with higher latency. That kind of insight is the starting point for meaningful AI cost optimisation. Decisions based on data rather than assumptions about which model is "best."
From Development to Production: Closing the Loop
The test datasets and quality benchmarks solve the development challenge. But what happens once the system is live and handling real user queries?
This is where the same observability techniques extend into production monitoring. The quality evaluation approach that worked on test data can be applied to live interactions: scoring output quality, measuring the relevance of retrieved context in RAG architectures, and flagging interactions that fall below quality thresholds.
This distinction matters. Guardrails, meaning rules that block or modify responses in real time, protect against acute risks: a toxic response, a data leak, a compliance violation. Observability protects against chronic risks: the gradual drift in quality that happens as user behaviour evolves, as context data changes, or as the underlying model is updated. You need both, but they serve different purposes.
Production observability also enables a feedback loop that continuously improves the system. Real-world interactions that score poorly become candidates for inclusion in test datasets, expanding the coverage of your quality benchmarks over time. The system gets better because you can see where it falls short.
Why This Matters for Your Business
The business case for AI observability is not abstract. It addresses three of the most common reasons AI projects fail to move beyond the prototype stage:
-
Confidence to go live. Observability gives leadership the evidence they need to approve production deployment. Rather than relying on anecdotal testing and gut feel, you can demonstrate quality scores across every dimension that matters to the business. You can show that compliance boundaries are enforced, that the system handles adversarial inputs appropriately, and that quality is consistent across languages and use cases.
-
Control over cost. By benchmarking different models against your specific quality requirements, you avoid overspending on the most expensive option when a cheaper alternative delivers equivalent results. As you gather production data, this cost optimisation becomes increasingly precise.
-
Continuous improvement without risk. The framework creates a safe environment for iteration. You can experiment with prompt changes, context strategies, and model updates with confidence that you will catch regressions before they reach users. This accelerates the pace of improvement while reducing the risk of shipping a change that makes things worse.
The Tooling: Langfuse
The observability framework we use is built on Langfuse, an open-source LLM observability and evaluation platform. It was recently acquired by ClickHouse in a deal that valued the combined company at $15 billion, a signal of how central this capability is becoming to the AI infrastructure stack.
We chose Langfuse for several practical reasons. It is open source and can be self-hosted, which matters for clients with data sovereignty requirements. The cloud version offers EU hosting. Pricing is realistic and transparent. And critically, it has good APIs that make it straightforward to create datasets programmatically, run evaluations, and pull data out for custom reporting and dashboards, important when the built-in UI is oriented more towards engineering teams than business stakeholders.
Getting Started
If you are building or running AI-powered applications and do not yet have a structured approach to quality measurement and observability, the single most impactful step you can take is to start. Define the quality dimensions that matter for your use case. Build a representative test dataset. Establish a baseline. Everything else, prompt optimisation, model selection, cost reduction, follows from there.
With non-deterministic systems, observability is not a luxury. It is a prerequisite for production.