Production as Your Testing Ground
The conventional wisdom in software development is simple: define requirements, build, test exhaustively, then ship. But LLM applications break this playbook.
LLMs are nondeterministic. The same input doesn't always produce the same output. This doesn't just make testing harder; it makes the entire concept of "requirements" slippery. There will be edge cases you simply won't imagine during development. There will be product behaviors that seem fine in meetings but make you cringe when you see them in the wild. There will be use cases you never anticipated because your users are more creative than your team.
And nothing reveals these gaps faster than shipping. I've seen deadlines slip by months as teams went from a pre-launch product review of a really solid tool all the way back to the drawing board because requirements were found during that product review.
But shipping without a plan is reckless. The answer isn't to spend months upfront trying to anticipate every edge case. The answer is to optimize for learning while live: design the core experience around the requirements you're confident about, establish clear decision processes for handling discoveries, and deploy monitoring that gives you visibility into what's actually happening.
The Three Pillars of Production-Ready Learning
1. Design Around Your Confident Requirements
You won't know everything. But you probably know something that you're confident about. Start there.
Build a golden dataset together. Product and engineering teams should collaborate to articulate what good looks like. This isn't a static benchmark, it's the beginning of institutional knowledge about what your application should do. Capture real examples, edge cases your team worries about, and variations that matter to your users. Throw this into a database or a spreadsheet and start tracking the application's performance on this. Add to this after launch!
Red team the application. Use humans or other AI systems to try to break what you've built. This isn't about catching all edge cases; it's about discovering the kinds of failures that are possible before your actual users stumble into them. The adversarial mindset reveals attack surfaces that standard QA never will. I've been especially impressed by Promptfoo in my modest experience with it for this use case.
Make guardrails easy to implement and update. Safety mechanisms shouldn't require a deploy cycle. Use tools and infrastructure that let you add constraints, exceptions, and behavioral rules without code changes. This means when you discover an issue at 2 AM, you can mitigate it at 2:05 AM. This also means that when you start adding functionality to a bot, you can carry over some guardrails.
Translate requirements into behavioral assertions. Stop testing for exact matches. Instead, define the properties that matter: "the response contains pricing information," "the tone is empathetic," "we don't mention competitors." LLM outputs will vary; test for what actually matters. As a litmus test for requirements: when the team finds some product behavior that feels concerning, you need to be able to write a few behavioral assertions describing the concern for it to be a requirement. Avoiding "bad vibes" is not a spec for a dev team.
2. Define Your Red Lines Before You Need Them
You will ship things knowing they're imperfect. That's not a failure of planning, that's you balancing business needs with performance. What matters is deciding in advance what imperfections you can live with and what problems require delay.
Articulate your stopping conditions. These should be concrete and actionable, not aspirational. "Poses medical harm," "violates regulatory requirements," "mentions competitor information." These are things developers can target. "Trust breaking" or "below our quality bar" are too vague to guide development and almost impossible to measure.
Err towards live mitigations. If you discover that your chatbot sometimes hallucinates product details, consider surfacing uncertainty to users and letting them confirm before acting, rather than blocking launch. If your summarization tool occasionally misses key points, add a human review step. Not every problem requires a delay, some require a containment strategy. Blocking a launch means holding off on valuable data collection. The live mitigation might give you data you need to find a better solution.
Use shadow launches to quantify the issue. Run your application in parallel to production, capturing what it would have said without showing users. This transforms a worry into data. How often does the problem actually occur? Does it affect certain topics disproportionately? Can you detect it programmatically? Only after you've answered these questions should you decide whether to delay or mitigate. An issue like undesirable tone affecting 1% of interactions is possibly not a launch blocker.
Add discoveries back to your behavioral assertions. Every requirement you discover in production gets retroactively added to your testing suite. Your testing gets smarter as your product does.
3. Learn Aggressively, Then Monitor Carefully
Production is where you get real data. The first 2-4 weeks after launch are your window to learn at velocity. After that, the focus shifts from rapid refinement to preventing regressions.
A. The Active Improvement Phase (First 2-4 Weeks)
This is when you should be obsessive about feedback loops. Set up daily or near-daily review cycles where you're actively hunting for patterns in user interactions.
Manually audit a wide sampling of conversations. You're not looking for perfection; you're looking for patterns that reveal misaligned requirements. Do you see users correcting the bot? Restarting conversations quickly? Asking follow-up clarifications? These human signals often appear in your data before metrics catch them. This is uncomfortable work (reading through real user failures), but it's also where you learn fastest.
Mine user corrections for immediate signal. When users rephrase questions, they're telling you the first response didn't work. Build a quick list of topics or question types where users are frequently rephrasing. If 30% of users ask a follow-up after asking about a specific topic, your answer isn't satisfying them. Flag these for quick iteration.
Identify canary inputs that matter. Work with product to define 10-20 critical queries your application should nail. "What are your business hours?" "How do I reset my password?" Run these manually every morning. When you're fumbling basic questions, something upstream has broken; and you'll know before your users complain.
Feed discoveries back into behavioral assertions immediately. Every surprising failure or unexpected pattern becomes a new behavioral assertion in your test suite. The goal is to build a living test suite that reflects what you've learned, so you don't regress on it.
Supplement this manual work with tools. That could be a few jupyter notebooks or even queries that you run as reports to help find problem areas, or something more mature and tailored to LLM applications, like Distributional, to help you monitor the application.
B. Steady-State Monitoring (Ongoing)
Once you've stabilized the core experience and patched the most obvious issues, shift to a monitoring posture that flags problems before they compound.
Track consistency across semantic clusters continuously. Group similar questions together: "What are your hours?" and "When are you open?" should receive substantively similar answers. If they start diverging, you have a consistency problem. Run this weekly; it catches drift before it becomes a pattern your users notice.
Monitor canary inputs on a predictable schedule. These shouldn't require daily manual review anymore; automate the execution and flag failures. These are your safety net, designed to catch upstream breakage quickly.
Maintain a regular audit cadence. Shift from daily review to weekly or bi-weekly. You're no longer hunting for unknown unknowns; you're watching for regressions and emerging patterns that your automated systems might miss.
The Bottom Line
Traditional software development assumes determinism: same input, same output, same environment. You plan comprehensively, test exhaustively, and ship when you're confident.
LLM applications demolish that playbook. You will never anticipate all edge cases. Requirements emerge in production. Real user behavior will surprise you; that's not a failure, it's how nondeterministic systems work.
The shift isn't to stop planning, it's to plan for learning. That means:
-
Design and test what you're confident about. Build a golden dataset, red team relentlessly, make guardrails easy to update, and translate vague concerns into testable assertions. Use tools like Promptfoo to systematically hunt for failure modes before users do.
-
Decide your red lines in advance. Define the concrete stopping conditions that block launch (medical harm, regulatory violations) versus the mitigations you'll accept live (surfacing uncertainty, adding review steps). Use shadow deployments to turn worries into data.
-
Obsessively monitor the first month, then establish rhythm. In the active improvement phase, audit conversations daily, hunt for patterns in user corrections, and feed everything back into your test suite. Once stabilized, shift to predictable cadences: automated canary checks, weekly consistency audits, semantic drift detection. Tools like Distributional can help surface problems you'd miss manually.
The organizations winning this transition ship imperfect, then they learn at velocity. Shipping isn't the end of testing. It's where testing actually begins.