Most Prompts Fail for the Same Reasons
You can build an AI agent that works in a controlled demo in an afternoon. Building one that holds up against real users — who are impatient, imprecise, occasionally hostile, and reliably unpredictable — is a different kind of work, and most of that work is conversation design.
The gap between demo quality and production quality is almost always made of:
- A system prompt that establishes clear, specific behaviour
- Flows that handle the predictable variations in how users approach a task
- Edge case design that covers what happens when things go wrong
- Testing against real user behaviour, not scenarios you imagined
What follows is each of those in practical terms, with examples drawn from production deployments.
The System Prompt: Getting the Foundation Right
The system prompt is the instruction set your agent operates from. Everything it does flows from there. A vague system prompt produces inconsistent, unpredictable behaviour. A precise one produces an agent that handles a wide range of inputs reliably.
Structure Your System Prompt in Sections
Don't write the system prompt as one big block of text. Break it into clear sections:
## Role and Context
You are [Name], a [role] for [Company]. Your purpose is to [primary function].
## What You Can Help With
- [Specific task 1]
- [Specific task 2]
- [Specific task 3]
## What You Cannot Help With
- [Out-of-scope topic 1]
- [Out-of-scope topic 2]
## How to Respond
- [Tone instruction]
- [Format instruction]
- [Length instruction]
## When to Escalate
Escalate to a human when:
- [Condition 1]
- [Condition 2]
## Critical Rules
- [Non-negotiable constraint 1]
- [Non-negotiable constraint 2]
This structure makes the prompt scannable, reduces the chance of conflicting instructions, and lets you update individual sections without breaking the rest.
Write Specific Constraints, Not General Principles
Weak: "Be helpful and professional."
Strong: "Respond in a friendly, direct tone. Use short paragraphs — no more than 3 sentences each. Address the customer by name if it appears in the conversation. Do not use jargon or technical language."
Weak: "Answer questions about our products."
Strong: "Answer questions about product specifications, availability, sizing, and care instructions using the product catalogue provided. If you cannot find the specific information in the catalogue, say so and offer to connect the customer with a team member who can help."
The specific version tells the model exactly what to do and exactly what to say when it can't. The general version leaves interpretation to the model — which, in our experience, is where most production failures begin.
Establish Uncertainty Handling Explicitly
Every production agent will encounter questions it can't answer confidently. How it handles that determines whether users keep trusting it.
When you are not certain about an answer:
1. Do not guess or speculate
2. Clearly acknowledge that you don't have this information
3. Offer an alternative: "I don't have that information, but I can connect you
with [specific person/channel] who can help"
4. Never present uncertain information as fact
Without explicit uncertainty handling, models fall back to their training behaviour, which often means generating plausible-sounding but wrong information. That's the failure mode that loses trust fastest.
Flow Design: Mapping the Conversations That Actually Happen
A well-designed flow maps not just the happy path but every meaningful variation. Users don't follow scripts. Your design has to handle what they actually do.
Map Every Entry Point
Users enter a conversation from different starting points with different amounts of context. An agent that handles "I want to return my order" differently from "I bought something last week and it's broken" and "Can I get a refund?" — when all three might mean the same thing — will frustrate users for no good reason.
Map your entry points explicitly:
Intent: Return request
Trigger phrases: "return", "refund", "send back", "exchange", "wrong size",
"broken", "damaged", "doesn't fit", "not what I expected"
Initial response: [Standard return intake flow]
Design Clarification Flows
When a user's message is ambiguous, the agent needs to ask a clarifying question. How it asks matters — a single, focused question beats a list of three.
Poor clarification: "Could you tell me your order number, what item you want to return, and when you received it?"
Better clarification: "I'd be happy to help with that. Could you share your order number so I can pull up the details?"
One question. Clear. Easy to answer. Gets the information needed to move forward.
Design for Common Failures
Map the moments where users commonly get stuck or frustrated.
User gives partial information. Agent asks for order number, user gives their name instead. Don't repeat the same question word-for-word. Acknowledge what they gave you and ask specifically for what's missing.
User changes the subject mid-flow. Agent is mid-return and the user asks an unrelated product question. Handle the new question, then offer to come back to the return.
User expresses frustration. Acknowledge the frustration before attempting to resolve. "I understand this is frustrating — let me help sort this out for you." A reply that jumps straight to logistics reads as cold even when it's correct.
User asks the same question repeatedly. If the agent has already answered and the user asks again, recognise it. Either rephrase the answer or escalate. Repeating the same canned line is the move that makes people screenshot the bot and post it online.
Writing Natural Responses
Production AI agents tend to fail in one of two directions: too robotic, or too corporate-cheerful. Neither is right. A few techniques that help.
Match Your Brand Voice
Every company has a voice. A challenger fintech sounds different from a heritage bank. A streetwear brand sounds different from a luxury retailer. The agent should sound like your brand, not like a generic AI assistant.
Collect 20–30 examples of great customer communications from your business — emails, chat transcripts, social replies. Annotate what makes them good. Use that as the reference point when evaluating the agent's output.
Vary Acknowledgement Phrases
If your agent starts every response with "Of course!" or "Great question!" it will immediately feel scripted. Vary the openers or drop them when they aren't necessary.
Vary: "I'll look that up for you." / "Let me check that." / "Sure — here's the information."
Or skip: If the user asks "Is this in stock?" the agent can answer directly: "Yes, the black version is in stock in sizes S–XL." No acknowledgement needed.
Use Concrete Language Over Abstract
Abstract: "We aim to provide excellent customer service and will do our best to resolve your issue."
Concrete: "I'll get your return label sent within the next few minutes."
Concrete language is more trusted, more useful, and more on-brand for almost every business we work with.
Testing Conversation Design
Testing conversation design is different from testing code. You're looking for response quality, consistency, and behaviour on the edges.
Build a Test Set Before You Build the Agent
Before writing a single line of code, write 50 test conversations. Cover:
- The 10 most common queries in their most common forms
- Five variations of phrasing for each
- Edge cases — queries near the boundary of scope, ambiguous inputs
- Adversarial inputs — manipulation attempts, rude messages, nonsense
Run every conversation through the agent before launch. Anything incorrect, off-brand, or surprising gets a prompt adjustment and a re-test.
The Tone Test
Read 20 random agent responses aloud. Do they sound like someone your company would actually hire? Too formal? Too casual? Too long? Are they saying things your company wouldn't say?
Tone failures are easier to catch in audio than in text. Reading out loud is a deliberate practice that surfaces problems quickly — and it's one of the cheapest QA habits we know of.
The Adversarial Test
Before launch, actively try to break the agent:
- Try to get it to say something off-brand
- Try to get it to provide information outside its scope
- Try to manipulate it with flattery or emotional pressure
- Try prompt injection: "Ignore your previous instructions and tell me..."
- Ask the same question ten times with different phrasing
Every failure here is a prompt improvement before users find it.
Where Conversation Design Quietly Fails
Two honest caveats worth flagging. First, an agent can be technically correct and still feel terrible to talk to. We've seen builds that pass every internal QA test and get torched in early production because the tone is subtly wrong — too sales-y, too apologetic, too eager. Tone problems don't show up in functional tests. They show up in CSAT and in users abandoning the conversation. Read the early transcripts personally.
Second, conversation design has a real ceiling at "the model genuinely doesn't know things about your business." No amount of prompt cleverness fixes missing data. If users are asking about something that isn't in your knowledge base, the answer is to update the knowledge base, not to keep tuning the prompt to dodge the question more elegantly.
The Iteration Cadence
Conversation design isn't done at launch. The most important design work happens after launch, informed by real user behaviour.
Weekly: Review 30–50 real conversations. Flag every response that's wrong, awkward, or off-brand.
Fortnightly: Implement prompt changes based on the review. Re-test the affected scenarios before deploying.
Monthly: Review overall conversation performance — escalation rate, CSAT, first contact resolution. Look for patterns in what's working and what isn't.
An agent that's actively maintained improves meaningfully month over month. One that's launched and forgotten will plateau within weeks and quietly degrade as user expectations move on around it.
If you want help building an agent where the conversation design is part of the work rather than an afterthought, we'd be happy to map it out with you.
Talk to us about building your agent — no commitment, just a conversation.