"It Seems to Be Working" Is Not Good Enough
Most businesses that deploy an AI agent spend a lot of time on the build and very little on measurement. The agent goes live, someone checks it now and then, and the general sense is that it's helping — but nobody really knows by how much, or where it's falling short.
That's how you end up with an agent that's been live for six months and is still making the same mistakes it made in week two. Problems that could have been caught and fixed in days quietly go unaddressed because nobody was watching the right numbers.
Measurement isn't optional. It's what separates an agent that gets better over time from one that stays mediocre indefinitely.
What follows are the metrics that actually matter: what they measure, why they matter, and what a healthy number looks like.
Category 1: Volume and Deflection Metrics
These tell you how much work the agent is handling and how much is still reaching your human team.
Total Conversation Volume
What it is: The total number of conversations the agent handles in a given period.
Why it matters: It's the baseline. Everything else is measured relative to this. A spike in volume might mean a product issue is driving more queries. A drop might mean the agent is being bypassed.
What to watch for: Unexpected changes — particularly drops — that might indicate the agent is failing silently or customers have started routing around it.
Deflection Rate
What it is: The percentage of conversations the agent handles from start to finish without human involvement.
Why it matters: This is the headline ROI metric. If the agent handles 65% of queries without escalation, that's 65% of the volume your team never touches.
Healthy range: 55–75% for a well-scoped support agent after the first 90 days. Below 40% suggests scope or knowledge base problems. Above 80% in a support context sometimes indicates over-deflection — the agent is refusing to hand off when it should.
Trap to avoid: High deflection isn't automatically good. An agent that never escalates might be giving wrong answers and not knowing it. Track deflection alongside customer satisfaction, always.
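As a rough illustration, here is a minimal sketch of the deflection calculation and the bands described above. The `Conversation` record and its `escalated` field are hypothetical; your platform's export will look different, and the "borderline" label is ours for the ranges the text doesn't give a verdict on.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    conversation_id: str
    escalated: bool  # hypothetical field: True if a human took over at any point


def deflection_rate(conversations: list[Conversation]) -> float:
    """Percentage of conversations that never reached a human."""
    if not conversations:
        return 0.0
    deflected = sum(1 for c in conversations if not c.escalated)
    return 100.0 * deflected / len(conversations)


def deflection_band(rate: float) -> str:
    """Map a rate onto the bands above (support agent, after the first 90 days)."""
    if rate < 40:
        return "low: check scope and knowledge base"
    if rate > 80:
        return "high: check for over-deflection"
    if 55 <= rate <= 75:
        return "healthy"
    return "borderline"  # no verdict is given for 40-55 or 75-80


sample = [
    Conversation("c1", escalated=False),
    Conversation("c2", escalated=True),
    Conversation("c3", escalated=False),
    Conversation("c4", escalated=False),
]
rate = deflection_rate(sample)
print(f"Deflection rate: {rate:.1f}% ({deflection_band(rate)})")
```

A band label on its own proves nothing; read it next to CSAT, as the trap above warns.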
Escalation Rate
What it is: The percentage of conversations transferred to a human agent.
Why it matters: Escalation is necessary and expected — not a failure. But the rate and the reasons tell you a lot. Escalations because a query is genuinely complex are healthy. Escalations because the agent doesn't know the answer to a common question point at knowledge base gaps.
What to track: Not just the rate, but the reasons. Categorise escalations by query type. Categories that consistently escalate are your top knowledge base priorities.
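Categorising escalations can be as simple as a tally. The sketch below assumes each escalated conversation has already been tagged with a reason label (by a human reviewer or a classifier); the labels shown are made up for illustration.

```python
from collections import Counter


def escalation_breakdown(reasons: list[str]) -> list[tuple[str, float]]:
    """Return (category, share of escalations in %) sorted from most to least common."""
    counts = Counter(reasons)
    total = len(reasons)
    return [(category, 100.0 * n / total) for category, n in counts.most_common()]


# Hypothetical reason tags for one week of escalations.
week_reasons = ["billing", "billing", "refund", "billing", "shipping"]
breakdown = escalation_breakdown(week_reasons)
for category, share in breakdown:
    print(f"{category}: {share:.0f}% of escalations")
```

The categories at the top of this list are where knowledge base work is likely to pay off first.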
Category 2: Quality Metrics
These tell you whether the agent is giving good answers, not just whether it's answering at all.
First Contact Resolution Rate
What it is: The percentage of queries fully resolved in a single conversation — no follow-up needed.
Why it matters: A query that takes two or three interactions isn't actually being handled efficiently, even if the agent eventually gets there. High first contact resolution means answers are complete and accurate.
Healthy range: 70–85% for in-scope queries.
Accuracy Rate
What it is: The percentage of agent responses that are factually correct and policy-compliant.
How to measure it: Manual review of a conversation sample — typically 50–100 per week in the first 90 days, then monthly. Each response gets marked accurate, inaccurate, or partially accurate.
Why it matters: This is the hardest metric to measure at scale, which is exactly why most teams skip it. It's also the most important. An agent with a 70% deflection rate and a 20% inaccuracy rate is worse than no agent — it's confidently misleading customers at volume.
Healthy range: 90%+ for in-scope queries. Below 85% needs an immediate knowledge base review.
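One way to run the manual review is to draw a random sample each week and tally the reviewer's labels. This is a sketch under assumptions: the conversation IDs are placeholders, and counting "partial" as not accurate is a strict choice we made here, not a rule from the method above.

```python
import random
from collections import Counter
from typing import Optional


def draw_review_sample(conversation_ids: list[str], k: int, seed: Optional[int] = None) -> list[str]:
    """Pick up to k conversations at random for manual review."""
    rng = random.Random(seed)
    return rng.sample(conversation_ids, min(k, len(conversation_ids)))


def accuracy_rate(labels: list[str]) -> float:
    """Strict accuracy: only responses marked 'accurate' count as correct."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * counts["accurate"] / total


all_ids = [f"conv-{i}" for i in range(500)]  # placeholder IDs
to_review = draw_review_sample(all_ids, k=50, seed=7)

# Labels a reviewer might assign to those 50 conversations (illustrative only).
review_labels = ["accurate"] * 45 + ["partial"] * 3 + ["inaccurate"] * 2
accuracy = accuracy_rate(review_labels)
print(f"Reviewed {len(to_review)} conversations, accuracy {accuracy:.1f}%")
```

Sampling at random matters: reviewing only flagged or escalated conversations would skew the number.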
Containment Rate vs Resolution Rate
The distinction: Containment means the customer stayed in the agent conversation (didn't immediately ask for a human). Resolution means the query was actually solved. These are different.
An agent can have high containment but low resolution — customers complete the conversation but their problem isn't solved. That shows up in repeat contacts and in satisfaction scores.
Track both. The gap between them tells you whether the agent is genuinely helping or just occupying customers until they give up.
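A small sketch of the gap calculation, assuming each conversation carries two hypothetical flags: `contained` (the customer never asked for a human) and `resolved` (the query was actually solved, however you choose to confirm that).

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    contained: bool  # hypothetical flag: customer stayed with the agent
    resolved: bool   # hypothetical flag: the query was actually solved


def containment_vs_resolution(outcomes: list[Outcome]) -> tuple[float, float, float]:
    """Return (containment %, resolution %, gap in percentage points)."""
    if not outcomes:
        return 0.0, 0.0, 0.0
    total = len(outcomes)
    containment = 100.0 * sum(o.contained for o in outcomes) / total
    resolution = 100.0 * sum(o.resolved for o in outcomes) / total
    return containment, resolution, containment - resolution


# Illustrative data: 8 of 10 contained, but only 5 of 10 resolved.
outcomes = (
    [Outcome(contained=True, resolved=True)] * 5
    + [Outcome(contained=True, resolved=False)] * 3
    + [Outcome(contained=False, resolved=False)] * 2
)
containment, resolution, gap = containment_vs_resolution(outcomes)
print(f"Containment {containment:.0f}%, resolution {resolution:.0f}%, gap {gap:.0f} points")
```

In this made-up sample, three in ten customers finished the conversation without getting their problem solved.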
Category 3: Customer Experience Metrics
These tell you how customers feel about the interaction.
Customer Satisfaction Score (CSAT)
What it is: A post-conversation rating, typically 1–5 stars or a thumbs up/down, sent automatically after the conversation closes.
Why it matters: Direct customer feedback on whether the interaction was useful. This is your check on whether deflection is real resolution or just avoidance.
Healthy range: 4.0+ out of 5 for a well-calibrated agent. Below 3.5 means the agent isn't solving problems, even if it's deflecting them.
Implementation note: Keep the survey short — one question with an optional comment. Response rates fall off a cliff as you add fields.
Negative Sentiment Rate
What it is: The percentage of conversations where the customer expresses frustration, dissatisfaction, or anger during the interaction.
How to measure it: Most conversation monitoring tools can flag negative sentiment automatically across every conversation, so you don't need manual review to track it.
Why it matters: Catches problems that don't show up in CSAT because frustrated customers often don't complete the survey. A spike in negative sentiment is an early warning that something is wrong.
Repeat Contact Rate
What it is: The percentage of customers who contact the agent again within 48 hours about the same issue.
Why it matters: If someone comes back with the same query two days later, the first interaction didn't actually resolve it. High repeat contact rates mean answers are incomplete or inaccurate.
Healthy range: Under 12% for in-scope queries.
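The 48-hour rule translates into a short calculation. The sketch below assumes each contact is logged as a (customer, issue, timestamp) tuple and that "same issue" is decided by a matching issue label, which is a simplification. It reports the share of customer-issue pairs that came back within the window.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def repeat_contact_rate(contacts, window=timedelta(hours=48)) -> float:
    """contacts: iterable of (customer_id, issue, timestamp) tuples.

    Returns the percentage of (customer, issue) pairs with a follow-up
    contact within `window` of an earlier contact.
    """
    by_key = defaultdict(list)
    for customer_id, issue, ts in contacts:
        by_key[(customer_id, issue)].append(ts)
    if not by_key:
        return 0.0
    repeats = 0
    for times in by_key.values():
        times.sort()
        # Compare each contact with the one immediately before it.
        if any(later - earlier <= window for earlier, later in zip(times, times[1:])):
            repeats += 1
    return 100.0 * repeats / len(by_key)


# Illustrative log: only customer "a" returns within 48 hours.
contact_log = [
    ("a", "refund", datetime(2024, 1, 1, 9, 0)),
    ("a", "refund", datetime(2024, 1, 2, 10, 0)),
    ("b", "login", datetime(2024, 1, 1, 9, 0)),
    ("b", "login", datetime(2024, 1, 5, 9, 0)),
    ("c", "billing", datetime(2024, 1, 3, 14, 0)),
    ("d", "shipping", datetime(2024, 1, 4, 11, 0)),
]
repeat_rate = repeat_contact_rate(contact_log)
print(f"Repeat contact rate: {repeat_rate:.1f}%")
```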
Category 4: Operational Metrics
These tell you about the agent's technical performance and efficiency.
Average Response Time
What it is: The time between the customer sending a message and the agent responding.
Healthy range: Under 3 seconds for text agents. Under 1 second is excellent. Above 5 seconds starts to feel noticeably slow to users.
What causes slowness: Slow API responses, complex retrieval queries, or large context windows. A spike in response time usually points to an infrastructure problem or to prompts and context that have grown, which also shows up as higher cost.
Knowledge Base Hit Rate
What it is: The percentage of queries for which the retrieval system actually found relevant content in the knowledge base.
Why it matters: When retrieval fails — when no relevant content is found — the agent either guesses (bad) or says it doesn't know and escalates (acceptable). Low hit rates indicate gaps in the knowledge base.
Healthy range: 80%+ for in-scope query types. Gaps in specific categories point to specific content that's missing.
Cost Per Conversation
What it is: Total operating cost (LLM API calls, hosting, infrastructure) divided by conversation volume.
Why it matters: AI agents are cheap to run but not free. As volume scales, cost per conversation should stay flat or decrease. A rising cost per conversation often means inefficient prompting or unnecessarily large context windows.
Typical range: $0.02–$0.15 per conversation depending on complexity and LLM provider. Well-optimised agents stay at the low end.
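The arithmetic is simple enough to track in a spreadsheet, but here is a sketch that also checks the month-on-month direction. All dollar figures and volumes are invented for illustration.

```python
def cost_per_conversation(llm_api: float, hosting: float, infrastructure: float,
                          conversations: int) -> float:
    """Total operating cost divided by conversation volume."""
    if conversations <= 0:
        raise ValueError("conversation volume must be positive")
    return (llm_api + hosting + infrastructure) / conversations


# Hypothetical monthly figures in dollars.
monthly_cost = {
    "March": cost_per_conversation(412.50, 60.00, 27.50, conversations=10_000),
    "April": cost_per_conversation(980.00, 60.00, 40.00, conversations=18_000),
}
for month, cost in monthly_cost.items():
    print(f"{month}: ${cost:.3f} per conversation")

# Volume grew, but so did unit cost: worth a look at prompt and context size.
cost_is_rising = monthly_cost["April"] > monthly_cost["March"]
print("Unit cost rising:", cost_is_rising)
```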
Where These Metrics Mislead You
Two honest caveats before you build a dashboard around any of this. First, every metric on this list can be gamed if it becomes the only one a team chases. Push deflection rate hard enough and the agent will stop escalating things it should. Push containment and the agent will hold conversations hostage. Push accuracy and you'll end up narrowing scope until everything is easy. The metrics are only useful as a set; no single one tells you the agent is working.
Second, the early-period numbers are noisy. In the first two or three weeks, you don't have enough conversations to make real claims about CSAT or accuracy trends. We've seen teams panic at a 3.2 CSAT in week one based on twelve responses, redesign the prompt, and then watch the number jump to 4.4 in week three from organic volume alone. Resist the urge to over-tune in the first month. Set targets, then give the agent enough data to evaluate against them.
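To see how little twelve responses can tell you, it helps to put a rough interval around the average. The sketch below uses a normal approximation, which is a simplification we chose (with samples this small, a proper t-interval would be wider still), and the ratings are invented.

```python
import math
import statistics


def csat_interval(ratings: list[int], z: float = 1.96) -> tuple[float, float, float]:
    """Return (mean, low, high) for a 1-5 CSAT using a normal approximation."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    # Clamp to the 1-5 scale so the interval stays meaningful.
    return mean, max(1.0, mean - half_width), min(5.0, mean + half_width)


week_one = [1, 5, 2, 4, 3, 5, 1, 4, 3, 3, 5, 3]  # twelve made-up responses
mean_1, low_1, high_1 = csat_interval(week_one)
print(f"12 responses:  mean {mean_1:.2f}, plausible range {low_1:.2f} to {high_1:.2f}")

later = week_one * 10  # same mix of ratings, ten times the volume
mean_2, low_2, high_2 = csat_interval(later)
print(f"120 responses: mean {mean_2:.2f}, plausible range {low_2:.2f} to {high_2:.2f}")
```

The average is the same in both cases; only the amount of evidence behind it changes, and the range narrows accordingly.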
The Dashboard You Actually Need
A practical monitoring setup doesn't need a sophisticated analytics platform. The minimum viable dashboard:
| Metric | Frequency | Alert threshold |
|---|---|---|
| Deflection rate | Daily | Drop of 10%+ week-on-week |
| CSAT score | Weekly | Below 3.8 |
| Accuracy rate (sampled) | Weekly | Below 88% |
| Escalation categories | Weekly | Any category >20% of escalations |
| Repeat contact rate | Weekly | Above 15% |
| Response time | Daily | Above 4 seconds average |
| Knowledge base hit rate | Weekly | Below 75% |
Review this weekly for the first 90 days, then monthly, with automated alerts for threshold breaches.
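The alert thresholds in the table can be checked with a few lines of code, no analytics platform needed. This is a sketch: the snapshot keys are hypothetical names, the numbers are invented, and we read the "10%+" deflection drop as 10 percentage points, which is our assumption.

```python
def check_alerts(this_week: dict, last_week_deflection: float) -> list[str]:
    """Compare one week's metrics against the dashboard thresholds."""
    alerts = []
    # Assumption: the deflection drop is measured in percentage points.
    if last_week_deflection - this_week["deflection_rate"] >= 10:
        alerts.append("deflection rate dropped 10+ points week-on-week")
    if this_week["csat"] < 3.8:
        alerts.append("CSAT below 3.8")
    if this_week["accuracy_rate"] < 88:
        alerts.append("sampled accuracy below 88%")
    for category, share in this_week["escalation_shares"].items():
        if share > 20:
            alerts.append(f"escalation category '{category}' above 20% of escalations")
    if this_week["repeat_contact_rate"] > 15:
        alerts.append("repeat contact rate above 15%")
    if this_week["avg_response_seconds"] > 4:
        alerts.append("average response time above 4 seconds")
    if this_week["kb_hit_rate"] < 75:
        alerts.append("knowledge base hit rate below 75%")
    return alerts


# Hypothetical weekly snapshot (rates in %, shares in % of escalations).
snapshot = {
    "deflection_rate": 52.0,
    "csat": 4.1,
    "accuracy_rate": 91.0,
    "escalation_shares": {"billing": 34, "refunds": 18, "shipping": 16,
                          "login": 12, "returns": 10, "other": 10},
    "repeat_contact_rate": 9.0,
    "avg_response_seconds": 2.1,
    "kb_hit_rate": 71.0,
}
alerts = check_alerts(snapshot, last_week_deflection=64.0)
for alert in alerts:
    print("ALERT:", alert)
```

Run on a schedule, a check like this covers the automated-alert part of the review without anyone opening the dashboard.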
The Review Cycle That Keeps Improving Performance
Metrics are only useful if you act on them. Build a simple review process:
Weekly (first 90 days): Review the dashboard, read a sample of 20–30 conversations (including every negative-CSAT interaction), identify the top three recurring failure patterns, update the knowledge base for any gaps.
Monthly (ongoing): Review trend lines across all key metrics, decide whether scope should expand or contract, review the escalation breakdown for new knowledge base gaps, update configuration based on product or policy changes.
Quarterly: Assess overall ROI against the original business case, consider whether new use cases justify expansion, review what's changed in the broader AI landscape that might be worth adopting.
An agent reviewed and tuned on this cadence will perform meaningfully better at six months than one deployed and forgotten.
Ready to Build an Agent You Can Actually Measure?
We build measurement and monitoring into every AI agent project from the start, not bolted on at the end. The agent goes live with a dashboard, a review process, and targets you've already agreed on.
If you want to see what that could look like for your situation — and which metrics probably matter most for your use case — we'll map it out with you.
Talk to us about your business — no commitment, just a conversation.