The Demo Always Works
Every AI agent vendor has a demo, and every demo works. The agent answers fluently, handles the expected scenarios gracefully, and looks great in a forty-minute Zoom call. That's not because the vendor is dishonest — it's because demos are run in conditions the vendor controls.
Your job as the technical decision-maker is to evaluate what happens outside the demo. In production. Under real load. With real users asking unexpected things at 2am while OpenAI is having a partial outage and your CRM API is rate-limiting you. That's where AI agents either earn their keep or quietly burn budget while looking impressive on a dashboard.
This article is the technical conversation we'd want to have if we were on the buyer side: architecture, reliability, security, integration quality, and what it actually looks like to live with one of these systems after launch.
Architecture Questions to Ask
What is the fundamental architecture?
Most production AI agents today are some flavour of retrieval-augmented generation (RAG) wired to action-taking tools. The specifics will tell you a lot about whether the vendor has actually built this before.
Worth asking:
- How is the knowledge base structured and stored? Vector database? Which one? Why that one over the alternatives?
- How are tool calls implemented — native function calling, ReAct pattern, custom orchestration?
- How is conversation state managed across multi-turn interactions?
- What happens to context as conversations grow long? (This one catches people out. "We just send the whole transcript" works at small scale and falls over hard at production volume.)
A vendor who can answer these with specifics and deliberate reasoning — not "we just use the defaults" — has probably been here before. A vendor whose answer is "we use LangChain" without further detail has read the same tutorials as everyone else.
How does the system handle uncertainty?
Every AI agent encounters queries it can't answer confidently. What happens in those moments is the whole game.
Good systems detect uncertainty explicitly — confidence scores, self-evaluation steps, or model-graded checks — and escalate rather than guess. Bad systems hallucinate confidently, which is genuinely worse than not answering at all. A confidently wrong response to a customer question is more damaging than no response, because nobody knows to correct it.
Ask: "What does the agent do when it doesn't know?" and "How do you actually measure and detect low-confidence responses in production?" Vague answers here are the loudest signal you'll get all evaluation.
What model are you using and why?
Different LLMs have different performance profiles, cost curves, context window sizes, and rate-limit headaches. The right model depends on the use case, the volume, and the latency budget.
Be wary in two directions. Vendors who default to the most expensive frontier model for everything often haven't actually thought about cost-performance tradeoffs — and you'll find that out in the API bill. Vendors who use the cheapest model for everything are usually optimising for their own margin rather than your quality. The right answer is usually "we use Model X for the heavy reasoning steps and Model Y for the lightweight ones, and we re-evaluate quarterly." That's an engineering answer.
Reliability and Resilience Questions
What happens when the LLM API is down or slow?
OpenAI, Anthropic, and Google all have outages. We've watched all three of them. Your AI agent's reliability cannot be entirely dependent on a third-party API's uptime, particularly if it's customer-facing.
Production-grade systems have a fallback story: graceful degradation to simpler responses, queue-and-retry logic, or automatic failover to a secondary model provider. A system that just throws 500s when the LLM API has a bad afternoon is not production-ready, no matter how good the demo was.
How does the system handle rate limits?
LLM APIs have rate limits — requests per minute, tokens per day, etc. At low volume, irrelevant. At scale, hitting limits means failed requests, dropped customer interactions, and angry tickets.
Ask: do they have retry logic with proper backoff? Do they queue and prioritise? Have they negotiated higher limits with their provider, or implemented model routing to spread load? The right answer here is engineering-flavoured, not marketing-flavoured.
What is the latency profile?
"Fast enough" means different things in different contexts. A 3-second response to a support ticket is fine. A 3-second response in a real-time voice conversation is unusable.
Ask for actual numbers — p50, p95, p99 — not adjectives. P95 and p99 are where production quality lives. They tell you what your worst customer experiences look like, which is the only number that actually matters because your worst experiences are the ones that get screenshotted and posted on Twitter.
What is the error rate and how is it monitored?
Every production system has errors. What separates good systems from bad ones is whether anyone notices, and how fast.
Ask: "What's your typical error rate in production deployments?" and "How do errors get surfaced — to your team and to the client?" A vendor who can't quote rough error rates from past projects has either never run anything in production or isn't monitoring properly. Either way, it's a flag.
Security Questions
Who has access to the customer conversations?
Every conversation passing through a third-party LLM API is, by definition, transmitted to that provider. Depending on the agreement and the provider's terms at the time, that data may be used for training, may be retained for some window, and may be reviewable by provider employees under certain conditions.
For conversations containing PII, this is not a hypothetical concern. Ask specifically:
- Which LLM provider is used?
- What's the data processing agreement / business associate agreement situation?
- Is customer data used for model training? (For most enterprise OpenAI/Anthropic agreements, no — but verify in writing.)
- How long is conversation data retained, and where?
If the vendor can't answer these crisply, they haven't done the homework that your compliance team is going to ask about anyway.
How is the system protected against prompt injection?
Prompt injection — users trying to manipulate the agent into doing things it shouldn't — is a real attack vector and it's not theoretical. We've seen real attempts in real production logs.
Ask how the system handles adversarial inputs. If the answer is "it's well-tested," dig in. Specifics worth hearing: input sanitisation rules, output filtering, separation between user content and system instructions, sandboxed tool execution, and (for agents with write access) confirmation steps for high-stakes actions. A defence-in-depth answer is a good answer. A single-layer answer is a vulnerability.
What access does the agent have to your systems, and how is it controlled?
An agent that can act in your systems — update records, process transactions, send emails to customers — needs minimum-required permissions, full stop. Treat it like a service account, not a trusted human.
Review the permission model explicitly. If the agent has broader access than it strictly needs ("we just gave it admin so we didn't have to mess with scopes"), that's a security risk waiting to be exploited. Ask for a specific list: what it can read, what it can write, what actions it can trigger, and what humans approve.
Integration Quality Questions
How are integrations built and documented?
Good integrations use official APIs with proper auth (OAuth, not hardcoded credentials in a .env file someone shared on Slack), handle versioning and breaking changes, and ship with documentation your team can actually read.
Ask to see the integration code, or at minimum get an architectural walkthrough. Integrations built on webhooks and proper API clients hold up over years. Integrations cobbled together from scraping, unofficial endpoints, and three layers of custom middleware break every time anything upstream sneezes. We've inherited both kinds. The second kind is expensive to inherit.
What happens when an integrated system changes its API?
Third-party APIs change. Endpoints get deprecated. Auth flows update. What happens to your AI agent when Salesforce ships a breaking change in six months?
A robust integration has version pinning, deprecation monitoring, and a defined maintenance process. A fragile one breaks silently, gets discovered when a customer complaint surfaces, and gets fixed in a panic.
Who owns the maintenance of integrations?
Contractual question as much as technical. Get clarity upfront on who is responsible for keeping integrations alive over the years, what counts as routine maintenance versus a scope change, and at what cost. The cheapest projects are often the ones where this conversation was deferred to "later."
Maintainability and Ownership Questions
Who owns the code?
You. All of it. In a repository you control, with no proprietary platform components you can't export. If the vendor is building on something you can't take elsewhere, you are locked in for the life of the system, which is also the life of whatever future pricing changes they decide to make.
Get this in writing before work starts. Non-negotiable.
Can your team maintain it without the vendor?
Production AI agents need ongoing maintenance: knowledge base updates, prompt tuning, integration patches, monitoring. Your team should be able to do routine maintenance without being entirely dependent on the original vendor — even if you keep them on retainer for the bigger stuff.
Ask whether the system is documented, whether it uses standard tools and well-supported frameworks, and whether a competent engineer who didn't build it could pick it up and not lose their afternoon to a tour of bespoke abstractions. The truthful answer to that last question is the truest signal of code quality.
How is the system monitored in production?
You should have visibility into what the agent is doing — conversation volumes, error rates, escalation rates, response times, LLM API spend. If the vendor is the only one looking at the monitoring data, you have a visibility problem and a leverage problem at the same time.
Ask specifics: monitoring stack, dashboards, access, alerting rules, on-call expectations.
The Question That Reveals the Most
After all the technical questions, ask this one:
"What is the hardest production failure you've had with an AI agent, and how did you resolve it?"
A team with real production experience answers this immediately and with specifics. They have the story. Maybe an agent that went into a tool-calling loop and racked up an unexpected API bill. Maybe a prompt injection that caused the bot to leak system prompts. Maybe a latency spike during a viral traffic event that degraded service for an afternoon. Maybe an integration that broke silently and corrupted records for three days before anyone caught it.
A team without real production experience gives a vague answer or visibly struggles to remember a specific incident. That answer is more revealing than any pitch deck or demo. You're not hiring for an absence of failures — there's no such vendor. You're hiring for the relationship with failure.
We Answer These Questions Directly
When we engage with technical stakeholders, this level of scrutiny is what we expect — and frankly, what we prefer. We can walk through our integration architecture, discuss our security model in detail, share monitoring approaches, and put you in front of production deployments and the engineers who built them.
Talk to us about your business — bring the hard questions. We'd rather earn your confidence upfront than lose it in production.