The Client
A UK-based fashion retailer selling through their own website and two marketplaces. Monthly order volume: 1,800–2,400 orders. Support team: two part-time staff plus the founder handling overflow.
They came to us in January with a clear problem: support was consuming 35–40 hours per week across the team, and the volume was growing faster than they could manage. Their average email response time was 11 hours. Reviews were starting to mention slow support. The founder was spending Sunday evenings answering order queries instead of working on the business.
Their ask: build something that reduces the manual support load without making the customer experience worse. That second clause mattered a lot to them — and to us.
The Scope We Agreed
Before writing a line of code, we spent a week with the client mapping their actual support volume. They gave us access to their support inbox and we categorised every email received over the previous 30 days.
The results:
| Category | Volume | % of total |
|---|---|---|
| Order status / tracking | 312 | 34% |
| Return requests | 187 | 21% |
| Product questions (size, material, care) | 143 | 16% |
| Delivery issues (damaged, missing) | 96 | 11% |
| Account queries (login, address change) | 74 | 8% |
| Other / miscellaneous | 91 | 10% |
The first four categories, 82% of volume, had clear, automatable resolution paths. We scoped the agent to handle those. We left account queries, "other / miscellaneous", and any delivery issues requiring compensation to the human team.
Agreed success metric before build: 65% deflection rate within 90 days of launch.
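The in-scope share can be reproduced from the audit counts. A minimal Python sketch of that arithmetic, using the numbers from the table (the variable and function names are ours, for illustration only):

```python
# Email counts per category, copied from the 30-day inbox audit table.
counts = {
    "Order status / tracking": 312,
    "Return requests": 187,
    "Product questions": 143,
    "Delivery issues": 96,
    "Account queries": 74,
    "Other / miscellaneous": 91,
}

# The four categories scoped for automation.
IN_SCOPE = {
    "Order status / tracking",
    "Return requests",
    "Product questions",
    "Delivery issues",
}


def share(categories, counts):
    """Return the fraction of total volume that falls in `categories`."""
    total = sum(counts.values())
    return sum(counts[c] for c in categories) / total


total_emails = sum(counts.values())
in_scope_share = share(IN_SCOPE, counts)
print(f"{total_emails} emails, {in_scope_share:.0%} in scope")  # 903 emails, 82% in scope
```

Only scoped categories can be auto-resolved, so this share is roughly the ceiling that the 65% target sits under.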
What We Built
Integration Layer
The agent integrates with:
- Shopify — for order data, tracking information, customer details, and order status
- Royal Mail and DPD APIs — for real-time tracking data
- Return portal — their existing returns management system (a third-party tool) via API
- Gmail — reading inbound support emails and sending replies via their support address
The email integration was the most complex piece. We built a system that reads new emails in the support inbox, classifies the query type, retrieves relevant context from Shopify and courier APIs, generates a response, and either sends it automatically (for high-confidence resolutions) or drafts it for human review (for lower-confidence or policy-edge cases).
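The flow described above (classify, fetch context, generate, then send or draft) can be sketched roughly as follows. This is an illustration under assumptions, not the production code: `handle_email`, `Classification`, `Outcome` and the stub stages are hypothetical names, and the real stages call Shopify, the courier APIs and a language model rather than returning canned values.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Classification:
    category: str
    confidence: float  # 0.0 to 1.0


@dataclass
class Outcome:
    action: str  # "send" (automatic) or "draft" (human review)
    reply: str


def handle_email(
    email: dict,
    classify: Callable[[dict], Classification],
    fetch_context: Callable[[dict, Classification], dict],
    generate_reply: Callable[[dict, Classification, dict], str],
    auto_send_threshold: float = 0.85,
) -> Outcome:
    """Run one inbound email through classify -> context -> reply -> route."""
    result = classify(email)
    context = fetch_context(email, result)
    reply = generate_reply(email, result, context)
    action = "send" if result.confidence >= auto_send_threshold else "draft"
    return Outcome(action=action, reply=reply)


# Stub stages so the sketch runs without any external service.
def stub_classify(email: dict) -> Classification:
    return Classification(category="order_status", confidence=0.93)


def stub_context(email: dict, result: Classification) -> dict:
    return {"tracking_url": "https://example.com/track/ABC123"}


def stub_reply(email: dict, result: Classification, context: dict) -> str:
    return f"Hi {email['name']}, you can track your order here: {context['tracking_url']}"


outcome = handle_email(
    {"name": "Sam", "subject": "Where is my order?", "body": "..."},
    stub_classify,
    stub_context,
    stub_reply,
)
print(outcome.action)  # send
```

Passing the stages in as functions keeps each integration replaceable and lets the routing logic be tested without touching a live inbox.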
The Classification System
Every incoming email is classified before any response is attempted. The classifier uses the email subject, body, and any Shopify order data linked to the customer's email address to determine:
- Query category (from our taxonomy above)
- Confidence score (how certain we are about the classification)
- Recommended action (auto-respond, draft for review, escalate immediately)
Auto-respond threshold: 85%+ confidence on categories we know the agent handles well (order status, return eligibility). Below that, it drafts for human review. The threshold was deliberately conservative — we'd rather have a human approve a perfect response than send a confident wrong one.
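As a rough sketch of how the three recommended actions could be derived from a category and a confidence score: the 0.85 threshold and the two auto-respond categories come from the description above, while the category labels and the rule for immediate escalation are assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    AUTO_RESPOND = "auto_respond"
    DRAFT_FOR_REVIEW = "draft_for_review"
    ESCALATE = "escalate"


AUTO_RESPOND_THRESHOLD = 0.85
# Categories the agent is trusted to answer unaided (labels are hypothetical).
AUTO_RESPOND_CATEGORIES = {"order_status", "return_eligibility"}
# Assumption: out-of-scope categories skip drafting and go straight to a human.
ESCALATE_CATEGORIES = {"other"}


def recommend_action(category: str, confidence: float) -> Action:
    """Map a classification to one of the three recommended actions."""
    if category in ESCALATE_CATEGORIES:
        return Action.ESCALATE
    if category in AUTO_RESPOND_CATEGORIES and confidence >= AUTO_RESPOND_THRESHOLD:
        return Action.AUTO_RESPOND
    # Everything else, including confident results in less proven categories,
    # is drafted for a human to approve.
    return Action.DRAFT_FOR_REVIEW


print(recommend_action("order_status", 0.91))       # Action.AUTO_RESPOND
print(recommend_action("order_status", 0.70))       # Action.DRAFT_FOR_REVIEW
print(recommend_action("product_question", 0.95))   # Action.DRAFT_FOR_REVIEW
```

Defaulting to the draft path means a new or poorly understood category never auto-sends until it is explicitly added to the allow-list.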
Response Generation
For order status queries: the agent retrieves the order from Shopify, gets the current tracking status from the relevant courier API, and generates a response that includes the specific tracking link, last scan location, and estimated delivery date. The response is personalised — it uses the customer's name and references the specific items ordered.
For return requests: the agent checks the delivery date against the return policy (28 days from delivery), determines eligibility, and if the order is eligible, generates a return label from the returns portal and sends it with instructions. If the request falls outside the window, it explains why and provides the contact for exception requests.
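The eligibility check itself is simple date arithmetic. A minimal sketch, assuming the delivery date is already known from the courier data and that the final day of the 28-day window still counts (the function names are hypothetical):

```python
from datetime import date, timedelta

RETURN_WINDOW_DAYS = 28  # from the stated policy: 28 days from delivery


def return_deadline(delivered_on: date, window_days: int = RETURN_WINDOW_DAYS) -> date:
    """Last date on which a return request is still accepted."""
    return delivered_on + timedelta(days=window_days)


def is_return_eligible(delivered_on: date, requested_on: date) -> bool:
    # Assumption: the final day of the window is still eligible.
    return delivered_on <= requested_on <= return_deadline(delivered_on)


delivered = date(2024, 3, 1)
print(return_deadline(delivered))                         # 2024-03-29
print(is_return_eligible(delivered, date(2024, 3, 29)))   # True
print(is_return_eligible(delivered, date(2024, 3, 30)))   # False
```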
For product questions: the agent searches the product catalogue for the relevant item and answers from the product specifications. Size guide queries reference the actual measurement tables. Care instructions come from the product metadata.
Tone and Style Calibration
We spent more time on this than clients typically expect. The client had a distinct brand voice — warm, direct, slightly irreverent. Generic AI responses would have felt off-brand and undermined the customer experience they'd worked hard to build.
We assembled 50 examples of good and bad responses from the existing support inbox, with annotations explaining what made each one good or bad. That set informed the prompt design and gave us a reference set for tone evaluation during testing.
The Human Review Queue
Not everything is auto-sent. The agent drafts responses for:
- Classifications below the confidence threshold
- Return requests outside standard policy (partial returns, condition disputes)
- Delivery issues involving potential compensation
- Any email containing words indicating distress or dissatisfaction with the brand
The human team sees a queue in their inbox tool with draft responses pre-populated. For most drafts, they read, approve, and send in under 30 seconds. The workload shifts from reading-researching-writing to reviewing-and-approving.
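The four review triggers listed above could be expressed as a single check that returns the reasons a draft was held back. This is a sketch under assumptions: the distress term list is invented for illustration, and the policy and compensation flags are assumed to be set by earlier steps.

```python
import re

# Hypothetical term list; the production list is not given in this write-up.
DISTRESS_TERMS = ("disappointed", "unacceptable", "furious", "appalling", "never again")
DISTRESS_RE = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in DISTRESS_TERMS) + r")\b",
    re.IGNORECASE,
)


def review_reasons(
    body: str,
    confidence: float,
    threshold: float = 0.85,
    outside_return_policy: bool = False,
    possible_compensation: bool = False,
) -> list[str]:
    """Return every reason a draft should go to the human queue (empty = auto-send)."""
    reasons = []
    if confidence < threshold:
        reasons.append("low_confidence")
    if outside_return_policy:
        reasons.append("outside_return_policy")
    if possible_compensation:
        reasons.append("possible_compensation")
    if DISTRESS_RE.search(body):
        reasons.append("distress_language")
    return reasons


print(review_reasons("Where is my parcel?", 0.95))                 # []
print(review_reasons("I am really DISAPPOINTED with this.", 0.95))  # ['distress_language']
```

Returning the reasons rather than a bare true/false gives the reviewer context alongside the draft, which helps keep the approve-and-send step short.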
What Broke in the First Three Weeks
Problem 1: Marketplace order numbers. Customers who ordered through the marketplaces (not the direct site) quoted marketplace order numbers in their emails. Our Shopify integration looked up orders by Shopify order ID or customer email, and the marketplace order numbers didn't match any Shopify order ID.
Fix: Added a mapping layer that extracts marketplace order numbers from email body text and looks them up via the marketplace APIs.
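The extraction half of that mapping layer might look something like the sketch below. The two number formats are assumptions chosen for illustration (the marketplaces are not named here), and the subsequent marketplace API lookup is omitted.

```python
import re

# Hypothetical order-number formats, one per marketplace channel.
ORDER_PATTERNS = {
    "marketplace_a": re.compile(r"\b\d{3}-\d{7}-\d{7}\b"),
    "marketplace_b": re.compile(r"\b\d{2}-\d{5}-\d{5}\b"),
}


def extract_marketplace_order_ids(body: str) -> list[tuple[str, str]]:
    """Return (channel, order_number) pairs found in the email body."""
    found = []
    for channel, pattern in ORDER_PATTERNS.items():
        for match in pattern.findall(body):
            found.append((channel, match))
    return found


email_body = "Hi, my order 123-1234567-1234567 still hasn't arrived. Can you help?"
print(extract_marketplace_order_ids(email_body))
# [('marketplace_a', '123-1234567-1234567')]
```

Tagging each match with its channel tells the next step which marketplace API to query.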
Problem 2: Bundle product descriptions. Some orders included bundle items — three products listed as one SKU. The agent was describing the bundle SKU number when asked about specific items within the bundle, which was confusing for customers.
Fix: Updated the product catalogue mapping to expand bundle SKUs into their component products before the response generation step.
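A minimal sketch of that expansion step, assuming a simple lookup table from bundle SKU to component SKUs. The SKUs and field names here are invented for illustration.

```python
# Hypothetical bundle mapping: one bundle SKU -> its component product SKUs.
BUNDLES = {
    "BUNDLE-TEE-3": ["TEE-WHT-M", "TEE-BLK-M", "TEE-GRY-M"],
}


def expand_line_items(line_items: list[dict], bundles: dict[str, list[str]]) -> list[dict]:
    """Replace each bundle line item with one line item per component product."""
    expanded = []
    for item in line_items:
        components = bundles.get(item["sku"])
        if components is None:
            expanded.append(item)  # ordinary product, keep as-is
            continue
        for sku in components:
            # Keep a reference to the parent bundle for auditing.
            expanded.append({"sku": sku, "quantity": item["quantity"], "bundle": item["sku"]})
    return expanded


order = [
    {"sku": "BUNDLE-TEE-3", "quantity": 1},
    {"sku": "SCARF-RED", "quantity": 2},
]
expanded_order = expand_line_items(order, BUNDLES)
print([item["sku"] for item in expanded_order])
# ['TEE-WHT-M', 'TEE-BLK-M', 'TEE-GRY-M', 'SCARF-RED']
```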
Problem 3: Overly formal tone on escalations. When the agent escalated a query to the human queue, it sent the customer a holding message. The initial version sounded corporate and cold — "Your enquiry has been received and will be addressed by a member of our team."
The founder flagged this immediately: "That sounds like it's from a bank, not us."
Fix: Rewrote the holding messages in the client's brand voice. Small change, big difference in how escalations felt to customers.
None of these were catastrophic, but all three reinforced something we already believed: shadow mode and tight monitoring in the first month catch the problems that demos never will.
The Results at 90 Days
| Metric | Before | After (90 days) | Change |
|---|---|---|---|
| Auto-resolved without human | ~5% | 71% | +66pp |
| Average first response time | 11.2 hours | 4.1 minutes | -99% |
| Weekly support hours (human) | 37–40 hours | 9–12 hours | -74% |
| Customer satisfaction (CSAT) | 3.9 / 5 | 4.5 / 5 | +15% |
| Negative reviews mentioning support | 3–4/month | 0–1/month | -75% |
We exceeded the agreed 65% deflection target by week eight (68%) and reached 71% by week twelve as we tuned the confidence thresholds and expanded product question coverage.
The CSAT improvement was the most surprising result. We'd expected deflection to improve satisfaction (faster responses) but not to that degree. Post-survey comments consistently mentioned speed — "got an answer in minutes, amazing" — and personalisation — "the reply actually referenced my specific order."
The founder's Sunday evenings are no longer spent in the support inbox.
The Cost and Payback
Build cost: £9,500
Monthly running cost: £220 (hosting, API costs, email processing)
Monthly maintenance retainer: £650 (weekly review, prompt updates as the product catalogue changes, integration maintenance)
Total first-year cost: £9,500 + (12 × £870) = £19,940
Value from time saved: 25–28 hours per week recovered at an effective rate of £18/hour = approximately £24,000 per year in recovered productive time.
Additional value: Reduction in negative reviews and associated brand damage. Not easily quantifiable but clearly meaningful for a direct-to-consumer brand.
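Payback: approximately 5 months on the build cost alone (£9,500 against roughly £2,000 per month of recovered time). Net of the £870 monthly running and retainer costs, payback is closer to 8 months.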
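The cost and payback figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this section. The 52-week year and the two ways of counting payback (against the build cost alone, or net of the monthly costs) are our assumptions about how to frame the calculation.

```python
BUILD_COST = 9_500       # one-off, GBP
MONTHLY_RUNNING = 220    # hosting, API costs, email processing
MONTHLY_RETAINER = 650   # maintenance retainer
HOURLY_RATE = 18         # effective rate for recovered time
WEEKS_PER_YEAR = 52      # assumption: savings accrue every week of the year

monthly_cost = MONTHLY_RUNNING + MONTHLY_RETAINER
first_year_cost = BUILD_COST + 12 * monthly_cost


def annual_value(hours_per_week: float) -> float:
    """Yearly value of recovered staff time."""
    return hours_per_week * HOURLY_RATE * WEEKS_PER_YEAR


def payback_months(yearly_value: float, net_of_monthly_costs: bool) -> float:
    """Months for recovered value to cover the build cost."""
    monthly_value = yearly_value / 12
    if net_of_monthly_costs:
        monthly_value -= monthly_cost
    return BUILD_COST / monthly_value


print(first_year_cost)                            # 19940
print(annual_value(25), annual_value(28))         # 23400 26208
print(round(payback_months(24_000, False), 2))    # 4.75
print(round(payback_months(24_000, True), 2))     # 8.41
```

The gap between the two payback figures is why it is worth stating which costs a payback claim includes.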
What We Would Do Differently
Start with better tone calibration. We got to the right place on tone, but the holding-message issue should have been caught in testing, not after launch. Tone review across every message type — not just the primary response — belongs on the pre-launch checklist.
Build the marketplace order lookup earlier. It was a predictable requirement given their sales channels. We should have scoped it into the build rather than retrofitting it in week two.
Set up monitoring dashboards on day one. We had logging from launch, but the client-facing dashboard took three weeks to build. The first three weeks of operational data were available but not easily visible to the client. Early visibility accelerates the tuning cycle and we know that now.
A fair caveat: this kind of result is realistic for businesses with clean order data and well-defined policies. If the underlying systems are messy — duplicate customers, inconsistent product data, undocumented exceptions — you'll spend half the project cleaning that up, and the deflection numbers come more slowly.
If This Sounds Like Your Business
The pattern we built — email triage, order lookup, return processing, auto-response with human review queue — is repeatable across e-commerce businesses at this scale. The specific integrations vary; the architecture is consistent.
Talk to us about your support volume — we'll map your current inbox against what the agent can handle and give you a realistic projection of what deflection rate you could expect. If the numbers don't work, we'll tell you.