The Market Is Noisy and the Stakes Are Real
AI agent development is a crowded market right now, and every vendor's pitch sounds basically the same: polished proposals, impressive demos, and the same confident claims about "production-grade systems" and "end-to-end automation." Telling real capability apart from well-packaged mediocrity is hard, and you usually only find out which one you've bought several months and a lot of money in.
The stakes are real. A bad vendor choice costs you the budget, the timeline, and — if the agent ever made it in front of users — some of your customers' trust. A good choice gets you a production system that improves over time and a relationship you can keep building on. The difference between those two outcomes is almost entirely about how seriously you ran the evaluation up front.
What follows is the framework we'd use if we were on the buyer side. None of it is exotic, but most of it gets skipped under deadline pressure, and that's how the wrong vendor wins.
Phase 1: Qualification (Before You Request Proposals)
Define Your Requirements First
Before approaching any vendor, write a clear requirements document. What are you trying to build? What does it need to do? What systems does it need to touch? What does success look like, in numbers?
This sounds obvious. Almost nobody does it. The typical first vendor conversation goes "we want an AI agent for our customer support," and then the rest of the evaluation drifts because nobody anchored it to a spec. When you give every vendor the same written brief, two useful things happen: the proposals get sharper, and you finally have an apples-to-apples basis for comparison. Without the brief, every proposal is shaped by what that vendor wants to sell you. With it, you start to see who actually read the document.
Minimum Qualification Criteria
Before you invest time in a deep evaluation, screen vendors against these:
Production references. Can they show you an AI agent currently running in production: not a demo video, not a case study, but a live system you can poke at? If not, remove them from the list. There is too much demo-grade work being sold as production right now.
Relevant technical stack. Are they building with the right tools for your use case? A team whose entire body of work is GPT-wrapper chatbots is not ready to build a multi-integration agent on your CRM. Skill transfers, but not as much as vendor websites would suggest.
Appropriate scale. A solo freelancer and a 200-person consultancy have very different risk profiles. Neither is universally better; they suit different projects. Match scale to project, and be honest about which one you actually need.
Compliance awareness. If you're in financial services, health, legal, or public sector, has the vendor actually built in a regulated context? Can they describe the compliance implications of your use case without Googling them mid-call? This one weeds out a surprising number of contenders.
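If you track candidates in a spreadsheet or script, the screen above is a simple pass/fail filter. A minimal sketch in Python, assuming you record one yes/no answer per criterion; the `Vendor` fields, the `qualifies` helper, and the sample vendors are hypothetical names used for illustration only:

```python
from dataclasses import dataclass


@dataclass
class Vendor:
    name: str
    live_production_reference: bool  # a live system, not a demo or case study
    relevant_stack: bool             # has built with tools that fit your use case
    appropriate_scale: bool          # team size matches the project
    compliance_experience: bool      # has built in a regulated context


def qualifies(vendor: Vendor, regulated: bool) -> bool:
    """Return True only if every applicable criterion is met."""
    checks = [
        vendor.live_production_reference,
        vendor.relevant_stack,
        vendor.appropriate_scale,
    ]
    # Compliance experience only applies to regulated industries.
    if regulated:
        checks.append(vendor.compliance_experience)
    return all(checks)


# Hypothetical candidates, for illustration only.
candidates = [
    Vendor("Vendor A", True, True, True, True),
    Vendor("Vendor B", False, True, True, True),   # demo only, no live system
    Vendor("Vendor C", True, True, True, False),   # no regulated-sector work
]

shortlist = [v.name for v in candidates if qualifies(v, regulated=True)]
print(shortlist)
```

The point of writing it down this way is that a single failed criterion removes the vendor; there is no partial credit at the qualification stage.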
Phase 2: Structured Assessment
The RFP (Request for Proposal)
Send a structured brief to three to five qualified vendors — not more, you'll drown in proposals you can't seriously compare. The brief should include:
- Background on the business and the problem.
- A clear scope definition: what the agent will do, and just as importantly, what it won't.
- Integration requirements.
- Success metrics.
- Timeline and budget range. (Yes, share the budget. Vendors who hide their numbers because you hid yours are wasting everyone's time.)
- What you expect in the proposal.
Ask every vendor to answer the same questions in the same format. The questions we'd require:
- Describe your proposed technical architecture. Which LLM? Which retrieval approach? How will you handle integrations?
- What success metrics will you commit to, and how will they be measured?
- What's included in post-launch support, and what triggers additional billing?
- Who, specifically by name, will work on this project?
- What does your QA process look like before launch?
- Provide three references for similar projects we can contact directly.
If any vendor's proposal skips a question or gives a non-answer, treat that as the answer.
Evaluating the Proposals
Score each proposal on these dimensions. The percentages are a starting point — adjust to your project, but be explicit about it.
Technical specificity (25%): Does the proposal describe a specific architecture with reasoning behind the choices? Or is it generic AI-flavoured marketing? "We use advanced AI techniques to deliver scalable solutions" is not an architecture.
Success metric definition (20%): Have they defined specific, measurable success criteria? Or is it vague language about "improving efficiency"? The willingness to commit to metrics in writing before a contract is signed is one of the strongest signals of confidence you'll see.
Reference quality (20%): Do the references actually match your use case? And — critically — have you contacted them yourself and asked specific questions?
Team transparency (15%): Do they name the people who will actually do the work? Have you assessed those individuals, or just the salesperson?
Post-launch clarity (10%): Is it crystal clear what's included after launch, for how long, at what cost?
Commercial reasonableness (10%): Is the pricing in the right ballpark for the scope? Both the cheapest and the most expensive proposal deserve a second look. The cheapest is often dramatically underscoped (you'll find out about the change orders in month two). The most expensive is sometimes padding for a generic engagement that won't take that long.
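The weighting above reduces to a weighted sum. A minimal sketch in Python, assuming each dimension is scored from 0 to 5; the scale, the dictionary keys, and the sample scores are illustrative assumptions, and only the percentages come from the list above:

```python
# Weights from the six dimensions above; they must sum to 1.0.
WEIGHTS = {
    "technical_specificity": 0.25,
    "success_metrics": 0.20,
    "reference_quality": 0.20,
    "team_transparency": 0.15,
    "post_launch_clarity": 0.10,
    "commercial_reasonableness": 0.10,
}

MAX_SCORE = 5  # assumed scale: 0 (absent) to 5 (excellent)


def weighted_score(scores: dict[str, float]) -> float:
    """Return a 0-100 weighted total for one proposal."""
    if abs(sum(WEIGHTS.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    # Normalise each score to 0-1, then apply its weight.
    total = sum(WEIGHTS[dim] * scores[dim] / MAX_SCORE for dim in WEIGHTS)
    return round(total * 100, 1)


# Hypothetical scores for one proposal, for illustration only.
vendor_a = {
    "technical_specificity": 4,
    "success_metrics": 3,
    "reference_quality": 5,
    "team_transparency": 4,
    "post_launch_clarity": 2,
    "commercial_reasonableness": 3,
}

print(weighted_score(vendor_a))  # 74.0
```

If you adjust the percentages for your project, change them in one place and keep the sum at 100%, so every proposal is scored against the same weights.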
The Reference Call
References are the single most reliable signal in vendor evaluation, full stop — and they're the step buyers most often skip because it's awkward to ask for them and even more awkward to actually call. Do it anyway. The 30 minutes will save you 30 weeks.
When you do the call:
- Ask about the process, not just the outcome. "What was it like to work with them when something went wrong?" tells you more than "were you happy with the result?" Everyone is happy with the result in week one.
- Ask about specifics. "What did they actually build for you? Which integrations? What was the hardest part?" Generic praise is worthless. Specific stories are gold.
- Ask about the relationship after launch. "How responsive have they been since the build?" The vendor who is brilliant during the sales cycle and unreachable after the invoice clears is a common pattern.
- Ask the uncomfortable question. "Is there anything you'd do differently about choosing this vendor?" The pause before the answer often tells you more than the answer.
Three references, each contacted directly, tell you more than a portfolio of polished case studies ever will.
The Technical Interview
For shortlisted vendors, run a 45-minute technical interview with the person who will actually build your project. Not the salesperson. Not the account manager. The person whose hands will be on the code. If the vendor won't put that person on a call, that is the signal.
Questions worth asking:
- Walk me through how you'd architect this agent specifically. What would you use for retrieval, and why that choice over the alternatives?
- What are the main failure modes for a project like this, and how would you mitigate them?
- How have you handled [the trickiest integration in your stack] before?
- What would cause you to recommend against building this?
- If the agent underperforms after launch, what's your process?
You're listening for specificity and confident opinions. A technically strong developer has views and can defend them. A technically weak one gives plausible-sounding generic answers and agrees with whatever you suggest — which feels great in the moment and terrible in month three.
Phase 3: Decision and Negotiation
Scoring and Weighting
Aggregate scores from the RFP evaluation, the reference calls, and the technical interview. If you have multiple evaluators (you should), score independently and then compare. Group consensus that emerges in real time is mostly the loudest person's opinion repeated back.
Be honest about your weights. A business-critical production system should weight references and technical specificity heavily. A throwaway prototype can weight speed and cost more. Document the weighting before you see the scores, so you're not unconsciously moving the goalposts to justify the vendor you already liked.
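One way to keep evaluators independent is to collect their scores separately and only then compare them, flagging the vendors where they disagree. A minimal sketch in Python, assuming 0 to 5 scores; the evaluator names, the sample scores, and the disagreement threshold are hypothetical:

```python
from statistics import mean

# Hypothetical overall scores (0-5), submitted independently by each evaluator.
evaluator_scores = {
    "Vendor A": {"alice": 4, "ben": 4, "chloe": 3},
    "Vendor B": {"alice": 5, "ben": 2, "chloe": 3},
}

# Assumed threshold: a gap of 2+ points between evaluators warrants discussion.
DISAGREEMENT_THRESHOLD = 2


def summarise(scores_by_evaluator: dict[str, float]) -> dict:
    """Average the independent scores and flag large disagreements."""
    values = list(scores_by_evaluator.values())
    spread = max(values) - min(values)
    return {
        "mean": round(mean(values), 2),
        "spread": spread,
        "discuss": spread >= DISAGREEMENT_THRESHOLD,
    }


for vendor, scores in evaluator_scores.items():
    print(vendor, summarise(scores))
```

A large spread is not a problem to average away. It usually means one evaluator saw something the others missed, which is worth a conversation before the final ranking.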
The Contract Negotiation
Before signing anything, get explicit written agreement on:
IP ownership. All code, prompts, and documentation are yours upon payment. No exceptions, no "platform components retained by vendor," no licensing trapdoors.
Success metrics. The agreed metrics and how they'll be measured go in the contract. What happens if they aren't hit? An open conversation here is healthy; silence is a warning.
Scope change process. How are scope changes assessed and priced? What counts as a scope change versus a bug fix? This is where many engagements quietly die.
Post-launch obligations. What's included after launch, for how long, at what cost? "We'll be there if you need us" is not a clause.
Data handling. Where does your data go, who can access it, what are the retention and deletion obligations? Especially important if there's anything sensitive moving through the agent.
Termination rights. What if you need to walk away mid-project? What do you get at termination — code, documentation, prompts, weights? Settle this before you need it.
Common Evaluation Mistakes
A handful of patterns we've watched play out, more than once:
Choosing on price alone. In our experience, the lowest price rarely comes with the best outcome. A vendor who wins on price has usually either underscoped the work or undervalued their own capability, and you'll discover which one in month two when the change orders start arriving.
Being impressed by the demo. Every vendor has a demo that works. Demos tell you what they can build in controlled conditions on a stage where they control everything. References tell you what they ship into production where they don't. Trust the latter.
Not involving technical stakeholders. If your evaluation team has no technical people, borrow one. An engineer reviewing another engineer's proposal sees things a non-technical buyer can't, and vendors know it.
Rushing the decision. A real evaluation takes two to three weeks. Compressing it to a week to "move fast" usually costs you three to six months of rework later. We've seen this play out more times than we'd like.
Not checking who will actually do the work. The person who presents the proposal and the person who writes the code are very often not the same person. The proposal A-team can quietly become a B-team after you sign. Ask. Verify. Get the names in the contract if you can.
About Our Own Evaluation
We wrote this guide because clients who run proper evaluation processes consistently end up with better projects — regardless of whether they choose us at the end of it. A structured evaluation surfaces actual capability clearly, which we're fine with.
If you do request a proposal from Woyce, expect us to answer every question in this framework directly, give you references you can call, and put the specific people who'll work on your project in the technical interview. If a different vendor scores higher than us on the criteria that matter to your project, hire them. That's how the process is supposed to work.
Start an evaluation — send us your requirements document and we'll come back with a specific proposal.