The API Call Is Not the Hard Part
The OpenAI documentation is excellent. Getting a response from GPT-4 takes about twelve lines of code. Most people who describe themselves as LLM developers have done at least this.
What separates a developer who has called the API from one who has built a production LLM application is everything that happens around the API call: how you get the right information into the context, how you structure the prompt to get consistent output, how you handle failure, how you evaluate whether the system is working correctly, and how you manage cost and latency at scale.
These are engineering problems that require genuine depth. This post explains what they are and why they matter.
What an LLM Developer Actually Does
Retrieval-Augmented Generation (RAG)
Most business LLM applications cannot just send a user's question to an LLM and hope it knows the answer. The LLM needs relevant information from your company's data — documents, policies, product catalogues, historical records.
RAG is the architecture for doing this. Documents are chunked into segments, converted into vector embeddings, stored in a vector database (Pinecone, Weaviate, pgvector, Chroma, or others), and retrieved based on semantic similarity to the user's query. The retrieved content is assembled into the LLM's context alongside the user's question.
Building a RAG system that retrieves accurately is significantly harder than it looks. Chunking strategy — how you divide documents — determines whether retrieval finds the right segments. Embedding model choice determines the quality of semantic matching. Reranking, filtering, and hybrid search (combining semantic and keyword search) are often necessary to get acceptable retrieval quality. Poorly built RAG systems return plausible-sounding but incorrect answers, which is worse than no AI at all.
Prompt Engineering
Getting reliable, structured output from an LLM requires precise prompt design. The difference between a prompt that works 80% of the time and one that works 99% of the time is often significant in production.
Good LLM developers know how to structure prompts to elicit consistent output formats, how to use system prompts to constrain model behaviour, how to handle edge cases, and how to use techniques like chain-of-thought prompting when reasoning quality matters.
They also know how to test prompts systematically — against representative inputs, edge cases, and adversarial examples — rather than tweaking until a few examples look good.
Output Parsing and Validation
LLM outputs are text. Business systems need structured data. Bridging these two requires robust output parsing — extracting structured information from free text — and validation — checking that the output is actually what was expected.
This means defining schemas for what the output should look like, writing parsing logic that handles variations in how the LLM formats its response, and validating that the extracted data makes sense before passing it to downstream systems.
Pydantic, function calling, and tool use in modern LLMs help with this significantly, but they do not eliminate the need for careful validation engineering.
Model Selection and Cost Management
GPT-4o is not always the right model. For many tasks, GPT-4o Mini, Claude Haiku, or Gemini Flash are significantly cheaper and fast enough that the latency difference matters for user experience. For some tasks, a fine-tuned smaller model outperforms GPT-4o at a fraction of the cost.
A good LLM developer thinks carefully about model selection: which model is required for quality, which is sufficient for cost, and where the trade-off falls for a given use case. At scale, model cost is a real operational expense.
Evaluation
How do you know if your LLM application is working? This is genuinely hard. LLM outputs are probabilistic and often subjective. "Does this response seem good?" does not scale to production systems that handle thousands of queries a day.
Production LLM applications need systematic evaluation: test sets of representative queries with expected outputs, automated metrics where they apply (factual accuracy, citation accuracy, output format adherence), human evaluation workflows for qualitative assessment, and regression tracking so you know when a model update or prompt change has degraded performance.
Building this evaluation infrastructure is often 20–30% of the engineering effort on a serious LLM project and frequently the part that gets skipped, leading to systems that seemed to work in testing and degraded silently in production.
Observability
LLM applications fail in ways that are hard to diagnose without good logging. A query that returns a wrong answer — why? What was retrieved? What was in the context window? What did the prompt look like? What did the raw model output look like before parsing?
Good LLM developers instrument their systems to capture this information for every request, making debugging a matter of inspection rather than guesswork. Tools like LangSmith, Weights & Biases, and custom logging pipelines serve this purpose.
The Difference Between Fine-Tuning and RAG
A common question from clients is whether to fine-tune a model or use RAG. The answer is usually RAG, at least initially, and for specific reasons:
Fine-tuning updates the model's weights using examples of desired behaviour. It is useful for teaching the model a consistent style, improving performance on a specific task type, or internalising a very large amount of information that cannot fit in a context window efficiently.
RAG retrieves relevant information at query time. It is easier to update (change the documents, not the model), more transparent (you can see what was retrieved), and handles dynamic information much better than fine-tuning.
Most business AI applications benefit more from good RAG than from fine-tuning, at least until they have enough usage data to identify where fine-tuning would meaningfully improve performance.
What to Ask an LLM Developer
- How do you structure retrieval for a large, heterogeneous document corpus? What chunking strategy do you use and why?
- How do you evaluate whether your RAG system is retrieving correctly? What does your test set look like?
- How do you handle hallucination? What do you do when the model does not have enough information to answer accurately?
- How do you manage LLM cost at scale?
- What does your observability stack look like?
These questions surface whether someone has built real systems or just called an API.
What We Build at Woyce
We build LLM applications for businesses — RAG pipelines, document processing workflows, conversational agents, and AI-powered features in web applications. We have shipped production systems, built evaluation infrastructure, and dealt with the failure modes that only appear under real usage.
Tell us what you are trying to build and we will tell you what the right approach is.