This story on HackerNoon has a decentralized backup on Sia.
Transaction ID: pEe8r0j1Zx7nMBJVhc2J30fN_lWfRb1qNdlCFR_4myg

Fine-Tuning vs Prompt Engineering

Written by @balogundavidtaiwo | Published on 2026/4/10

TL;DR
Start with prompt engineering for speed, flexibility, and lower cost; it solves most problems early on. Only move to fine-tuning when you have clear, repeated failure patterns and enough high-quality data to justify it. In production, the best results come from a hybrid approach: measure performance, fix gaps with prompts, and use fine-tuning selectively for stable, high-impact improvements.

What Actually Works in Production

A practical framework for choosing your LLM optimisation strategy

Introduction: The Great Optimisation Divide

Teams building systems on Large Language Models (LLMs) face a common dilemma: is it better to fine-tune the model or to design better prompts? The industry has falsely simplified a complex problem into an either/or choice. After designing and implementing numerous production systems, I've found the answer has never been fine-tuning OR prompt engineering. It's recognising when to apply which method and, in many cases, combining the two.

This article won't dwell on low-level engineering details. Rather, it focuses on the pragmatic side of the decision: customers are waiting, budgets are limited, and time is of the essence.

The Promise and the Reality

Prompt engineering sounds too good to be true: no infrastructure to build, no training costs, results in minutes. Imagine spending a few hours crafting the perfect system message and getting perfect domain understanding from a model. If only.

Now consider fine-tuning. It looks dauntingly 'scientific' to many: collecting data, retraining the model, and deploying a customised system. Nothing about that long-winded process sounds appealing... or does it? In practice, each approach delivers real gains the other cannot, which is exactly why the choice between them is so often contested in professional settings.

When Prompt Engineering Actually Wins

1. You're Building Fast and Iterating: Need to launch in weeks, not months? Prompt engineering offers unparalleled speed. If you're looking for a quick solution to about 80% of your problems, a well-crafted system message with examples and reasoning patterns will get you there quickly.
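The "system message plus examples plus reasoning" pattern is just a message list you assemble before each call. A minimal sketch, assuming a chat-style API; the classifier task, labels, and examples are made up for illustration:

```python
# Hypothetical task: classify support tickets. The system prompt asks for
# step-by-step reasoning; the few-shot pairs anchor the output format.

SYSTEM_PROMPT = """You are a support-ticket classifier.
Think step by step, then answer with exactly one label:
billing, technical, or account."""

FEW_SHOT = [
    ("I was charged twice this month.", "billing"),
    ("The app crashes when I upload a file.", "technical"),
]

def build_messages(user_input: str) -> list[dict]:
    """Assemble the chat message list: system prompt, examples, then query."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for question, label in FEW_SHOT:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": user_input})
    return messages

msgs = build_messages("How do I reset my password?")
```

Changing the behaviour means editing `SYSTEM_PROMPT` or the examples and redeploying, which is why iteration is so fast.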

2. Your Task Varies Across Contexts: If you're reviewing legal documents or providing customer support across different jurisdictions and industries, maintaining a fine-tuned model per context is a maintainability nightmare. Variability is exactly where prompt engineering thrives and fine-tuning falls short.

3. You Need to Adapt Frequently: With prompt engineering, a change to your system message is live in minutes, while fine-tuning leaves you waiting days or weeks for retraining. Quality matters, but in production settings, flexibility is often the most useful feature.

4. Your Data is Proprietary or Sensitive: Some organisations cannot share data with third-party APIs for fine-tuning. Others cannot absorb the cost of maintaining customised models. With prompt engineering, your data stays in your own prompts and logs rather than in someone else's training pipeline. That is a security and compliance win.
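Even when prompting through a third-party API, you can scrub obvious sensitive values before a request leaves your infrastructure. A minimal sketch; the two regex patterns here are illustrative assumptions, not an exhaustive PII scrubber:

```python
import re

# Illustrative patterns only: real redaction needs a proper PII library
# and a review of what your compliance regime actually covers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def redact(text: str) -> str:
    """Replace emails and card-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = CARD.sub("[CARD]", text)
    return text

clean = redact("Contact jane@example.com about card 4111 1111 1111 1111.")
```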

The Hybrid Reality: What Actually Works

In the production systems I have encountered, the answer is both. The playbook looks like this:

1. Start with Prompt Engineering. Deploy quickly with a well-structured system prompt, examples, and reasoning patterns. This should address 70-80% of your use cases. Then, measure everything.

2. Define Your Failure Modes. Do not guess. Analyse your errors. Is the model failing on format (then fine-tuning might assist)? Is it failing on context (then prompt engineering might assist)? Is it failing on reasoning (then either could assist)? Be ruthless in your prioritisation.
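Step 2 can be as simple as tallying tagged failures from your logs. A sketch under the assumption that you label each logged error with a category like `format` or `context` (the log shape and category names here are my own):

```python
from collections import Counter

# Hypothetical error log: each entry carries a hand- or rule-assigned
# failure category. Four entries keep the example readable.
error_log = [
    {"id": 1, "failure": "format"},   # e.g. output was not valid JSON
    {"id": 2, "failure": "context"},  # e.g. missed a detail in the input
    {"id": 3, "failure": "format"},
    {"id": 4, "failure": "other"},
]

counts = Counter(entry["failure"] for entry in error_log)
top_failure, top_count = counts.most_common(1)[0]
```

The dominant bucket tells you where to spend effort first: format-heavy failures point toward fine-tuning, context-heavy ones toward better prompts.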

3. Focus Data Collection on Your High-Impact Failures. Only the failures that hurt you most are worth collecting. If you are already at 95% accuracy, you likely do not need fine-tuning. If your failures are erratic, prompt engineering might not help. Construct your dataset deliberately.

4. Fine-Tune for the Patterns You've Identified. Once you have 200+ high-quality examples, fine-tuning becomes economically feasible. Combine it with your most effective prompts from step 1.
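When you get to step 4, the curated examples typically need to be serialised into the chat-style JSONL format that hosted fine-tuning endpoints commonly expect (one JSON object per line, each holding a `messages` list). A sketch; the system prompt and example pairs are assumptions carried over from the earlier classifier illustration:

```python
import json

SYSTEM_PROMPT = "You are a support-ticket classifier."

# (input, ideal output) pairs curated from your high-impact failures.
examples = [
    ("I was charged twice this month.", "billing"),
    ("The app crashes when I upload a file.", "technical"),
]

def to_jsonl(pairs) -> str:
    """Serialise pairs into chat-format JSONL, reusing the best prompt."""
    lines = []
    for user_text, ideal in pairs:
        record = {"messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": ideal},
        ]}
        lines.append(json.dumps(record))
    return "\n".join(lines)

jsonl = to_jsonl(examples)
```

Including your production system prompt in every training record keeps the fine-tuned model aligned with how you will actually call it.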

5. Assess the Improvement. Conduct A/B testing. Compare a fine-tuned model with a great prompt against the base model with a great prompt. If the improvement isn't worth the trade-offs in infrastructure and latency, then continue with prompt engineering.
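Step 5 reduces to scoring both variants on the same held-out set and looking at the lift. A sketch with stand-in predictors; in practice `base_model` and `tuned_model` would wrap real model calls, and the tickets and labels here are invented:

```python
# Held-out evaluation set: (input, expected label) pairs.
holdout = [
    ("ticket 1", "billing"),
    ("ticket 2", "technical"),
    ("ticket 3", "billing"),
    ("ticket 4", "account"),
]

def accuracy(predict, dataset) -> float:
    """Fraction of examples where predict(text) matches the label."""
    correct = sum(1 for text, label in dataset if predict(text) == label)
    return correct / len(dataset)

# Stand-in predictors simulating each variant's answers.
base_model = {"ticket 1": "billing", "ticket 2": "technical",
              "ticket 3": "account", "ticket 4": "account"}.get
tuned_model = {"ticket 1": "billing", "ticket 2": "technical",
               "ticket 3": "billing", "ticket 4": "account"}.get

lift = accuracy(tuned_model, holdout) - accuracy(base_model, holdout)
```

If `lift` is small, the extra infrastructure and latency of serving a custom model are probably not worth it; stay with the prompt.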

The Economics: Where the Rubber Meets the Road

Fine-tuning carries expenses beyond the initial training run. There are ongoing infrastructure costs: model hosting, version control, A/B testing, and the mental overhead of juggling multiple models.

Prompt engineering is not free either. There is the time to construct and iterate on prompts, the time to maintain example repositories, and the extra tokens those examples add to every request.

As a company grows, here is the pattern I've observed:

| Scenario | Winner | Why |
| --- | --- | --- |
| < 100M inferences/month | Prompt Engineering | Fixed-cost infrastructure dominates |
| 100M-1B inferences/month | Hybrid | Fine-tuning cost per inference becomes significant |
| > 1B inferences/month | Fine-Tuning | Unit economics favour an optimised model |
| Frequent changes needed | Prompt Engineering | Update latency matters more than inference cost |
| Locked-in behaviour needed | Fine-Tuning | Cannot be achieved with prompts alone |
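The trade-off behind those scenarios is a break-even calculation: the per-request cost of extra prompt tokens versus the fixed cost of hosting a custom model. A sketch with entirely made-up numbers; plug in your own token prices and hosting fees, and note that fine-tuned models can also carry a per-token serving premium, which shifts the break-even point:

```python
# All three constants are illustrative assumptions, not real prices.
PROMPT_OVERHEAD_TOKENS = 600       # extra few-shot/context tokens per request
PRICE_PER_1K_TOKENS = 0.0005       # input-token price in USD (assumed)
TUNED_HOSTING_PER_MONTH = 2_000.0  # fixed monthly cost of a custom model (assumed)

def monthly_prompt_overhead_cost(inferences: int) -> float:
    """Monthly cost of the extra prompt tokens at a given request volume."""
    return inferences * PROMPT_OVERHEAD_TOKENS / 1000 * PRICE_PER_1K_TOKENS

def break_even_inferences() -> float:
    """Volume at which fixed hosting cost equals the prompt overhead cost."""
    per_inference = PROMPT_OVERHEAD_TOKENS / 1000 * PRICE_PER_1K_TOKENS
    return TUNED_HOSTING_PER_MONTH / per_inference

be = break_even_inferences()
```

Below the break-even volume, paying for longer prompts is cheaper; above it, a fine-tuned model that needs fewer prompt tokens starts to pay for its own hosting.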

Common Mistakes I See Every Day

  • Error 1: Fine-tuning without a baseline: Teams spend weeks refining data and tuning models without first measuring how far a good prompt gets them. Establish a prompt-engineering baseline so you can justify fine-tuning against it.

  • Error 2: Training data misaligned with production: The fine-tuned model's training data, often pulled from historical logs, drifts away from what production traffic actually looks like. Continuously check that the training data matches what you currently expect from the model.

  • Error 3: Ignoring distribution shift: A model fine-tuned on January's data will not perform to standard on March's data. A fine-tuned model needs a retraining pipeline and monitoring wired into your system. More than prompt engineering is required here.

  • Error 4: Optimising for the wrong metric: A prompt that improves the F1 score at the expense of latency, because it keeps getting longer, is not a win. Optimise for what matters to your users and your production system: cost, speed, accuracy, and user satisfaction, not a benchmark number.

  • Error 5: Treating prompt engineering as a one-off: The best prompts live under a system of control that lets them evolve with the data, with your understanding of the problem, and with your business needs. Keep testing your prompts and A/B testing them in production.

What's Changing

Three catalysts are shifting this calculus.

First, prompt engineering is getting more powerful. Techniques such as retrieval-augmented generation (RAG), function calling, and structured prompting mean fine-tuning is needed far less often. The capability gap between prompt techniques and fine-tuning is quickly shrinking.
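To make the RAG point concrete, here is a toy sketch of the core idea: retrieve the most relevant snippet and splice it into the prompt at request time, instead of fine-tuning that knowledge into the model. Real systems retrieve with embeddings over a vector store; the word-overlap scoring and the documents below are simplifying assumptions:

```python
# A tiny in-memory "knowledge base" standing in for a document store.
DOCS = [
    "Refunds are processed within 5 business days.",
    "Password resets require access to the registered email.",
    "Enterprise plans include a dedicated support channel.",
]

def retrieve(query: str, docs=DOCS) -> str:
    """Pick the document sharing the most words with the query."""
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

def build_prompt(query: str) -> str:
    """Splice the retrieved context into the prompt before the question."""
    return f"Context: {retrieve(query)}\n\nQuestion: {query}\nAnswer:"

prompt = build_prompt("How long do refunds take?")
```

Because the knowledge lives in the documents rather than in the model weights, updating behaviour is a data change, not a retraining job.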

Second, inference costs keep falling. Cheaper tokens weaken the economic case for fine-tuning: you can afford to add examples and context to every prompt instead of baking them into the model.

Finally, models are getting smarter. Newer models follow instructions more reliably and are more robust to prompt variations, so they need fewer examples to pick up a pattern. Each improvement chips away at fine-tuning's advantage.

Fine-tuning is not obsolete, but its application is becoming more specialised and narrow. Most teams should treat prompt engineering as the default technique.

Conclusion: Start Simple, Scale Deliberately

The most successful teams I've worked with start with prompt engineering and focused measurement, moving to fine-tuning only when the data and the economics justify it. They don't treat these as rival tools but as complementary ways to solve a problem with well-balanced trade-offs.

To manage a system, you must be able to understand and troubleshoot it. That's what prompt engineering gives you. Next, measure what exactly is broken; the data tells you what to address. After that, fix the issues that matter most to your users.

That last step is where fine-tuning belongs. Not as the first step, and not as the obvious step, but as the right step, at the right time, with the right data.



Written by
@balogundavidtaiwo
I'm a seasoned data scientist with a strong passion for harnessing the power of data to drive informed decision-making

Topics and tags
llm-optimization|prompt-engineering|llm-architecture|hybrid-llm-optimization|ai-model-optimization|llm-deployment|llm-cost|enterprise-llm