You don’t always need a new model. Most of the time, you just need better examples.
Large language models aren’t “smart” in the human sense; they’re pattern addicts. Give them the right patterns in the prompt, and they start behaving like a domain expert you never trained.
This is basically what Few‑Shot In‑Context Learning (ICL) is about: instead of retraining or fine‑tuning the model, you inject knowledge directly through carefully crafted examples in the prompt.
In this piece, we’ll walk through:
- What “knowledge injection” really means in practice
- How Few‑Shot‑in‑Context sits between “do nothing” and “full fine‑tune”
- A concrete mental model of how LLMs learn from examples
- Five battle‑tested design principles for good few‑shot prompts
- Cross‑industry examples (medical, finance, programming, education, law)
- Six common failure modes — and how to debug them
- Where this technique is going next (RAG, multi‑modal, personal knowledge, etc.)
If you’re tired of hearing “we need to fine‑tune a custom model” every time someone wants to add new knowledge… this article is for you.
1. Why “knowledge injection” matters (and where Few‑Shot ICL fits)
Most real‑world LLM problems boil down to one painful fact:
The model doesn’t actually know what you care about.
There are two big gaps:
- Fresh knowledge. Anything that happened after the model’s training cutoff:
- 2024–2025 policy changes
- New internal processes
- Product updates, pricing changes, API deprecations…
- Niche / long‑tail knowledge. Things that were never common enough to appear in large quantities online:
- A tiny vertical’s industry standard
- Rare diseases’ treatment guidelines
- Your company’s weird internal naming conventions
The classic response is:
“Let’s fine‑tune a model.”
Which sounds cool until you realise what that implies:
- Curate + clean + label a dataset
- Pay for training (GPU hours, infra, dev time)
- Wait days or weeks
- Repeat whenever something changes
For many teams (and almost all solo builders), that’s overkill.
Enter Few‑Shot‑in‑Context
Few‑Shot‑in‑Context Learning = you don’t touch model weights. You just:
- Pick 3–5 good examples that:
  - Embed the knowledge you care about
  - Match the task format you want
- Stick them into the prompt before the actual query
- Let the model infer the pattern and apply it to new inputs
You’ve just “fine‑tuned” the model’s behaviour… without training anything.
Example: suppose you want the model to answer questions about a 2025 EV subsidy policy that it’s never seen.
Instead of fine‑tuning, you can drop something like this into the prompt:
Task: Decide if a car model qualifies for the 2025 EV subsidy and explain why.
Example 1
Input: Battery EV, 500km range
Output: Eligible. Reason: battery electric vehicle with range ≥ 500km gets a 15,000 GBP subsidy.
Example 2
Input: Plug‑in hybrid, 520km range
Output: Eligible. Reason: plug‑in hybrid with range ≥ 500km also gets 15,000 GBP.
Example 3
Input: Battery EV, 480km range
Output: Not eligible. Reason: 480km < 500km threshold, so no subsidy.
Now solve this:
Input: Battery EV, 530km range
Output:
From these 3 examples, the model can infer the rule:
“Type ∈ {BEV, PHEV} AND range ≥ 500km → 15k subsidy; otherwise 0.”
That’s knowledge injection via examples.
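If you'd rather see that assembled in code, here is a minimal sketch. The example data comes straight from the prompt above; the call_llm placeholder stands in for whichever model client you actually use.

EXAMPLES = [
    {"input": "Battery EV, 500km range",
     "output": "Eligible. Reason: battery electric vehicle with range >= 500km gets a 15,000 GBP subsidy."},
    {"input": "Plug-in hybrid, 520km range",
     "output": "Eligible. Reason: plug-in hybrid with range >= 500km also gets 15,000 GBP."},
    {"input": "Battery EV, 480km range",
     "output": "Not eligible. Reason: 480km < 500km threshold, so no subsidy."},
]

TASK = "Decide if a car model qualifies for the 2025 EV subsidy and explain why."


def build_prompt(query: str) -> str:
    """Assemble the task description, worked examples, and the new query into one prompt."""
    blocks = [f"Task: {TASK}", ""]
    for i, ex in enumerate(EXAMPLES, start=1):
        blocks += [f"Example {i}", f"Input: {ex['input']}", f"Output: {ex['output']}", ""]
    blocks += ["Now solve this:", f"Input: {query}", "Output:"]
    return "\n".join(blocks)


print(build_prompt("Battery EV, 530km range"))
# answer = call_llm(build_prompt("Battery EV, 530km range"))  # call_llm is a placeholder for your own client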
2. How Few‑Shot‑in‑Context actually works (mental model)
To understand why this works, forget about gradient descent and transformers for a second and think in more human terms.
When you few‑shot an LLM, it roughly goes through three internal phases:
2.1 Example parsing: “What’s going on here?”
The model scans your prompt and:
- Detects the task pattern: “Ah, it’s always Task → Input → Output.”
- Extracts knowledge bits: “EV subsidy depends on:
  - type: BEV / PHEV vs fuel car
  - range: threshold 500km
  - standard subsidy amount: 15k”
It’s not memorising the car models; it’s capturing the relationship between input features and output labels.
2.2 Pattern induction: “What’s the general rule?”
Given multiple examples, the model tries to generalise:
- Example A: BEV, 500km → 15k
- Example B: PHEV, 520km → 15k
- Example C: BEV, 480km → 0
- Example D: Fuel car, 600km → 0
It can then infer something like:
“Subsidy depends on being a new energy vehicle AND hitting the range threshold. Fuel cars never qualify.”
This is the “few‑shot learning” part: very little data, but high‑signal structure.
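To make that induced rule concrete, here is what it looks like written out as Python. This is purely illustrative: the model never executes anything like this; it just starts behaving as if it had.

# Illustrative only: the rule the examples imply, written out explicitly.
def subsidy(vehicle_type: str, range_km: int) -> int:
    """Hypothetical 2025 EV subsidy rule induced from examples A-D."""
    is_new_energy = vehicle_type in {"BEV", "PHEV"}
    meets_range = range_km >= 500
    return 15_000 if (is_new_energy and meets_range) else 0


assert subsidy("BEV", 500) == 15_000   # Example A
assert subsidy("PHEV", 520) == 15_000  # Example B
assert subsidy("BEV", 480) == 0        # Example C
assert subsidy("Fuel", 600) == 0       # Example D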
2.3 Task transfer: “Apply that rule over here”
When the user sends a new query — say:
Input: PHEV, 510km
The model doesn’t go “I’ve seen this exact string before”. It goes: “This matches the pattern → I know the rule → apply → 15k subsidy.”
This is the crucial distinction:
Good few‑shot prompts teach the model the logic, not just the answer.
If your examples are just random Q&A with no visible structure, the model may end up parroting instead of reasoning. The rest of this article is about avoiding that.
3. Five design principles for Few‑Shot knowledge injection
You can’t just throw examples at the model and hope for the best. The quality of your examples is 90% of the game.
Here are five principles I keep coming back to.
3.1 Cover both core dimensions and edge cases
Real rules always have:
- Normal cases (most inputs)
- Edge cases (boundary conditions, exclusions, weird situations)
Your examples should cover both.
Bad pattern (only “happy paths”):
Example 1: BEV, 500km → subsidy
Example 2: PHEV, 520km → subsidy
The model might incorrectly infer:
“If it’s an EV and the number looks big-ish, say ‘subsidy’.”
Better pattern (explicit edges):
Example 1: BEV, 500km → eligible (base case)
Example 2: PHEV, 520km → eligible (second base case)
Example 3: BEV, 480km → not eligible (range too low)
Example 4: Fuel car, 600km → not eligible (wrong type)
Now the model has:
- Dimensions: drive type, range
- Edges:
  - Range < threshold
  - Non‑EV types
Design rule:
For any rule you’re encoding, ask: “What are the obvious ‘no’ cases? Did I show at least one of each?”
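If your examples live in a structured form anyway, you can make that question a mechanical check. A rough sketch, using made-up field names for the subsidy case:

# Rough coverage check: does every dimension have at least one failing example?
# The example set and field names are illustrative, not a real dataset.
examples = [
    {"type": "BEV",  "range_km": 500, "eligible": True},   # base case
    {"type": "PHEV", "range_km": 520, "eligible": True},   # second base case
    {"type": "BEV",  "range_km": 480, "eligible": False},  # range too low
    {"type": "Fuel", "range_km": 600, "eligible": False},  # wrong type
]

has_range_negative = any(e["range_km"] < 500 and not e["eligible"] for e in examples)
has_type_negative = any(e["type"] not in {"BEV", "PHEV"} and not e["eligible"] for e in examples)

assert has_range_negative, "show at least one 'range too low' case"
assert has_type_negative, "show at least one 'wrong vehicle type' case"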
3.2 Keep the format perfectly consistent
Models are surprisingly picky about format. If your examples look one way and your “real” task looks another, performance tanks.
You need consistency at three levels:
- Structure — same high‑level layout, e.g. always Task → Input → Output, in that order.
- Terminology — same words for the same concepts. Don’t alternate between EV, electric car, and new energy car unless you really have to.
- Output shape — same style of answer, e.g. always “Yes/No + Reason”, or always a JSON object, or always a 3‑bullet explanation.
Inconsistent example (what not to do):
Example 1: “2025 EV subsidy: BEV A, 500km → 15k.” (terse)
Example 2: “According to the 2025 regulation, model B qualifies for…”
Task: “Explain whether model C qualifies and justify your reasoning.”
The model has to guess which style to imitate.
Consistent example (what you want):
Task: Decide if the car qualifies for subsidy and explain why.
Example 1
Input: Type=BEV, Range=500km
Output: Eligible.
Reason: Battery EV with range ≥ 500km meets the threshold; base subsidy is 15,000 GBP.
Example 2
Input: Type=PHEV, Range=490km
Output: Not eligible.
Reason: Plug‑in hybrid but range 490km < 500km threshold, so no subsidy.
Now solve this:
Input: Type=BEV, Range=530km
Output:
Everything lines up, so the model can just continue the pattern.
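One cheap way to guarantee this is to render every example and the final query through the same formatting function, so the fields literally cannot drift. A small sketch using the Type/Range fields from above:

# One formatter for both the worked examples and the real query,
# so structure, field names, and output shape stay identical by construction.
def render_case(vehicle_type: str, range_km: int, answer: str | None = None) -> str:
    input_line = f"Input: Type={vehicle_type}, Range={range_km}km"
    output_line = f"Output: {answer}" if answer is not None else "Output:"
    return f"{input_line}\n{output_line}"


examples = [
    render_case("BEV", 500, "Eligible.\nReason: Battery EV with range >= 500km meets the threshold; base subsidy is 15,000 GBP."),
    render_case("PHEV", 490, "Not eligible.\nReason: Plug-in hybrid but range 490km < 500km threshold, so no subsidy."),
]
query = render_case("BEV", 530)  # answer left blank for the model to continue

print("\n\n".join(examples + ["Now solve this:\n" + query]))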
3.3 Sample count: 3–5 examples is usually enough
This one is empirical but holds up annoyingly well:
- Fewer than 3 examples → shaky generalisation, easy overfitting
- More than 5–6 examples → diminishing returns + wasted context
Why not 20 examples? Because:
- LLMs have a context window; you pay in tokens
- Extra examples can actually dilute the pattern if they’re noisy
- You often don’t have 20 high‑quality labelled examples for a new rule
The sweet spot for most tasks is:
- 3–4 examples for simple rules
- 4–5 examples when you need to show edge cases + 1–2 tricky combos
If you feel you need 12 examples, it’s often a smell that:
- You haven’t factored the rule cleanly, or
- You’re mixing multiple tasks into one prompt
3.4 Your examples must be factually correct
This sounds obvious, but in practice… we all copy‑paste in a hurry.
The model assumes your examples are ground truth. If you smuggle in a bug, it’ll confidently reproduce it everywhere.
Example of hidden poison:
Example 1: Range ≥ 500km → subsidy 15,000
Example 2: Range 520km → subsidy 20,000
Now the model has to guess whether:
- You changed the policy halfway through, or
- You made a mistake
There’s no way for it to “correct” you; it’s not cross‑checking with the internet.
Treat your few‑shot block as production code:
- Double‑check numbers & thresholds
- Make sure all examples obey the same rule
- If you’re encoding law / medicine / finance, verify against the source document
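When the rule is mechanical enough to express in code, you can go further and lint the few-shot block before it ever reaches the model. A sketch, reusing the hypothetical subsidy rule from earlier:

# Lint the few-shot block against a reference implementation of the rule,
# so a wrong threshold or a contradictory example is caught before the model sees it.
def reference_rule(vehicle_type: str, range_km: int) -> int:
    # Hypothetical 2025 policy: BEV/PHEV with range >= 500km gets 15,000 GBP, otherwise 0.
    return 15_000 if vehicle_type in {"BEV", "PHEV"} and range_km >= 500 else 0


few_shot_examples = [
    {"type": "BEV",  "range_km": 500, "subsidy": 15_000},
    {"type": "PHEV", "range_km": 520, "subsidy": 20_000},  # the "hidden poison" from above
]

for i, ex in enumerate(few_shot_examples, start=1):
    expected = reference_rule(ex["type"], ex["range_km"])
    if ex["subsidy"] != expected:
        print(f"Example {i} contradicts the policy: labelled {ex['subsidy']}, rule says {expected}")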
3.5 Order your examples from simple → complex
Humans don’t like learning by being dropped straight into edge cases; neither do LLMs.
If your first example already mixes three special conditions, the model might:
- Over‑weight that weird case
- Miss the simpler underlying rule
A nicer pattern:
- Base case A
- Base case B
- Clear negative / boundary
- Then maybe one tricky combo
Example in subsidy land:
Example 1: Simple positive (type OK + range OK)
Example 2: Simple negative (type OK + range too low)
Example 3: Simple negative (wrong type)
Example 4: Complex (meets base rule + qualifies for extra top‑up)
Example 5: Complex (meets base rule but not the extra condition)
By the time you show examples 4 and 5, the model already knows the base rule and can now learn the extra dimension cleanly.
4. Cross‑industry case studies
Let’s look at how this plays out in different domains. Same technique, very different flavours.
We’ll keep the prompts compact but expressive; you can expand them in your own stack.
4.1 Medical: encoding a rare disease guideline (HAE)
Problem
You want the model to assist clinicians with a 2024 guideline on a rare disease:
Hereditary Angioedema (HAE) – recurrent non‑itchy swelling – C1‑INH deficiency – often misdiagnosed as allergy
The base model either:
- Has never seen the latest guideline, or
- Hallucinates based on generic “allergy” patterns
Instead of fine‑tuning, you inject the guideline logic as examples.
Few‑shot pattern
You define a task:
Task: Based on the 2024 HAE guideline, decide:
1) Does this case meet the diagnostic criteria?
2) What is the reasoning?
3) What is an appropriate initial management plan?
Then you show 3–4 cases that span:
- Typical family case
- Clear non‑HAE allergic case
- “Sporadic” case with no family history but positive lab markers
- Another non‑HAE case with itchy rash
Each example is shaped like:
Example 1
Input (summary):
- Recurrent facial swelling, non‑itchy
- Mother with similar episodes
- C1‑INH activity 20% (low)
Output:
1) Diagnosis: Consistent with HAE.
2) Reasoning: Non‑itchy angioedema + positive family history + low C1‑INH.
3) Management: …
Then your real query:
Now analyse this patient:
- 35‑year‑old male
- Recurrent eyelid swelling for 2 years
- No itch, no urticaria
- Father with similar symptoms
- C1‑INH activity 25% (low), C4 reduced
Return the same 3‑part structure.
The model doesn’t need to “know” HAE from training — it learns the diagnostic logic on the fly from your few examples.
You still need a human in the loop (this is medicine), but the model stops hallucinating generic “take some antihistamines” advice.
4.2 Finance: injecting a brand‑new IPO regulation
Problem
You’re building an assistant for investment bankers analysing 2025 IPO rules for a specific exchange (say, a revised science‑tech board in the UK).
The rulebook changed in March 2025:
- New market‑cap + revenue + profit combos
- Special rules for red‑chip / dual‑class structures
- Stricter restrictions on use of proceeds
Your base model has never seen this document.
Few‑shot pattern
You declare a task:
Task: Based on the 2025 revised IPO rules, decide whether a company qualifies for listing on the sci‑tech board. Explain your reasoning along these dimensions:
- Market cap & financial metrics
- Tech / “hard‑tech” profile
- Use of proceeds
- Any special structure (red‑chip, VIE, dual‑class)
Then you show three archetypal examples:
- Clean domestic hard‑tech issuer
  - R&D‑heavy AI chip company
  - Market cap ≥ 50B, revenue ≥ 10B, net profit ≥ 2B
  - Proceeds used to build new chip fab
  - Result: qualifies
- Red‑chip, loss‑making biotech
  - Cayman / Hong Kong structure
  - Market cap ≥ 150B, loss‑making but high R&D ratio
  - Proceeds for overseas R&D centres
  - Result: qualifies under special red‑chip route
- Non‑tech traditional business misusing funds
  - Clothing manufacturer
  - Market cap and profits below threshold
  - Wants to use proceeds to buy wealth‑management products
  - Result: does not qualify (fails both “hard‑tech” positioning and funds‑use rule)
Now when you feed in:
Company D:
- Hong Kong‑registered red‑chip
- Quantum computing hardware
- Projected post‑IPO market cap: 180B
- Still loss‑making; R&D / revenue ratio 30%
- No VIE structure
- Proceeds used to hire researchers and develop quantum algorithms
Question: Does it qualify? Explain along the same dimensions as the examples.
The model can map this onto the red‑chip example and apply the rule: “large market cap + genuine hard‑tech + loss‑making allowed under route X + proceeds used for R&D” → qualifies.
No regulation fine‑tuning required.
4.3 Programming: teaching Python 3.12 type parameter syntax
Problem
You’re generating code, but the base model:
- Was trained mostly on Python ≤ 3.11
- Keeps suggesting TypeVar boilerplate
- Doesn’t use Python 3.12’s new generic syntax
You want the model to write idiomatic 3.12 code, today.
Key new idea
Instead of:
from typing import TypeVar

T = TypeVar("T")

def identity(x: T) -> T:
    return x
Python 3.12 lets you write:
def identity[T](x: T) -> T:
    return x
And similarly for classes:
class Stack[T]:
    ...
Few‑shot pattern
Define the task:
Task: You are coding in Python 3.12.
- Prefer the new type parameter syntax: def func[T](...) -> ...
- Prefer built‑in generics (list[T], dict[str, T]) over typing.List / typing.Dict.
- When you see older TypeVar patterns, refactor them into the new style.
Then show a few before/after examples.
Example 1 — basic function
# Before (pre‑3.12)
from typing import TypeVar

T = TypeVar("T")

def echo(x: T) -> T:
    return x

# After (Python 3.12)
def echo[T](x: T) -> T:
    return x
Example 2 — list generics
# Before
from typing import TypeVar, List

U = TypeVar("U")

def head(values: List[U]) -> U:
    return values[0]

# After
def head[U](values: list[U]) -> U:
    return values[0]
Example 3 — generic class
# Before
from typing import TypeVar, Dict, Generic

K = TypeVar("K")
V = TypeVar("V")

class SimpleCache(Generic[K, V]):
    def __init__(self) -> None:
        self._data: Dict[K, V] = {}

    def get(self, key: K) -> V | None:
        return self._data.get(key)

# After
class SimpleCache[K, V]:
    def __init__(self) -> None:
        self._data: dict[K, V] = {}

    def get(self, key: K) -> V | None:
        return self._data.get(key)
Example 4 — dictionary helper (the “target” pattern)
Now you show a case very close to what you actually want:
# Before
from typing import TypeVar, Dict

T = TypeVar("T")

def get_value(d: Dict[str, T], key: str) -> T | None:
    return d.get(key)

# After
def get_value[T](d: dict[str, T], key: str) -> T | None:
    return d.get(key)
Now when you ask:
“Write a generic lookup function in Python 3.12 that looks up a key in a dict[str, T] and raises KeyError if missing, using the new type parameter syntax.”
…you’ll get something like:
def lookup[T](store: dict[str, T], key: str) -> T:
    if key not in store:
        raise KeyError(key)
    return store[key]
Exactly the behaviour you want — without touching the model weights.
4.4 Education: encoding a new curriculum standard
Problem
In 2025, the middle‑school math curriculum adds a new unit:
- “Data & algorithms basics”
- Simple visualisation (e.g. box plots)
- Basic algorithm concepts (e.g. bubble sort, described in natural language)
You’re building a teacher‑assistant tool that:
- Reviews lesson plans
- Flags whether they align with the new standard
- Explains why in teacher‑friendly language
Few‑shot pattern
Define the task:
Task: Given a lesson description for grades 7–9, decide:
1) Does it align with the 2025 “Data & Algorithms Basics” standard?
2) Why or why not?
3) If not, how could it be adjusted?
Show three archetypes:
- Good “box plot” lesson
  - Students compute quartiles for real exam‑score data
  - Draw box plots
  - Discuss concentration and spread → Meets “data visualisation” requirement
- Good “bubble sort” lesson
  - Teacher explains algorithm with diagrams
  - Students describe steps in natural language
  - No Python coding required → Matches “understand algorithm idea”, not “implement deep learning”
- Overkill AI lesson
  - Teacher asks 9th graders to code a neural network in Python
  - Predict exam scores from features → Way beyond the standard; flagged as misaligned
Then test case:
Lesson: Grade 8 “Data & Algorithms Basics”.
- Students design a simple random sampling plan to pick 50 students out of 2,000.
- They collect “weekly exercise time” data.
- They draw a box plot of exercise time and discuss the results.
Question: Aligned or not? Analyse using the three points as in the examples.
The model learns that random sampling + box plots + interpretation is very much in‑scope for the new standard, and it can explain that in the same structure as the examples.
4.5 Law: encoding new labour‑law amendments
Problem
You have a legal assistant for HR that needs to know about 2025 labour‑law amendments, including:
- Occupational injury insurance for platform workers / gig economy
- How to measure working time under remote work
- Minimum compensation for non‑compete agreements
The base model doesn’t know about the 2025 changes.
Few‑shot pattern
Define the task:
Task: Given a short labour dispute case, do:
1) Legal assessment under the 2025 amendments
2) Cite the relevant new article(s) in plain language
3) Suggest next steps for the worker and/or employer
Show three examples:
- Gig worker injury
  - Platform courier, no formal contract, injured on delivery
  - Platform refuses compensation → New rule: factual employment + platform must pay into injury insurance → Suggest: apply for recognition of labour relationship, then injury claim
- Remote work time tracking
  - Employer arbitrarily deducts pay by claiming “low efficiency”
  - No proper time‑tracking system → New rule: use “effective working time” and require documented tracking → Suggest: ask employer for evidence; if none, pay must be corrected
- Non‑compete with absurdly low pay
  - 2‑year non‑compete, but only 500 GBP/month compensation → New rule: at least 30% of average salary and no lower than minimum wage → Worker can refuse to comply or demand revised compensation
Then you ask about a design worker who:
- Worked fully remotely
- Got only 3,000 GBP in June, while their historical average is 8,000
- Employer never tracked “effective working time”
- Employer claims “your efficiency was low”
The model can now chain:
- This matches the remote work example
- Employer failed their duty to track time
- Worker can demand full pay based on historical average, minus what’s been paid
Again: the law text itself lives outside the model. The usable logic gets distilled into 3–4 examples.
5. Six common failure modes (and how to fix them)
Even with the right idea, few‑shot prompts can fail in subtle ways. Let’s debug some typical issues.
5.1 The model just parrots your examples
Symptom
- Works fine on seen cases
- On new input, it copies values from examples or repeats entire example answers
Likely causes
- Examples are purely concrete — lots of “case A / case B” but no visible rule
- Task wording doesn’t force generalisation
Fixes
- Explicitly encode the rule once
Instead of only:
Example: EV A, 500km → 15k
Example: EV B, 520km → 15k
Add a line like:
Rule: For all BEVs and PHEVs with range ≥ 500km, subsidy is 15,000 GBP.
- Align features
If your examples all use “Car: A, range 500, type BEV”, and your real task says “Model E, long‑range battery, SUV body style”… the model may not spot that “long‑range battery” = “range feature”.
Be boring and explicit:
- Always show the same fields: Type=…, Range=…, BodyStyle=…
5.2 The model ignores edge cases
Symptom
- Handles normal cases well
- Fails on boundary or special cases that were in your examples
Example: you showed one “HAE with no family history” case, but the model still insists “no family history → no HAE”.
Likely causes
- Only a single edge example vs many normal examples
- Edge case buried at the end with no emphasis
Fixes
- Add at least two edge‑case examples
- Label them clearly, e.g.:
[Boundary case] Patient C: no family history, but low C1‑INH and typical symptoms → still HAE.
Explicit tags like [Boundary case] or [Special exception] help the model treat them as part of the rule, not noise.
5.3 Examples are too long (signal drowned in noise)
Symptom
- Prompt looks like a wall of text
- Model latches onto random parts (e.g. story flavour) instead of the core logic
- Sometimes it even ignores later examples because the context is overloaded
Likely causes
- You copy‑pasted entire documents instead of minimal supervised examples
- You included narrative fluff, version history, citations, etc.
Fixes
- Trim each example to: Input → Output → Short explanation
- Put long policy / guideline text in a separate section above, and summarise the operative rule in the example
Try to keep:
- Each example ≤ ~100 tokens if possible
- All examples + instructions under ~80% of your context window, leaving room for the actual query and model’s reasoning
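A crude budget check is usually enough here. The 4-characters-per-token ratio below is only a rough English-text heuristic; swap in your model's real tokenizer if you need exact counts:

# Crude token-budget check for a few-shot block.
# ~4 characters per token is a rough English-text heuristic, not an exact count.
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)


CONTEXT_WINDOW = 8_000              # whatever your model actually offers
BUDGET = int(CONTEXT_WINDOW * 0.8)  # leave ~20% for the query and the model's answer

instructions = "Task: Decide if the car qualifies for subsidy and explain why."
examples = [
    "Input: Type=BEV, Range=500km\nOutput: Eligible. Reason: meets the 500km threshold.",
    "Input: Type=PHEV, Range=490km\nOutput: Not eligible. Reason: below the 500km threshold.",
]

for i, ex in enumerate(examples, start=1):
    if approx_tokens(ex) > 100:
        print(f"Example {i} is over ~100 tokens; consider trimming it")

total = approx_tokens(instructions) + sum(approx_tokens(ex) for ex in examples)
if total > BUDGET:
    print(f"Few-shot block (~{total} tokens) is over 80% of the context window")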
5.4 Terminology collision across domains
Symptom
- You re‑use generic terms like “subsidy”, “margin”, “policy” in multiple domains
- The model mixes up meanings (e.g. subsidy in finance vs subsidy in automotive)
Likely causes
- No domain qualification: “subsidy” means different things in 3 prompts
- Inconsistent phrasing: “post‑IPO market cap” vs “total value after listing” vs “size”
Fixes
- Define terms in your instructions:
In this task, “market cap” always means “projected post‑IPO market cap” (shares × offering price).
- Use domain‑specific names where possible: ev_subsidy, ipo_post_listing_market_cap, hae_lab_marker
Consistency beats cleverness.
5.5 Small models can’t juggle too many conditions
Symptom
- A big model (e.g. GPT‑4‑class) handles your prompt perfectly
- A smaller one (7B / 13B) seems to ignore half the conditions
Example: in HAE diagnosis, the small model only looks at symptoms but ignores lab values and family history.
Likely causes
- You’re asking it to learn a 3–4‑dimensional rule in one shot
- Each example mixes many conditions at once
Fixes
Break the problem into layers of examples:
- Single‑dimension examples
  - Only symptoms: itchy vs non‑itchy swelling
- Two‑dimension examples
  - Symptoms + lab values
- Three‑dimension examples
  - Symptoms + lab values + family history
The smaller model can first lock in:
- “Non‑itchy swelling” vs “allergic swelling”
- Then “low C1‑INH” vs “normal”
- Then “family history strengthens the case”
Instead of being hit with all three dimensions at once.
5.6 Silent errors in your examples poison everything
Symptom
- Everything feels coherent
- But the model’s answers are consistently off from the real rules
When you go back and re‑read your few‑shot block, you find:
- One threshold is wrong
- Or two examples contradict each other
Fixes
Treat prompt debugging like code review:
- Validate against the source of truth
  - Policy document
  - Official guideline
  - Legal article
- Check internal consistency
  - Same thresholds everywhere
  - Same units (km vs miles, %)
  - No “≥ 450km” in one example and “≥ 500km” in the next
If you fix the examples and the model’s behaviour changes, you just proved that your few‑shot block is acting like a miniature “training set in context” — which is exactly the point.
6. Where this is going next
Few‑Shot‑in‑Context is not a temporary hack; it’s becoming part of the core LLM engineering toolbox, especially when combined with other techniques.
A few directions that are already practical today:
6.1 Few‑shot + RAG = dynamic knowledge injection
Instead of hard‑coding your examples into the prompt, you can:
- Store policy snippets / typical cases in a vector DB
- At query time:
  - Retrieve the most relevant 3–5 items
  - Format them as few‑shot examples
  - Feed them to the model
You get:
- Up‑to‑date knowledge (change the DB, not the model)
- Domain‑specific behaviour
- No retraining loops
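Here is a minimal sketch of that loop. The word-overlap scorer is a toy stand-in for a real embedding model and vector store, and the knowledge base is just the subsidy cases from earlier:

# Few-shot + RAG in miniature: retrieve the most relevant stored cases at query time,
# then format them as few-shot examples.
KNOWLEDGE_BASE = [
    {"input": "Battery EV, 480km range", "output": "Not eligible. Reason: below the 500km threshold."},
    {"input": "Plug-in hybrid, 520km range", "output": "Eligible. Reason: PHEV with range >= 500km."},
    {"input": "Fuel car, 600km range", "output": "Not eligible. Reason: not a new energy vehicle."},
]


def score(query: str, doc: str) -> float:
    """Toy relevance score via word overlap; replace with embedding similarity in practice."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(1, len(q | d))


def retrieve_examples(query: str, k: int = 3) -> list[dict]:
    return sorted(KNOWLEDGE_BASE, key=lambda ex: score(query, ex["input"]), reverse=True)[:k]


def build_rag_prompt(query: str) -> str:
    parts = ["Task: Decide if a car model qualifies for the 2025 EV subsidy and explain why.", ""]
    for i, ex in enumerate(retrieve_examples(query), start=1):
        parts += [f"Example {i}", f"Input: {ex['input']}", f"Output: {ex['output']}", ""]
    parts += ["Now solve this:", f"Input: {query}", "Output:"]
    return "\n".join(parts)


print(build_rag_prompt("Battery EV, 530km range"))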
6.2 Multi‑modal few‑shot
With vision‑capable models, examples don’t have to be text‑only.
You can show:
- An image of a box plot + text interpretation → teach data‑viz reading
- A scan of a legal clause + structured summary → teach contract analysis
- A medical image + diagnosis → teach pattern recognition frameworks (with humans supervising)
The principle is the same: a tiny set of high‑quality, well‑structured examples sets the behaviour.
6.3 Personal and team‑level “knowledge presets”
For individuals and small teams, we’ll likely see:
- Tools that help you generate few‑shot blocks from your own notes
- “Profiles” that encode:
  - Your coding style
  - Your company’s policy interpretations
  - Your favourite data formats
Think of it as a “soft fine‑tune” you can edit in a text editor.
7. Takeaways
If you remember only three things from this article, make them these:
- Fine‑tuning is rarely your first move. Often, you can get 80–90% of the value by injecting knowledge with 3–5 carefully chosen examples.
- Good few‑shot prompts encode logic, not trivia. Cover both normal and edge cases, be brutally consistent in format, and keep examples factual and compact.
- Few‑shot is a bridge between raw models and real products. It lets you adapt general‑purpose LLMs to fast‑moving, niche, or private knowledge — on your laptop, in minutes, without a GPU farm.
Once you start thinking of prompts as tiny, editable “on‑the‑fly training sets”, you stop reaching for fine‑tuning by default — and start shipping faster.