If you’ve ever asked an LLM for a “clear explanation” and got back the same sentence wearing three different hats, welcome to the club.
Repetition usually isn’t malice—it’s math. Decoding is probability, and the highest-probability path often loops through safe phrases (“therefore”, “in conclusion”, “from a broader perspective”) like a commuter stuck on the Manchester–London line with two signal failures and a mysterious “operational issue”.
Penalties are the knobs that break that loop.
Why LLMs Repeat (And Why It’s Not Your Fault)
Repetition happens for three very boring reasons:
1) Decoding follows the highest-probability trail
At each step, the model picks the next token based on probabilities. Once a phrase becomes likely, it can keep becoming likely (momentum + local coherence).
2) Training data contains a lot of templates
Web writing is full of ritual phrases. The model learns them because they appear everywhere—and because they “work” statistically.
3) Your prompt leaves the exit door open
If you don’t define scope, format, and constraints, the model will often “pad” to sound complete—by paraphrasing itself.
You can fix #3 with better instructions. You fix #1 and #2 with penalties.
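Here's a toy illustration of reason #1 (a hand-rolled next-token table, not a real model): once a loop becomes the highest-probability path, greedy decoding never leaves it.

// Toy illustration only: a tiny hand-made next-token table plus greedy decoding.
// Once "in" → "conclusion" → "in" is the most likely path, always picking the
// argmax loops on it forever.
const nextTokenProbs = {
  growth: { is: 0.5, in: 0.3, therefore: 0.2 },
  is: { strong: 0.6, in: 0.4 },
  strong: { in: 0.7, ".": 0.3 },
  in: { conclusion: 0.8, "2026": 0.2 },
  conclusion: { in: 0.6, ".": 0.4 },
};

// Greedy decoding: always take the most likely next token.
function greedyNext(token) {
  const candidates = Object.entries(nextTokenProbs[token]);
  return candidates.sort((a, b) => b[1] - a[1])[0][0];
}

let token = "growth";
const output = [token];
for (let i = 0; i < 9; i++) {
  token = greedyNext(token);
  output.push(token);
}
console.log(output.join(" "));
// "growth is strong in conclusion in conclusion in conclusion in"

Real models sample rather than always taking the argmax, but the pull towards the comfy path is the same. Penalties exist to weaken it.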
The Penalty Trio (What They Actually Do)
Different platforms name these slightly differently, but the underlying idea is the same: change the odds of generating certain tokens.
Frequency penalty (token reuse tax)
In OpenAI-style APIs, frequency_penalty reduces the probability of tokens that have already appeared, proportionally to how often they appeared. Positive values make the model less likely to repeat itself.
Use it when:
- Your output repeats the same adjectives (“powerful”, “efficient”, “seamless”).
- The model keeps looping on the same “key benefit”.
Typical starting range:
- 0.3–0.8 for long-form explanations
- 0.8–1.2 for marketing copy (careful: too high can get weird)
Presence penalty (topic-hopping nudge)
presence_penalty penalizes tokens simply for having appeared at all, which encourages introducing new tokens/topics rather than staying on the same rails.
Use it when:
- The model keeps circling one idea without adding new dimensions.
- You want broader coverage (“give me 8 distinct angles”).
Typical starting range:
- 0.2–0.7
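Under the hood, both of these penalties adjust the same logits. OpenAI's docs describe the adjustment as roughly: take the token's raw score, subtract count × frequency_penalty, and subtract presence_penalty once if the token has appeared at all. A minimal sketch of that arithmetic (illustrative only; the real thing happens inside the model server):

// Sketch of the logit adjustment described in OpenAI's docs.
// `logit` is the raw score for a candidate token, `count` is how many
// times that token has already been sampled in the output so far.
function penalise(logit, count, frequencyPenalty, presencePenalty) {
  return (
    logit
    - count * frequencyPenalty              // grows with every repeat
    - (count > 0 ? 1 : 0) * presencePenalty // one flat hit once the token has appeared at all
  );
}

console.log(penalise(2.0, 3, 0.7, 0.3)); // ≈ -0.4: heavily discouraged after 3 uses
console.log(penalise(2.0, 0, 0.7, 0.3)); // 2: an unused token is left alone

The key difference: frequency_penalty keeps growing with every repeat, while presence_penalty is a one-off hit.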
Repetition penalty (open-source classic)
In Hugging Face/Transformers style generation, repetition_penalty is a single multiplier applied to previously generated tokens. 1.0 means no penalty; values above 1.0 penalise repeats.
Use it when:
- You’re running LLaMA/Mistral/Qwen locally and you see literal repeated phrases.
- The model starts “stuttering” until it hits the max token limit.
Typical starting range:
- 1.1–1.3
- Values above 1.5 can produce “funky outputs” in some tokenizers/models (it’s a blunt instrument).
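For reference, this is the multiplicative transform popularised by the CTRL paper and used in Hugging Face-style generation: logits of already-seen tokens are divided (if positive) or multiplied (if negative) by the penalty, so repeats always become less attractive. A minimal sketch (implementations vary in detail):

// Hugging Face-style repetition penalty, CTRL-paper formulation.
function repetitionPenalise(logit, alreadySeen, penalty = 1.2) {
  if (!alreadySeen) return logit;
  return logit > 0 ? logit / penalty : logit * penalty;
}

console.log(repetitionPenalise(2.4, true));  // 2: a positive logit shrinks
console.log(repetitionPenalise(-1.0, true)); // -1.2: a negative logit is pushed further down
console.log(repetitionPenalise(2.4, false)); // 2.4: unseen tokens are untouched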
A Practical Tuning Playbook (No Guessing, No Vibes)
Here’s a simple workflow that works in real projects:
Step 1: Fix the prompt first (free wins)
Add these constraints:
- Output format: bullet list, table, numbered steps
- Hard limits: word count or max items
- Anti-dup rule: “Don’t repeat the same phrase; each point must add new information.”
- Banned phrases: “Avoid ‘in conclusion’, ‘overall’, ‘therefore’.”
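Put together, those four constraints fit in a single prompt. A hypothetical example (the topic is just a placeholder):

// Hypothetical prompt with all four constraint types baked in:
// format, hard limit, anti-dup rule, and banned filler phrases.
const prompt = `
Explain why a UK council tax bill can change mid-year.
Format: a numbered list of exactly 5 points, max 120 words total.
Don't repeat the same phrase; each point must add new information.
Avoid "in conclusion", "overall" and "therefore".
`;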
Step 2: Add penalties gradually
Start conservative, then adjust in 0.1 increments.
- If you see phrase loops, increase frequency_penalty.
- If you see one-idea spirals, increase presence_penalty.
- If you’re on open-source inference, try repetition_penalty before you rewrite the entire prompt.
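If you'd rather not tune by hand, a tiny sweep over a few values 0.1 apart makes the comparison concrete. A sketch assuming the OpenAI Node SDK, same as the full example later in this post (the prompt is a placeholder):

// Hypothetical sweep: run the same prompt at a few frequency_penalty values,
// 0.1 apart, and eyeball which setting kills the loops without going weird.
import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const prompt = "List 6 distinct reasons UK energy bills vary month to month.";

for (const fp of [0.3, 0.4, 0.5, 0.6]) {
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: prompt }],
    frequency_penalty: fp,
  });
  console.log(`--- frequency_penalty: ${fp} ---`);
  console.log(res.choices[0].message.content);
}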
Step 3: Measure, don’t just feel
A quick heuristic you can apply without tools:
- If a sentence “could be swapped” with an earlier sentence and still make sense… it’s probably redundant.
- If you can highlight the same adjective 5+ times in 300 words… your frequency penalty is too low.
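The second heuristic is easy to automate. A rough helper that counts word frequencies in a chunk of output (roughly 300 words) and flags anything appearing 5+ times:

// Rough redundancy check: count words of 4+ letters and flag the overused ones.
function overusedWords(text, threshold = 5) {
  const words = text.toLowerCase().match(/[a-z]{4,}/g) ?? [];
  const counts = new Map();
  for (const w of words) counts.set(w, (counts.get(w) ?? 0) + 1);
  return [...counts.entries()]
    .filter(([, n]) => n >= threshold)
    .sort((a, b) => b[1] - a[1]);
}

// Example: overusedWords(res.choices[0].message.content)

Run it on a few generations before and after changing frequency_penalty and you have a crude but honest signal.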
Battle-Tested Settings by Scenario
1) Long reports (e.g., “UK fintech trends in 2026”)
Goal: avoid repeating the same argument with different wording.
Suggested settings:
- frequency_penalty: 0.5–0.8
- presence_penalty: 0.2–0.5
Prompt add-ons:
- “Each section must introduce at least one new example (UK/EU).”
- “No repeated transition phrases.”
2) Marketing copy (short, punchy, no buzzword soup)
Suggested settings:
- frequency_penalty: 0.9–1.2
- presence_penalty: 0.1–0.4
Extra trick:
- Provide a banned word list (e.g., “innovative, cutting-edge, next-gen”).
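A hypothetical post-check for that banned-word trick: scan the generated copy and report any buzzwords that slipped through despite the prompt.

// Hypothetical buzzword check: if anything comes back, regenerate or raise frequency_penalty.
const banned = ["innovative", "cutting-edge", "next-gen", "seamless"];

function bannedHits(text) {
  const lower = text.toLowerCase();
  return banned.filter((word) => lower.includes(word));
}

// Example: bannedHits(output) → ["seamless"]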
3) Customer support (multi-turn, avoid copy/paste)
Suggested settings:
- frequency_penalty: 0.3–0.6
- presence_penalty: 0.2–0.5
Prompt add-ons:
- “Don’t repeat instructions already given; only add missing steps.”
- “Keep greeting to one short phrase per reply.”
4) Open-source inference stutter (local models)
Suggested settings:
- repetition_penalty: 1.15–1.30
- plus normal sampling choices (temperature/top-p) as needed
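If it helps, here are the four profiles above as one lookup table. The values sit inside the suggested ranges and are starting points, not universal truths:

// Starting-point penalty profiles for the scenarios above; tune in 0.1 steps from here.
const penaltyProfiles = {
  longReport:      { frequency_penalty: 0.6, presence_penalty: 0.3 },
  marketingCopy:   { frequency_penalty: 1.0, presence_penalty: 0.2 },
  customerSupport: { frequency_penalty: 0.4, presence_penalty: 0.3 },
  localInference:  { repetition_penalty: 1.2 }, // Hugging Face-style runtimes only
};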
Code: A Small Example You Can Actually Use
Here’s a minimal Node.js snippet that uses both penalties to reduce repetition when generating a short explainer (example topic: “why council tax bills can look confusing”). (I’m keeping it intentionally compact—paste, run, iterate.)
import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const prompt = `
Write a concise UK-focused explanation (max 180 words) of why a council tax account balance
can look higher than expected. Use 4 bullet points. Avoid filler phrases like "overall" or
"in conclusion". Each bullet must introduce a distinct reason.
`;

const res = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: prompt }],
  temperature: 0.6,       // moderate randomness; the penalties do the anti-repetition work
  frequency_penalty: 0.7, // tax tokens proportionally to how often they've already appeared
  presence_penalty: 0.3   // gentle nudge towards introducing new tokens/topics
});

console.log(res.choices[0].message.content);
What to tweak:
- Still seeing repeated phrasing? Push frequency_penalty → 0.8.
- Output feels like it’s jumping topics? Pull presence_penalty → 0.2.
- Text sounds unnatural? Lower both by 0.1.
(These penalty parameters are part of OpenAI’s API surface for chat/completions.)
The Most Common Penalty Mistakes (And How to Dodge Them)
Mistake 1: “Zero repetition” as a goal
Over-penalising forces the model into awkward synonyms (“this caffeinated beverage” instead of “coffee”). Fix: keep penalties moderate, and enforce structure with the prompt.
Mistake 2: Using penalties to compensate for vague prompts
Penalties are not a replacement for constraints. They’re a multiplier on good instructions.
Mistake 3: Copying settings between tasks
A marketing penalty profile will wreck academic writing (you want consistent terminology in papers).
Mistake 4: Forgetting model differences
Open-source models can be more sensitive to repetition controls; increase slowly and watch coherence.
A Simple Checklist Before You Ship
- Prompt has a format and a length cap
- Prompt includes an explicit no-dup rule
- Penalties start in the conservative range
- You adjusted in 0.1 steps, not giant jumps
- Output reads naturally to a human (not a thesaurus)
Final Thought
Penalties aren’t about “making the model obey”. They’re about nudging decoding away from the comfy, repetitive path and back into useful language.
Treat them like seasoning:
- Too little → bland, repetitive mush
- Too much → weird, inedible synonyms
- Just right → your output stops sounding like it’s writing to hit a word count.
Now go make your model shut up productively.