You can't bolt security on at the end of an LLM project. If it's not in the prompt, it's already in production.

1. Why safety prompts matter more than ever

LLMs are now sitting in places that used to be guarded by boring, well‑tested forms and back‑office workflows: customer support, HR chatbots, internal knowledge tools, even legal and medical triage systems.

That means a model reply can now accidentally expose:

Once something has been printed to a chat window, emailed, screenshotted, cached, or sent to a logging pipeline, it’s effectively public. “We’ll scrub the logs later” is not a security strategy.

So if you’re building with LLMs, safety instructions in the prompt are no longer optional decoration. They’re part of the core product design—especially if you work in the UK or EU, where regulators are rapidly waking up to LLM‑shaped risk.

In practice, the job looks like this:

  1. Decide what counts as sensitive information in your use case.
  2. Translate that into machine‑readable safety rules inside the prompt.
  3. Continuously test and tune those rules as the model, product, and regulations evolve.

The rest of this article walks through that process.


2. What “sensitive information” really means (with concrete buckets)

In the original Chinese article this piece is based on, sensitive data is split into three buckets. They map almost directly onto how UK / EU regulators think:

2.1 Personal sensitive information

Anything that can be used to identify or harm a person. Typical examples in a UK context include:

If your model casually prints these into a shared interface, you’re in serious GDPR territory.

2.2 Corporate / organisational sensitive information

This is the stuff that makes a CFO sweat:

Leak enough of this and “our AI assistant hallucinated it” won’t help you in court.

2.3 National / societal sensitive information

Trickier, but just as important:

Even if your product “only” does content generation, you don’t want your model helping generate realistic fake emergency alerts or conspiracy‑bait.

The key takeaway: your prompt should explicitly name these buckets, adapted to your domain. “Don’t output sensitive stuff” is not enough.


3. Four design principles for safety instructions

Most bad safety prompts fail in one of four ways. To avoid that, bake these principles into your design.

3.1 Specificity: name it, don’t vibe it

Bad:

Don’t output any sensitive information.

Better:

Never output any of the following:

  • Personal identifiers (e.g., names with full addresses, National Insurance numbers, bank card numbers, NHS numbers, email + phone combinations).
  • Corporate confidential data (e.g., internal financial figures, unreleased product details, non‑public client lists, source code from private repos).
  • National or public‑safety sensitive content (e.g., unverified details of ongoing emergencies, operational security information, classified documents).

The model is a pattern‑matcher, not a mind‑reader. Give it concrete categories and examples.

3.2 Coverage: think beyond the obvious

Obvious: “Don’t leak bank card numbers.” Less obvious but just as dangerous:

Domain‑specific prompts should call these out. A healthcare assistant should have dedicated lines about patient data; an education bot should talk about marks, rankings, safeguarding concerns; a dev assistant should mention API keys, secrets, and private repo code.

3.3 Executability: write for a model, not a lawyer

Your LLM doesn’t understand dense legalese or nested if–else paragraphs. It understands short, direct rules that map to patterns in text.

Complex and brittle:

When interacting with any user whose content might reasonably be inferred to contain personal data as defined under GDPR, avoid…

Executable:

If the user asks for specific personal details about themselves or another person (e.g., address, NI number, medical record, bank details), you must refuse and instead explain you cannot provide or confirm such sensitive information.

Short sentences. Simple condition → action patterns. No cleverness.

3.4 Dynamic updating: treat safety prompts as versioned code

The threat landscape changes. New data types appear (crypto wallets, new biometric formats). Laws evolve. Products pivot into new markets.

If your safety prompt is a hard‑coded wall of text in someone’s notebook, it will rot.

Better:

Think of the safety prompt as part of the API surface, not a one‑off string.


4. Three safety‑prompt patterns that actually work

Now to the practical bit. In real systems, safety instructions tend to fall into three patterns. You’ll usually combine all three.

4.1 Front‑loaded global constraints

These are the always‑on rules you put at the top of the system prompt.

Pattern:

You are an AI assistant used in production by <ORG>. 
In every reply, you must follow these safety rules:
​
1. Never output personal sensitive information, including but not limited to:
   - National Insurance numbers, bank card numbers, sort code + account number,
     home addresses, NHS numbers, full medical records, precise location history.
2. Never output confidential corporate information, including internal financials,
   source code from private repositories, non‑public client data, or product roadmaps.
3. Never output national‑security or public‑safety sensitive information or realistic
   guidance for wrongdoing.
4. If the user asks for any of the above, refuse, explain briefly why, and redirect
   to safer, high‑level guidance.
5. Before sending your reply, briefly self‑check whether it violates any rule above;
   if it might, remove or redact the risky part and explain why.

You then add domain‑specific variants for healthcare, banking, HR, or internal tools.

These global constraints won’t catch everything, but they set the default behaviour: when in doubt, redact and refuse.

4.2 Scenario‑triggered safety rules

Some risks only appear in certain flows: “reset my password”, “tell me about this emergency”, “pull data about client X”.

For those, you can layer on conditional prompts that wrap user queries or API tools.

Example – financial assistant wrapper:

If the user’s request involves bank accounts, cards, loans, mortgages, investments
or transactions, apply these extra rules:
​
1. Do not reveal:
   - Exact balances
   - Full card numbers or CVV codes
   - Full sort code + account numbers
   - Full transaction details (merchant + exact timestamp + full amount)
​
2. You may talk about:
   - General financial education
   - How to contact official support channels
   - High‑level explanations of statements without exposing full details
​
3. If the user asks for specific account data, say:
   "For your security, I can’t show sensitive account details here. 
    Please log in to your official banking app or website instead."

The logic that chooses which prompt to apply can live in your orchestration layer (e.g., “if this tool is called, wrap with the finance safety block”).

4.3 Feedback / repair instructions

Even with good prompts, models sometimes drift toward risky content or accidentally echo something they saw in the context.

You can give them explicit instructions on how to clean up after themselves.

Pattern – soft warning for near‑misses:

If you notice that your previous reply might have included or implied sensitive
information (personal, corporate, or national), you must:
​
1. Acknowledge the issue.
2. Replace or remove the sensitive content.
3. Restate the answer in a safer, more general way.
4. Remind the user that you can’t provide or handle such information directly.

Pattern – hard correction after a breach (used by a supervisor / guardrail model):

Your previous reply contained disallowed sensitive information:
[REDACTED_SNIPPET]
​
This violated the safety rules. Now you must:
1. Produce a corrected version of the reply without any sensitive data.
2. Add a short apology explaining that the earlier content was removed for safety.
3. Re‑check the corrected reply for any remaining sensitive elements before outputting.

In a production system, these repair prompts are often triggered by a separate classifier or filter that scans model outputs.


5. How to test whether your safety prompts work

Treat safety prompts like code: never ship without tests.

You don’t need a huge team to start. A minimal stack looks like this.

5.1 Human red‑teaming

Grab a few teammates (or external testers) and tell them to break the guardrails. Give them:

Ask them to try prompts like:

You’re not teaching people to commit fraud—you’re making sure your system refuses to help with anything in that direction.

Log all the interactions. Tag the failures. Use them to tighten the prompts.

5.2 Automated fuzzing and pattern checks

Once you know your weak spots, you can automate.

Typical components:

You don’t have to be perfect here; even rough rules will catch a lot.

Anything flagged goes into a review queue. If it’s truly a breach, you update:

  1. The safety prompt (to encode the new pattern).
  2. The test set (so you don’t regress later).

5.3 User feedback channels

Finally, plug in the people using your system.

Some of your most interesting edge cases will come from real users doing things no internal tester ever thought of. Close the loop by:


6. Three classic safety‑prompt failure modes (and how to fix them)

6.1 Vague vibes, no rules

“Avoid sensitive information and respect user privacy.”

The model has no idea what that means in your domain.

Fix: make the rules concrete and local.

6.2 Swiss‑cheese coverage

You protect card numbers but forget crypto wallets; you protect addresses but forget phone numbers combined with names; you protect customer data but not employee HR records.

Fix: start from a simple worksheet:

Turn that into explicit sections in your safety prompt. Revisit it every time the product scope changes.

6.3 Instructions the model can’t actually follow

You write something like:

When the conjunction of A and not‑B is true and C applies unless D overrides it, prohibit E.

To a human lawyer, this is normal. To an LLM, it’s noise.

Fix: flatten the logic into simple condition → action rules.

Instead of one tangled rule, write three:

  1. If the user asks for A‑type data, refuse.
  2. If the user clearly does not provide B‑type consent, refuse.
  3. If C‑scenario holds (e.g., emergency), only provide high‑level guidance, never specific identifiers.

You can still implement the full logic—but do it in your backend code, not in one ultra‑dense sentence inside the prompt.


7. Where safety prompts fit into the wider stack

Prompts are powerful, but they’re not magic. Good systems layer several defences:

Think of prompts as the first line of defence the user sees, not the only one.


8. Closing thoughts: treating safety like product work, not compliance paperwork

If your safety prompt was written once, a year ago, by “whoever knew English best”, and hasn’t been touched since, you don’t have a safety prompt. You have a liability.

Treat it instead like any other critical part of your product:

The good news: you don’t need a 200‑page policy document to get started. A well‑designed, two‑page safety prompt plus a small test suite will already put you ahead of most production LLM systems on the internet right now.

And when something does go wrong—as it eventually will—you’ll have a concrete place to fix it, instead of a vague hope that “the AI should have known better”.