Multilingual Prompt Injection Exposes Gaps in LLM Safety Nets

I came across a post on Twitter from a security researcher who claimed they’d bypassed multiple LLM runtime safeguards, including Azure Content Filter, simply by switching their prompt injection payloads to Thai and Arabic. The result? $37,500 in bug bounties across programs.

I wasn’t really surprised, to be honest, because most safety nets in existence have a huge language hole in them. If you are thinking of security as an afterthought or add-on, multilingual prompt injection is one of the clearest example why this is a really horrible idea.

What Is Multilingual Prompt Injection?

The basis of prompt injection exploits is the fact that LLMs can’t reliably distinguish between instructions and data. A well-crafted input can convince the model to ignore its system prompt, leak sensitive information, or take unintended actions through connected tools.

Multilingual prompt injection takes this a step further. Instead of crafting the payload in English, where safety filters are strongest, the attacker translates it into another language. The model still understands the instruction because its multilingual, but the safety layer often doesn’t catch it.

Think of it like having a bouncer at a nightclub who only speaks English, you can walk right past them if you give the password in Mandarin. The door still opens, the bouncer just didn’t understand what you said.

Why It Works

The root cause is straightforward as all and every safety training are disproportionately built around English-language data.

Safety tuning is language-lopsided. When models undergo reinforcement learning from human feedback, the vast majority of examples used to teach the model what’s “safe” and “unsafe” are in English. The model learns strong boundaries in English. In non-English, those boundaries are sometimes barely there.

Content filters have blind spots. Runtime safety layers like Azure Content Safety, AWS Bedrock Guardrails, and similar tools are effectively classification models. They’re trained to detect harmful patterns in text, but as Microsoft’s own documentation notes, their Prompt Shields are trained and tested primarily on a handful of languages such as Chinese, English, French, German, Spanish, Italian, Japanese, and Portuguese. That leaves a large chunk of the world’s languages as potential bypass vectors.

One 2025 comparison of leading guardrail solutions found that none of the major platforms including Azure Content Safety and Amazon Bedrock, had validated multilingual prompt injection defenses, particularly for languages like Chinese. The gap is even wider for lower-resource languages.

Tokenization compounds the problem. LLMs process text through tokenizers, and most tokenizers are optimized for English and other Latin-script languages. Non-Latin scripts like Arabic, Thai, or Khmer often get fragmented into more tokens, which can change how the model interprets the input and how filters evaluate it. This tokenization asymmetry creates additional blind spots that attackers can exploit.

The Attack Surface Is Wider Than You Think

Multilingual prompt injection isn’t limited to a single technique. From what I’ve seen in the field and in published research, there are several patterns worth understanding:

Direct translation. The simplest approach: take an English payload that would get blocked, translate it into a lower-resource language, and submit it. This works surprisingly often because the model’s capabilities (understanding the instruction) outpace its safety training (recognizing it as harmful) in that language.

Code-switching and mixed-language prompts. Rather than using a single non-English language, attackers mix languages within a single prompt. This confuses both the model’s safety alignment and external filters, which struggle to evaluate context across language boundaries.

Geopolitical obfuscation. Recent research has demonstrated an even more sophisticated technique: fragmenting a prompt across multiple languages chosen based on their geopolitical distance from the subject matter. For example, describing one element in Swahili and another in Thai creates an obfuscation layer that prevents safety filters from recognizing the relationships between entities in the prompt while the generation model still assembles the full picture.

Voice and accent exploitation. This extends beyond text. Voice-based AI agents that were primarily trained on certain accents may parse other accents less reliably, creating gaps where injected instructions slip through. If the speech-to-text pipeline misinterprets input, downstream safety filters never see the actual intent.

Conclusion

Multilingual prompt injection is a symptom of a deeper problem as safety and capability are advancing at different speeds, and that gap is widest for non-English languages. Models get more capable across languages with every release, safety coverage doesn’t keep pace.

The good news is that awareness is growing as OWASP has elevated prompt injection to the top of its LLM risk list. Bug bounty programs are rewarding multilingual bypass discoveries and researchers are publishing work on cross-lingual safety gaps. The problem is actually on the radar.

But awareness without action is just another afterthought. And with AI systems, afterthoughts have consequences.

Multilingual Prompt Injection Exposes Gaps in LLM Safety Nets

What Is Multilingual Prompt Injection?

Why It Works

The Attack Surface Is Wider Than You Think

Conclusion

References