🖼
AI Security 7 min read By XPWD Team

Multimodal Prompt Injection: Attacks via Images, Audio, and Video

Text-based prompt injection is well documented. The same problem now extends to images, audio, and video — and provenance gets a lot harder when the instruction is hiding in a pixel, not a sentence.

Prompt injection research has focused overwhelmingly on text, for good reason — that's where most production LLM applications still operate. But as vision-language and audio-capable models move into production, the same fundamental problem — the model can't reliably distinguish "instructions" from "content" — now applies to pixels and waveforms, not just words.

Image-Based Injection

Hidden Text in Low-Contrast or Off-Canvas Regions

Text rendered at extremely low contrast against its background, or placed outside the visible crop a human would notice but within what a vision model processes, can carry instructions invisible to a human reviewer but legible to the model's OCR-like processing of the image.

Adversarial Perturbations

Pixel-level perturbations imperceptible to humans can shift a vision model's classification or description of an image — relevant wherever an application's downstream logic depends on the model correctly identifying image content (content moderation systems being an obvious target).

Audio-Channel Injection

For voice assistants and audio-processing pipelines, instructions can be embedded in audio at frequencies or volumes designed to be inaudible or unnoticeable to a human listener but still processed by the model's speech recognition — background audio in a video call, a podcast playing nearby, or steganographically embedded in music. The attack surface here mirrors ultrasonic and "DolphinAttack"-style research against traditional voice assistants, now extended to LLM-based audio processing.

Video Frame Injection

Video adds a temporal dimension: an instruction can appear in a single frame for a fraction of a second — long enough for frame-by-frame model processing to catch it, short enough that a human watching at normal speed never notices.

Why Provenance Gets Harder

Text-based defenses increasingly rely on tagging content by source — "this came from the user," "this came from an external document" — and treating external sources with more suspicion. That's harder to do cleanly across modalities: an image can contain both the legitimate visual content a user intended to share and an injected instruction in the same frame, with no clean way to separate "trusted intent" from "untrusted payload" at the pixel level the way you might delimit it in a text prompt.

Mitigations, Such As They Are

The Bottom Line

Every modality a model can process is a modality an attacker can inject through. As production systems add vision and audio capability, the prompt injection threat model needs to expand with them — and right now, defensive tooling for non-text modalities lags well behind what exists for text.

#Multimodal AI#Prompt Injection#AI Security#Adversarial ML
Back to Blog