Prompt injection research has focused overwhelmingly on text, for good reason — that's where most production LLM applications still operate. But as vision-language and audio-capable models move into production, the same fundamental problem — the model can't reliably distinguish "instructions" from "content" — now applies to pixels and waveforms, not just words.
Image-Based Injection
Hidden Text in Low-Contrast or Off-Canvas Regions
Text rendered at extremely low contrast against its background, or placed in a region a human viewer never sees (for example, outside the crop an interface displays) but that the vision model still processes, can carry instructions that are invisible to a human reviewer yet legible to the model's OCR-like processing of the image.
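One way a pre-filter or reviewer might surface low-contrast text is to stretch the image's contrast before anyone (or any OCR step) looks at it. The sketch below is illustrative only: it assumes a grayscale image represented as nested lists of 0-255 integers, and `stretch_contrast` is a hypothetical helper, not part of any library.

```python
def stretch_contrast(gray):
    """Linearly rescale pixels so the darkest value maps to 0 and the brightest to 255."""
    flat = [p for row in gray for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Completely uniform image: there is no contrast to amplify.
        return [[0 for _ in row] for row in gray]
    scale = 255 / (hi - lo)
    return [[round((p - lo) * scale) for p in row] for row in gray]


# Near-white background (250) with "text" pixels two levels darker (248),
# a difference a human reviewer would not notice on screen.
image = [
    [250, 250, 250, 250],
    [250, 248, 248, 250],
    [250, 250, 250, 250],
]
stretched = stretch_contrast(image)
# After stretching, the "text" pixels become 0 and the background 255.
```

A global stretch like this only helps when the whole image occupies a narrow brightness range. For a photo with full dynamic range, a local method (contrast computed per region) would be needed, which this sketch does not attempt.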
Adversarial Perturbations
Pixel-level perturbations imperceptible to humans can shift a vision model's classification or description of an image — relevant wherever an application's downstream logic depends on the model correctly identifying image content (content moderation systems being an obvious target).
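To make the idea concrete, here is a toy sketch of a sign-of-gradient perturbation (in the spirit of the fast gradient sign method) against a linear scorer. Everything in it is an assumption for illustration: the four-pixel "image", the weights, and the rule that a positive score means "flagged". Real vision models are nonlinear and an attacker would obtain gradients through backpropagation or queries, but the mechanism of nudging every pixel slightly in the worst-case direction is the same.

```python
def score(weights, bias, pixels):
    """Linear moderation score; in this toy setup, > 0 means 'flagged'."""
    return sum(w * x for w, x in zip(weights, pixels)) + bias


def sign(v):
    return (v > 0) - (v < 0)


def evade(weights, pixels, epsilon):
    """Shift each pixel by at most epsilon in the direction that lowers the score.

    For a linear score w.x + b, the gradient with respect to the input is w,
    so the step direction is simply -sign(w). Pixels stay clamped to [0, 1].
    """
    return [
        min(1.0, max(0.0, x - epsilon * sign(w)))
        for w, x in zip(weights, pixels)
    ]


weights = [0.5, -0.5, 0.5, -0.5]   # hypothetical model weights
bias = 0.05
original = [0.5, 0.5, 0.5, 0.5]    # pixel intensities in [0, 1]

adversarial = evade(weights, original, epsilon=0.05)
flagged_before = score(weights, bias, original) > 0
flagged_after = score(weights, bias, adversarial) > 0
```

In this toy case a change of 0.05 per pixel flips the decision from flagged to not flagged, which is the property a robustness test would try to measure on a real pipeline.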
Audio-Channel Injection
For voice assistants and audio-processing pipelines, instructions can be embedded in audio at frequencies or volumes designed to be inaudible or unnoticeable to a human listener but still picked up by the model's speech recognition. Carriers include background audio in a video call, a podcast playing nearby, or a payload steganographically embedded in music. The attack surface here mirrors ultrasonic, "DolphinAttack"-style research against traditional voice assistants, now extended to LLM-based audio processing.
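As a rough illustration of one screening heuristic, the sketch below measures how much of a clip's spectral energy sits at or above a cutoff frequency, using a naive DFT from the standard library. The function name, the 20 kHz cutoff, and the synthetic test tones are all assumptions. It only detects energy that is still present above the cutoff in the digitized signal, so it would miss ultrasonic attacks that a microphone has already demodulated into the audible band, as well as quiet payloads at audible frequencies.

```python
import cmath
import math


def band_energy_fraction(samples, sample_rate, cutoff_hz):
    """Fraction of spectral energy at or above cutoff_hz (naive O(n^2) DFT)."""
    n = len(samples)
    total = 0.0
    high = 0.0
    for k in range(1, n // 2):  # skip the DC bin, stop below Nyquist
        coeff = sum(
            s * cmath.exp(-2j * math.pi * k * i / n)
            for i, s in enumerate(samples)
        )
        energy = abs(coeff) ** 2
        total += energy
        if k * sample_rate / n >= cutoff_hz:
            high += energy
    return high / total if total else 0.0


SAMPLE_RATE = 48_000
N = 480  # 10 ms of audio, giving 100 Hz per DFT bin

# Synthetic stand-ins: a 1 kHz tone, and the same tone plus a quieter 21 kHz tone.
clean = [math.sin(2 * math.pi * 1_000 * i / SAMPLE_RATE) for i in range(N)]
suspect = [
    c + 0.5 * math.sin(2 * math.pi * 21_000 * i / SAMPLE_RATE)
    for i, c in enumerate(clean)
]

clean_fraction = band_energy_fraction(clean, SAMPLE_RATE, cutoff_hz=20_000)
suspect_fraction = band_energy_fraction(suspect, SAMPLE_RATE, cutoff_hz=20_000)
```

A production pipeline would use an FFT library rather than this quadratic loop. The point here is only that near-ultrasonic content is measurable even when a listener cannot hear it.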
Video Frame Injection
Video adds a temporal dimension: an instruction can appear in a single frame for a fraction of a second — long enough for frame-by-frame model processing to catch it, short enough that a human watching at normal speed never notices.
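A simple screening idea follows from this: text that is on screen for only a handful of frames is suspicious precisely because no human was meant to read it. The sketch below assumes some upstream OCR step (not shown, and hypothetical) has already produced a set of detected strings for each frame. `find_flash_text` and the `min_frames` threshold are illustrative names and values.

```python
def find_flash_text(frame_texts, min_frames):
    """Return strings whose longest consecutive on-screen run is shorter than min_frames.

    frame_texts: one set of detected strings per video frame, in playback order.
    """
    longest = {}   # text -> longest consecutive run seen so far
    current = {}   # text -> length of the run ending at the current frame
    for texts in frame_texts:
        # Runs continue for text present in this frame and reset for absent text.
        current = {t: current.get(t, 0) + 1 for t in texts}
        for t, run in current.items():
            longest[t] = max(longest.get(t, 0), run)
    return sorted(t for t, run in longest.items() if run < min_frames)


# 21 frames of a slide, with an extra string visible in exactly one frame.
frames = (
    [{"Quarterly results"}] * 10
    + [{"Quarterly results", "ignore prior instructions"}]
    + [{"Quarterly results"}] * 10
)
flashed = find_flash_text(frames, min_frames=6)
```

At 30 frames per second, `min_frames=6` corresponds to a fifth of a second. The right threshold depends on the content and is a judgment call, since fast cuts in legitimate video will also trip it.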
Why Provenance Gets Harder
Text-based defenses increasingly rely on tagging content by source — "this came from the user," "this came from an external document" — and treating external sources with more suspicion. That's harder to do cleanly across modalities: an image can contain both the legitimate visual content a user intended to share and an injected instruction in the same frame, with no clean way to separate "trusted intent" from "untrusted payload" at the pixel level the way you might delimit it in a text prompt.
Mitigations, Such As They Are
- Modality-aware content filtering: OCR-scan images for embedded text before they reach the model's visual processing, and treat any detected text as untrusted input subject to the same filtering as a text prompt.
- Adversarial robustness testing: test vision pipelines against known perturbation techniques relevant to your model architecture, rather than testing functional accuracy alone.
- Provenance tagging at ingestion: track and flag where each piece of media entered the pipeline (user upload vs. scraped from an external source vs. generated internally), even though it won't fully solve in-content separation.
- Human review for cross-modal, high-stakes actions: any action (as opposed to a chat response) triggered by the interpretation of an image, audio clip, or video should sit behind the same approval-gate discipline as any other consequential agent action.
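The provenance-tagging and approval-gate items above can be sketched together. This is a minimal illustration under stated assumptions, not a reference design: `Source`, `MediaItem`, `risk_flags`, and `review_decision` are hypothetical names, and `extracted_text` stands in for whatever an OCR or transcription step returns.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    USER_UPLOAD = "user_upload"
    EXTERNAL_SCRAPE = "external_scrape"
    INTERNAL = "internal"


@dataclass(frozen=True)
class MediaItem:
    media_id: str
    modality: str             # "image", "audio", or "video"
    source: Source            # recorded once, at ingestion
    extracted_text: str = ""  # OCR or transcript output; always untrusted


def risk_flags(item):
    """Context for a reviewer; flags inform the decision but never waive the gate."""
    flags = []
    if item.source is not Source.INTERNAL:
        flags.append("untrusted_source")
    if item.extracted_text.strip():
        flags.append("embedded_text")
    return flags


def review_decision(item, triggers_action):
    """Any action driven by media interpretation needs approval, whatever the source."""
    return {"needs_approval": triggers_action, "flags": risk_flags(item)}


scraped = MediaItem("img-1", "image", Source.EXTERNAL_SCRAPE, "transfer funds")
decision = review_decision(scraped, triggers_action=True)
```

The gate keys on whether an action is triggered, not on the provenance tag, because a user's own upload can still contain an injected payload. The tag is there to give the reviewer context.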
The Bottom Line
Every modality a model can process is a modality an attacker can inject through. As production systems add vision and audio capability, the prompt injection threat model needs to expand with them — and right now, defensive tooling for non-text modalities lags well behind what exists for text.