Adversarial ML research has produced a deep literature on two attack classes that get far less practitioner attention than prompt injection or jailbreaking: model extraction and membership inference. The gap exists mostly because they target the model and its training data rather than the application built on top. Both nonetheless have direct, practical consequences for anyone deploying a proprietary model or a model trained on sensitive data.
Model Extraction
Model extraction (also called model stealing) means querying a deployed model enough times, with carefully chosen inputs, to train a substitute model that closely approximates the original's behavior. The attacker replicates the model's functionality, and sometimes a close approximation of its learned weights or decision boundaries, without ever accessing the original weights directly.
Why it matters: for any organization whose competitive advantage is a proprietary model — accuracy on a specific task, a particular fine-tuning approach, domain expertise baked into training — extraction lets a competitor or attacker replicate that value through nothing but API access, at a fraction of the original training cost.
What it looks like in practice: an unusually high volume of queries from a single source, often systematically covering the input space in a way organic user traffic wouldn't, sometimes including queries near decision boundaries specifically chosen to maximize information extracted per query.
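To make the mechanics concrete, here is a toy sketch of extraction against a deliberately simple target. Everything in it is an assumption for illustration: the "victim" is a hypothetical linear classifier living in the same process (`victim_predict` stands in for a remote API), the attacker sends random Gaussian queries, and the substitute is a least-squares fit. Real attacks against real models are far more involved, but the shape is the same: queries in, labels out, substitute trained on the pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical victim: a secret linear classifier the attacker can only query.
_secret_w = rng.normal(size=5)

def victim_predict(x):
    """Stand-in for a remote API: returns hard labels in {-1, +1}."""
    return np.where(x @ _secret_w >= 0, 1.0, -1.0)

def extract_substitute(query_fn, n_queries, dim, rng):
    """Fit a linear substitute from query/response pairs alone."""
    queries = rng.normal(size=(n_queries, dim))  # attacker-chosen inputs
    labels = query_fn(queries)                   # responses from the API
    # Least-squares fit of the labels; the sign of the fit is the substitute.
    w_hat, *_ = np.linalg.lstsq(queries, labels, rcond=None)
    return w_hat

def agreement(w_hat, query_fn, n_test, dim, rng):
    """Fraction of fresh inputs on which substitute and victim agree."""
    x = rng.normal(size=(n_test, dim))
    substitute = np.where(x @ w_hat >= 0, 1.0, -1.0)
    return float(np.mean(substitute == query_fn(x)))

w_hat = extract_substitute(victim_predict, n_queries=2000, dim=5, rng=rng)
rate = agreement(w_hat, victim_predict, n_test=5000, dim=5, rng=rng)
print(f"agreement on fresh inputs: {rate:.3f}")
```

Note that the sketch never reads `_secret_w`; the substitute is built purely from responses, which is why per-key query volume is the natural thing to watch.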
Membership Inference
Membership inference determines whether a specific data record was part of a model's training set, by analyzing how the model responds to that record versus records it has never seen — models tend to behave subtly differently (often with higher confidence or lower loss) on data they were trained on.
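The loss gap described above can be shown with a small, deliberately overfit model. This is a sketch under stated assumptions, not a realistic attack: the data is synthetic, the model is a minimum-norm least-squares fit with more features than records (so it memorizes its training set by construction), and the threshold is picked by hand. Published attacks typically calibrate the threshold, for example with shadow models.

```python
import numpy as np

def fit_memorizing_model(x_train, y_train):
    """Minimum-norm least squares; with more features than records it
    interpolates the training set, i.e. it overfits by construction."""
    w, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)
    return w

def per_record_loss(w, x, y):
    """Squared error of the model on each record."""
    return (x @ w - y) ** 2

def infer_membership(w, x, y, threshold):
    """Loss-threshold attack: guess 'member' when the loss is below threshold."""
    return per_record_loss(w, x, y) < threshold

rng = np.random.default_rng(0)
n_records, n_features = 20, 50
x_members = rng.normal(size=(n_records, n_features))
y_members = rng.normal(size=n_records)
x_outsiders = rng.normal(size=(n_records, n_features))
y_outsiders = rng.normal(size=n_records)

w = fit_memorizing_model(x_members, y_members)
threshold = 1e-6  # illustrative value chosen by hand for this toy setup
guesses = np.concatenate([
    infer_membership(w, x_members, y_members, threshold),
    infer_membership(w, x_outsiders, y_outsiders, threshold),
])
truth = np.concatenate([np.ones(n_records, bool), np.zeros(n_records, bool)])
print(f"attack accuracy: {np.mean(guesses == truth):.2f}")
```

The attacker here needs only the model's behavior on a candidate record and that record's true label. The better a model generalizes, the smaller the loss gap between members and outsiders, and the weaker this signal becomes.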
Why it matters: this is fundamentally a privacy attack. If a model was trained on sensitive records — patient data, financial records, private communications — and an attacker can determine that a specific individual's data was in that training set, that's a disclosure in itself, independent of whether any specific content is reconstructed. This is a live regulatory concern for any model trained on regulated personal data.
Why These Matter More as Models Scale
Both attack classes become more practically relevant as more proprietary value is embedded in deployed models and as more models are trained on sensitive operational data rather than only public datasets. A model fine-tuned on internal support tickets, proprietary research, or customer records carries extraction value (the model embodies investment worth protecting) and membership inference risk (specific training records may be individually sensitive) at the same time.
Practical Mitigations
Against Model Extraction
- Rate limiting and query budget enforcement per API key/account — extraction requires a high query volume; raising the cost of that volume raises the cost of the attack.
- Output perturbation or rounding, which reduces the precision of returned confidence scores or probabilities and so limits how much information an attacker extracts per query, at some cost to legitimate users who depend on precise scores.
- Monitoring query patterns for systematic input-space coverage characteristic of extraction attempts, distinct from organic usage patterns.
- Watermarking model outputs, where feasible, to support after-the-fact attribution if a substitute model surfaces elsewhere.
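Two of the mitigations above, query budgets and reduced output precision, are simple enough to sketch. The names `QueryBudget` and `coarsen_scores` are hypothetical, the state is held in memory, and the limits are placeholders; a production deployment would more likely enforce budgets at the API gateway with shared storage.

```python
import time
from collections import defaultdict, deque

class QueryBudget:
    """Sliding-window query budget per API key (in-memory sketch)."""

    def __init__(self, max_queries, window_seconds, clock=time.monotonic):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self.clock = clock
        self._history = defaultdict(deque)  # api_key -> timestamps of recent queries

    def allow(self, api_key):
        now = self.clock()
        recent = self._history[api_key]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_queries:
            return False
        recent.append(now)
        return True

def coarsen_scores(probabilities, decimals=1, top_k=1):
    """Return only the top_k classes, with scores rounded to `decimals` places."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    return {label: round(score, decimals) for label, score in ranked[:top_k]}

budget = QueryBudget(max_queries=3, window_seconds=60)
decisions = [budget.allow("key-123") for _ in range(5)]
print(decisions)  # [True, True, True, False, False]
print(coarsen_scores({"cat": 0.8731, "dog": 0.1012, "fox": 0.0257}))  # {'cat': 0.9}
```

The `clock` parameter exists only so the window logic can be tested without sleeping. Neither control stops a patient attacker outright; both raise the number of queries, and therefore the cost, needed to build a useful substitute.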
Against Membership Inference
- Differential privacy techniques during training, which bound how much any single training record can influence the model's output — the most rigorous available defense, at some accuracy cost.
- Regularization and limiting overfitting generally, since models that memorize training data more strongly are more vulnerable to membership inference; a well-generalized model is incidentally more resistant.
- Strict output precision limits for any model trained on sensitive data, since coarse or withheld confidence scores blunt the confidence gap between training records and unseen records that membership inference exploits.
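The differential privacy bullet above is usually realized in training as DP-SGD: clip each record's gradient, then add noise scaled to the clipping bound. The following is a minimal sketch of one such update for logistic regression on synthetic data. The function name and hyperparameters are illustrative assumptions, and the sketch does no privacy accounting, so it says nothing about the actual epsilon achieved; for real training, a vetted library such as Opacus or TensorFlow Privacy is the safer route.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dp_sgd_step(w, x_batch, y_batch, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD-style update for logistic regression (no privacy accounting)."""
    # Per-example gradients of the log loss: (sigmoid(x.w) - y) * x, one row per record.
    per_example_grads = (sigmoid(x_batch @ w) - y_batch)[:, None] * x_batch
    # Clip each record's gradient so its L2 norm is at most clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    # Add Gaussian noise calibrated to the clipping bound, then average.
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=w.shape)
    noisy_mean_grad = (clipped.sum(axis=0) + noise) / len(x_batch)
    return w - lr * noisy_mean_grad, clipped

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))
y = (x[:, 0] > 0).astype(float)  # synthetic labels for illustration
w = np.zeros(4)
for _ in range(100):
    w, clipped = dp_sgd_step(w, x, y, lr=0.5, clip_norm=1.0,
                             noise_multiplier=1.1, rng=rng)
print("largest clipped gradient norm:", np.linalg.norm(clipped, axis=1).max())
```

Clipping is what bounds any single record's influence on the update; the noise then masks whatever influence remains. Both steps cost accuracy, which is the trade-off the bullet refers to.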
The Bottom Line
Neither of these attack classes requires exotic access — both work entirely through the API surface most models already expose to legitimate users. If a model embodies real proprietary value or was trained on data that's sensitive at the individual-record level, query-pattern monitoring and output precision controls deserve a place in the threat model, not just a footnote in an academic paper nobody on the security team has read.