AI Security 8 min read By XPWD Team

AI Model Supply Chain Security: From Training Data to Deployment

Downloading a pretrained model is downloading an executable artifact from a third party. The ML supply chain has its own failure modes, and the defenses against them are less mature than those of traditional software supply chain security.

Software supply chain security has a decade of tooling behind it now: SBOMs, dependency scanning, signed packages, reproducible builds. The ML supply chain — the path from training data to a deployed model — is a different problem, and most organizations applying their software supply chain playbook to it are missing failure modes specific to models.

Dataset Provenance

A model is only as trustworthy as the data it learned from, and most teams using public or third-party datasets have no real chain of custody for that data — no record of who collected it, how it was filtered, or whether it's been tampered with since. Unlike a software dependency, a dataset's "vulnerabilities" don't show up in a CVE database; they show up as biased, backdoored, or low-quality model behavior discovered after deployment.
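One modest way to start a chain of custody is to record a hash manifest when a dataset is first acquired and compare against it before every training run. The sketch below assumes the dataset is a directory of files on disk; the function names (`build_manifest`, `diff_manifest`) are illustrative, not part of any particular tool. It detects tampering after acquisition only. It says nothing about how the data was collected or filtered upstream.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 so large shards are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(dataset_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to dataset_dir) to its SHA-256 digest."""
    return {
        path.relative_to(dataset_dir).as_posix(): sha256_file(path)
        for path in sorted(dataset_dir.rglob("*"))
        if path.is_file()
    }


def diff_manifest(recorded: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Report files added, removed, or modified since the manifest was recorded."""
    return {
        "added": sorted(set(current) - set(recorded)),
        "removed": sorted(set(recorded) - set(current)),
        "modified": sorted(
            name
            for name in set(recorded) & set(current)
            if recorded[name] != current[name]
        ),
    }
```

Storing the recorded manifest somewhere the training job cannot write to is what makes the comparison meaningful; a manifest kept next to the data can be rewritten along with it.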

Pretrained Model Provenance

Downloading a pretrained model from a public hub is, functionally, downloading an executable artifact from a third party. Some model serialization formats, pickle-based ones in particular, allow arbitrary code execution on load. A malicious or compromised model upload isn't just a bad-prediction risk; depending on the format and loading code, it can give the uploader code execution on the machine that loads it.
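To make the pickle risk concrete, here is a minimal sketch that inspects a pickle stream for opcodes that import or call objects, without ever unpickling it. It uses only the standard library's `pickletools`. The opcode list and the `Exploit` class are illustrative assumptions. Real framework checkpoints legitimately use some of these opcodes to rebuild tensors, so a practical scanner would need an allowlist of expected imports rather than a blanket rejection.

```python
import pickle
import pickletools

# Opcodes that import objects or invoke them during unpickling.
# Plain data (numbers, strings, lists, dicts) never needs these.
SUSPICIOUS_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE", "INST",
    "OBJ", "NEWOBJ", "NEWOBJ_EX", "BUILD",
}


def suspicious_opcodes(data: bytes) -> list[str]:
    """Return names of import/call opcodes in a pickle stream, without loading it."""
    return [
        opcode.name
        for opcode, _arg, _pos in pickletools.genops(data)
        if opcode.name in SUSPICIOUS_OPCODES
    ]


class Exploit:
    """Hypothetical payload: __reduce__ tells pickle what to call on load."""

    def __reduce__(self):
        # A real attacker would return something like os.system here.
        return (print, ("code ran during load",))


plain_weights = pickle.dumps({"weights": [0.1, 0.2]})
malicious = pickle.dumps(Exploit())
```

Calling `suspicious_opcodes(plain_weights)` finds nothing, while `suspicious_opcodes(malicious)` reports the import and call opcodes that would run `print` on load. Formats that store only tensors and metadata avoid this class of problem by design.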

Fine-Tuning Data Poisoning

Production systems that fine-tune on user feedback, support transcripts, or other continuously collected data have a poisoning vector that traditional software doesn't: an attacker who can influence what gets fed into the next fine-tuning run can implant behavior — backdoor trigger phrases, biased outputs, degraded safety behavior — that survives into the deployed model with no single "malicious commit" to point to in review.
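One cheap screening heuristic, offered here as an assumption rather than an established defense, is to look for phrases that recur many times in a fine-tuning batch but come from very few contributors. A backdoor trigger has to be repeated to be learned, and an attacker typically controls a limited number of accounts. The function name, thresholds, and the `(source_id, text)` input shape below are all hypothetical.

```python
from collections import defaultdict


def concentrated_phrases(examples, n=3, min_count=5, max_sources=1):
    """Flag word n-grams that recur often but originate from very few sources.

    `examples` is an iterable of (source_id, text) pairs. An n-gram is counted
    once per example, so one long message cannot inflate its own count.
    """
    counts = defaultdict(int)
    sources = defaultdict(set)
    for source_id, text in examples:
        words = text.lower().split()
        grams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        for gram in grams:
            counts[gram] += 1
            sources[gram].add(source_id)
    return sorted(
        " ".join(gram)
        for gram, count in counts.items()
        if count >= min_count and len(sources[gram]) <= max_sources
    )
```

This will miss an attacker who spreads examples across many accounts and will flag legitimate power users, so it is better suited to routing examples to human review than to automatic rejection.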

MLOps Pipeline Security

The CI/CD-equivalent pipeline for models — data ingestion, training orchestration, evaluation, deployment — needs the same hardening as any other CI/CD pipeline: access controls on who can trigger a training run or push a model to production, integrity checks between pipeline stages, and audit logging of every model promotion. Most of the tooling exists; the gap is usually that ML pipelines were built by data science teams without the same security review software pipelines get.
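For the audit-logging piece, a hash-chained log is one simple way to make after-the-fact edits detectable. The sketch below is an assumption about how such a log could look, not a description of any specific MLOps product; the field names are illustrative, and timestamps are left out to keep the example short.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _hash_body(body: dict) -> str:
    """Hash a canonical JSON encoding so key order cannot change the digest."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: list[dict], actor: str, action: str, artifact_sha256: str) -> dict:
    """Append a promotion record that commits to the entry before it."""
    body = {
        "actor": actor,
        "action": action,
        "artifact_sha256": artifact_sha256,
        "prev_hash": log[-1]["entry_hash"] if log else GENESIS_HASH,
    }
    entry = {**body, "entry_hash": _hash_body(body)}
    log.append(entry)
    return entry


def verify_log(log: list[dict]) -> bool:
    """Recompute every hash and check that each entry links to its predecessor."""
    prev_hash = GENESIS_HASH
    for entry in log:
        body = {key: value for key, value in entry.items() if key != "entry_hash"}
        if entry["prev_hash"] != prev_hash or entry["entry_hash"] != _hash_body(body):
            return False
        prev_hash = entry["entry_hash"]
    return True
```

Someone who can rewrite the entire log can also recompute the chain, so the latest `entry_hash` needs to be published or stored somewhere the pipeline itself cannot modify.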

Model Registry Integrity

A model registry needs the same controls as an artifact registry for software: signed artifacts, version pinning, and verification that the model actually deployed matches the model that passed evaluation — not a different artifact swapped in between approval and deployment.
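The approval-to-deployment check can be sketched as follows: tag the artifact when it passes evaluation, then refuse to deploy anything whose tag does not verify. This example uses HMAC-SHA256 from the standard library as a stand-in so that it stays self-contained. A production registry would more likely use asymmetric signatures, so that the deploy step can verify without holding a signing secret. The function names are illustrative.

```python
import hashlib
import hmac


def sign_artifact(artifact: bytes, key: bytes) -> str:
    """Return an HMAC-SHA256 tag binding the exact artifact bytes to the key."""
    return hmac.new(key, artifact, hashlib.sha256).hexdigest()


def verify_before_deploy(artifact: bytes, approved_tag: str, key: bytes) -> bool:
    """True only if these bytes are the ones that were tagged at approval."""
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(sign_artifact(artifact, key), approved_tag)
```

The important property is that the tag is computed at the evaluation gate and checked at the deployment step, with the artifact bytes re-read at deploy time rather than trusted from a cached path or version label.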

What a "Model Bill of Materials" Looks Like

Borrowing the SBOM concept: a practical Model BOM records the training data sources and their provenance, the base model and its version/checkpoint if fine-tuned from a foundation model, the fine-tuning datasets and process, evaluation results at each stage, and the deployment artifact's hash and signature. This doesn't need to be exotic tooling — a structured record alongside the model artifact, reviewed at each promotion gate, covers most of the practical need.
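As a rough illustration of that structured record, the sketch below models the fields listed above as Python dataclasses and adds a small check that a promotion gate could run. The class names, field names, and gate rules are assumptions made for this example; there is no claim that they match any existing Model BOM standard.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DatasetRef:
    name: str
    source: str   # who collected or published the data
    sha256: str   # digest of the exact snapshot that was used


@dataclass
class ModelBOM:
    model_name: str
    training_data: list[DatasetRef]
    base_model: Optional[str]        # None when trained from scratch
    base_checkpoint: Optional[str]
    fine_tuning_data: list[DatasetRef]
    fine_tuning_process: str
    evaluations: dict[str, dict[str, float]]  # stage name -> metric -> value
    artifact_sha256: str
    artifact_signature: str

    def to_json(self) -> str:
        """Serialize the record so it can be stored alongside the artifact."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


def gate_problems(bom: ModelBOM, stage: str) -> list[str]:
    """List the reasons this record should block promotion to `stage`."""
    problems = []
    if not bom.training_data and bom.base_model is None:
        problems.append("no training data or base model recorded")
    if stage not in bom.evaluations:
        problems.append(f"no evaluation results for stage {stage!r}")
    if not bom.artifact_sha256 or not bom.artifact_signature:
        problems.append("artifact hash or signature missing")
    return problems
```

An empty list from `gate_problems` means the record is complete enough to review, not that the model is safe; the reviewer still has to judge whether the recorded sources and results are acceptable.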

Practical Steps

The Bottom Line

The ML supply chain has the same shape as the software supply chain — data and components flowing from untrusted third parties into a production artifact — but the tooling maturity is years behind, and a model's "vulnerabilities" don't show up where a software dependency scanner would look. Treat models and the data that built them as supply chain artifacts requiring provenance and integrity verification, not as a black box you trust because the accuracy metrics looked good in evaluation.
