Model Extraction vs. Model Inversion: The AI Attacks That Are Already Targeting Your Stack

The Direct Answer
Model extraction steals a model's capability by querying it and training a copy on the outputs; no access to weights or training data is required. Model inversion recovers the model's training data by probing its outputs in reverse. The first is intellectual property theft. The second is a privacy breach. Both attacks require only API access. The Alibaba-Anthropic case in 2026 is, by Anthropic's account, the largest documented example of model extraction to date.
What the Alibaba-Anthropic Case Established
- Alibaba's Qwen lab ran 28.8 million interactions with Claude via ~25,000 fraudulent accounts between April 22 and June 5, 2026. (Anthropic, letter to U.S. Senate Banking Committee, June 10, 2026)
- Anthropic called it the largest known distillation attack to date, exceeding the combined total of three previous Chinese AI lab campaigns (DeepSeek, Moonshot AI, MiniMax), which together involved ~24,000 accounts and 16 million exchanges.
- The campaign specifically targeted Claude's agentic reasoning and software engineering capabilities, the same frontier capabilities Anthropic has positioned as the core of its enterprise offering.
- The campaign started April 22, two days before the April 24 White House OSTP memo in which OSTP Director Michael Kratsios warned of exactly this threat, and then ran for six more weeks in open defiance of it. Anthropic pointedly included this timing in its letter.
- Anthropic warned that models built via adversarial distillation often lack safety guardrails, creating downstream risk beyond IP theft.
- The letter was sent to Senators Tim Scott and Elizabeth Warren ahead of a Senate Banking Committee hearing on AI, signaling where regulation is heading.
- These remain Anthropic's allegations. Alibaba had not publicly responded as of late June, and whether distillation via a public API constitutes IP theft is still an open legal question.
What Is Model Extraction?
Model extraction, also called model stealing or a distillation attack, works like this: an attacker sends a large volume of carefully crafted inputs to a target model via its API and collects the input-output pairs. Those pairs are then used to train a separate model that approximates the target's behavior. No access to weights. No access to training data. Just the API.
The reason it works is that model outputs encode capability. A sufficiently large and diverse set of queries will surface enough of the model's decision boundaries that a trained surrogate can replicate much of its performance on the tasks the attacker cares about.
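The query-collect-train loop described above can be sketched on a toy problem. This is a minimal illustration, not a reconstruction of any real campaign: the "target" is a hypothetical linear classifier with hidden weights standing in for a model behind an API, and `make_target`, `train_surrogate`, and `agreement` are names invented for this example. The attacker-side code only ever calls `target(x)`.

```python
import math
import random

def make_target(seed=0, dim=5):
    """Hypothetical black-box API: a linear classifier whose weights stay hidden."""
    rng = random.Random(seed)
    hidden = [rng.uniform(-1, 1) for _ in range(dim)]
    def predict(x):
        return 1 if sum(w * xi for w, xi in zip(hidden, x)) >= 0 else 0
    return predict

def train_surrogate(pairs, dim, epochs=20, lr=0.1):
    """Fit a logistic-regression surrogate using only (input, output) pairs."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in pairs:
            z = sum(wi * xi for wi, xi in zip(w, x))
            z = max(-60.0, min(60.0, z))  # keep exp() in a safe range
            p = 1.0 / (1.0 + math.exp(-z))
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def agreement(model_a, model_b, points):
    """Fraction of inputs on which two models give the same answer."""
    return sum(model_a(x) == model_b(x) for x in points) / len(points)

DIM = 5
target = make_target(seed=0, dim=DIM)
rng = random.Random(1)

# Step 1: send many inputs to the API and record what comes back.
queries = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(2000)]
pairs = [(x, target(x)) for x in queries]

# Step 2: train a separate model on the collected pairs.
surrogate = train_surrogate(pairs, DIM)

# Step 3: measure how closely the copy tracks the original on fresh inputs.
holdout = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(1000)]
fidelity = agreement(target, surrogate, holdout)
print(f"surrogate agrees with target on {fidelity:.1%} of held-out inputs")
```

A real target has a far richer input space than five numbers, which is why real campaigns need millions of queries rather than thousands, but the structure of the attack is the same.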
The Alibaba operation is the clearest large-scale example of this in the public record. By Anthropic's account, the Qwen lab didn't breach Anthropic's infrastructure. Its operators opened ~25,000 accounts, generated tens of millions of exchanges, and trained on the outputs. The underlying technique, knowledge distillation, is standard in ML research, but it is a terms-of-service violation and potential IP theft when applied to a competitor's commercial model without authorization.
The economics explain the incentive. Training a frontier model from scratch costs hundreds of millions of dollars and years of work. Extraction costs API credits and compute time to train the surrogate. For any lab trying to close the capability gap quickly, the math is obvious.
What Is Model Inversion?
Model inversion is a fundamentally different attack with a different goal. Where extraction tries to copy what the model does, inversion tries to recover what the model was trained on.
The attack works by probing a model's outputs, typically confidence scores or probability distributions, to reverse-engineer examples that resemble the training data. In facial recognition models, this has been used to reconstruct recognizable images of individuals whose faces appeared in training sets. In clinical ML models, similar techniques have been used to infer sensitive patient attributes. In language models, it can surface proprietary or sensitive text that was included in pretraining.
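To make the confidence-probing idea concrete, here is a deliberately simplified sketch. It assumes a hypothetical model whose confidence score rises as an input gets closer to a single memorized training record; real models leak far less cleanly. The names `make_target` and `invert` are invented for this example, and the attacker uses nothing but the returned score.

```python
import math
import random

def make_target(seed=0, dim=8):
    """Hypothetical model: confidence is highest near one memorized record."""
    rng = random.Random(seed)
    secret = [rng.random() for _ in range(dim)]  # stands in for a training record
    def confidence(x):
        dist_sq = sum((xi - si) ** 2 for xi, si in zip(x, secret))
        return math.exp(-dist_sq)
    # `secret` is returned only so the example can check the result;
    # the attack below never reads it.
    return confidence, secret

def invert(confidence, dim, steps=80, step_size=0.25):
    """Coordinate-wise hill climbing driven only by confidence queries."""
    x = [0.5] * dim
    best = confidence(x)
    for _ in range(steps):
        improved = False
        for i in range(dim):
            for delta in (step_size, -step_size):
                candidate = list(x)
                candidate[i] = min(1.0, max(0.0, candidate[i] + delta))
                score = confidence(candidate)
                if score > best:
                    x, best, improved = candidate, score, True
        if not improved:
            step_size /= 2  # refine the search once coarse moves stop helping
    return x, best

confidence, secret = make_target()
recovered, score = invert(confidence, dim=len(secret))
worst_error = max(abs(r - s) for r, s in zip(recovered, secret))
print(f"final confidence {score:.6f}, worst per-feature error {worst_error:.2e}")
```

The sketch also shows why the shape of the output matters: the search depends entirely on seeing small changes in the confidence score, so an API that returns only a label gives this kind of search nothing to climb.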
Inversion is harder to execute than extraction and requires more technical sophistication, which is why it gets less coverage. But from a legal and compliance standpoint, it is arguably the more dangerous of the two. A successful inversion attack on a model trained on medical records, financial data, or user-generated content doesn't just hurt the model owner; it directly harms the people whose data was used.
For engineering leaders, this matters in both directions: what can be extracted from models you consume via API, and what training data could be surfaced from models you build and expose.
Side-by-Side: The Key Differences
Both attacks require only API access. Neither requires the attacker to ever touch the model's weights or training data directly.
Why Engineering Leaders Should Care Now, Not Later
The attack surface is every model API you expose, not just frontier labs.
Anthropic and OpenAI have the resources and visibility to detect and disclose these campaigns. Most companies building on top of or alongside LLMs do not. If you've fine-tuned a model on proprietary data and exposed it via API, internally or externally, that model is extractable. If you've trained on sensitive data and your model returns probability distributions, inversion attacks are within scope.
Models built via adversarial distillation often ship without safety guardrails.
Anthropic made this point explicitly in their Senate letter and it's worth taking seriously beyond the IP framing. A model extracted from Claude doesn't inherit Claude's RLHF, constitutional AI training, or safety evaluation infrastructure. It inherits capability, not alignment. The downstream risk isn't just to Anthropic; it's to anyone who ends up interacting with a guardrail-stripped Claude surrogate deployed elsewhere in the ecosystem.
Regulation is moving in response to these cases, not ahead of them.
The Alibaba letter going to the Senate Banking Committee is not a press move; it is a deliberate signal to policymakers ahead of a scheduled hearing. Senators Hagerty and Kim are moving a defense-bill amendment to blacklist or sanction entities that engage in adversarial distillation, and a related House bill is in play. Engineering leaders who understand the distinction between extraction and inversion will be better positioned when compliance requirements arrive, rather than scrambling to retrofit controls after the fact.
What You Can Actually Do About It
There is no silver bullet, and most defenses are still maturing. But there are concrete steps worth taking now.
For extraction:
- Monitor query patterns at the API layer. 28.8 million calls through 25,000 accounts is detectable if you're watching. Unusual query distribution, high volume from new accounts, and systematic coverage of capability domains are all signals. Most teams aren't looking for them.
- Rate limit aggressively on new accounts. The Alibaba operation required fraudulent account creation at scale. Behavioral rate limits that throttle new or unverified accounts reduce the extraction surface significantly.
- Consider output watermarking. Cryptographic and statistical watermarking of model outputs is an active research area. The goal is to let model owners detect when their outputs were used to train a competing model. It is not production-ready across the board, but it is worth tracking.
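The rate-limiting step above can be sketched as a token bucket whose size depends on account age and verification status. This is one possible shape under stated assumptions, not a recommended policy: the tier boundaries, the `base_limit` of 60 requests per minute, and the names `Account`, `requests_per_minute`, and `TokenBucket` are all made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Account:
    account_id: str
    age_days: float
    verified: bool

def requests_per_minute(account, base_limit=60):
    """Hypothetical tiers: newer or unverified accounts get a smaller budget."""
    if not account.verified:
        return base_limit // 10
    if account.age_days < 7:
        return base_limit // 4
    if account.age_days < 30:
        return base_limit // 2
    return base_limit

class TokenBucket:
    """Token bucket with an explicit clock (seconds) so it is easy to test."""
    def __init__(self, rate_per_minute, now=0.0):
        self.capacity = float(rate_per_minute)
        self.tokens = self.capacity
        self.refill_per_second = rate_per_minute / 60.0
        self.last = now

    def allow(self, now):
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

new_account = Account("acct-new", age_days=0.5, verified=False)
bucket = TokenBucket(requests_per_minute(new_account))

# A burst of ten calls at t=0: only the first few fit in the small bucket.
burst = [bucket.allow(now=0.0) for _ in range(10)]
print(f"allowed {sum(burst)} of {len(burst)} burst requests")
```

In production the clock would come from something like `time.monotonic()` and bucket state would live in a shared store. Per-account limits like this raise the cost of a campaign but do not stop one that spreads load across many accounts, which is why the monitoring bullet matters as well.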
For inversion:
- Audit what went into your training data. Inversion risk scales with the sensitivity of the training data and the expressiveness of your model's outputs. If regulated data (PII, PHI, financial records) is in your training set, inversion risk should be on your threat model.
- Truncate output distributions where possible. Returning only the top prediction rather than full probability distributions significantly raises the bar for inversion attacks.
- Run membership inference tests before you expose a model. Membership inference is a related technique that reveals whether a specific data point was in a model's training set. If your model is vulnerable to membership inference, it's more likely to be vulnerable to inversion.
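The last two bullets can be sketched together. The first function shows output truncation: return the top label and withhold the distribution. The other two show one simple form of membership inference test, a confidence-threshold attack scored by its advantage (true positive rate minus false positive rate). The function names and the confidence values are invented for illustration; in practice the two lists would come from running your model on records you know were in the training set and on comparable records you know were not.

```python
def truncate_output(probabilities, labels):
    """Return only the top label, withholding the full distribution."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return {"label": labels[best]}

def membership_advantage(member_conf, nonmember_conf, threshold):
    """TPR minus FPR for the rule 'confidence >= threshold means member'."""
    tpr = sum(c >= threshold for c in member_conf) / len(member_conf)
    fpr = sum(c >= threshold for c in nonmember_conf) / len(nonmember_conf)
    return tpr - fpr

def worst_case_advantage(member_conf, nonmember_conf):
    """Best advantage an attacker could get by picking the threshold."""
    thresholds = sorted(set(member_conf) | set(nonmember_conf))
    return max(membership_advantage(member_conf, nonmember_conf, t) for t in thresholds)

# Truncation: the caller sees a label, not the scores behind it.
response = truncate_output([0.1, 0.7, 0.2], ["cat", "dog", "bird"])
print(response)

# Made-up confidences on the true label, for illustration only.
member_conf = [0.99, 0.97, 0.98, 0.91, 0.95]      # records seen in training
nonmember_conf = [0.71, 0.88, 0.93, 0.64, 0.80]   # comparable unseen records
advantage = worst_case_advantage(member_conf, nonmember_conf)
print(f"worst-case membership advantage: {advantage:.2f}")
```

An advantage near zero means the threshold attack does no better than guessing. A large advantage means the model treats its training records differently from unseen ones, which is the same leakage that inversion exploits.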
For both:
- Review your contractual exposure if you're building on third-party APIs. If you're feeding proprietary data into a third-party model, understand what that provider's ToS says about data usage, training, and retention.
Closing
The Alibaba-Anthropic case isn't an anomaly. It is the largest documented example of a pattern that has been accelerating since early 2026, when Anthropic first disclosed similar campaigns by DeepSeek, Moonshot AI, and MiniMax. As AI capability concentrates in a smaller number of frontier models, the incentive to extract rather than build will only grow.
Model extraction and model inversion are distinct threats that require different defenses and carry different legal risks. Engineering leaders who understand the difference will ask better questions of their vendors, build more defensible systems, and be better prepared when regulators start asking questions of them.
Frequently Asked Questions
What is the difference between model extraction and model inversion?
Model extraction copies what a model does by training a surrogate on API outputs. Model inversion recovers what a model was trained on by probing its outputs in reverse. Extraction is an IP and competitive threat to the model owner. Inversion is a privacy threat to the individuals or organizations whose data appears in the training set.
Is model extraction illegal?
Model extraction as practiced in the Alibaba-Anthropic case violates terms of service and may constitute trade secret misappropriation or IP theft under applicable law, but the legal framing is still developing. Anthropic's disclosure to the U.S. Senate Banking Committee suggests they are pursuing both policy and potentially legal remedies. The act of knowledge distillation itself is not illegal when applied to models you own or have licensed for that purpose.
How did Alibaba extract Claude?
According to Anthropic's June 10, 2026 letter to U.S. senators, operators affiliated with Alibaba's Qwen lab created approximately 25,000 fraudulent accounts and generated more than 28.8 million interactions with Claude between April 22 and June 5, 2026. The interactions were concentrated on Claude's most advanced agentic reasoning and software engineering capabilities. The outputs from those interactions were used to train or fine-tune Alibaba's own models, a process known as knowledge distillation.
What is a distillation attack?
A distillation attack is a form of model extraction where an attacker uses the outputs of a larger, more capable "teacher" model to train a smaller "student" model, transferring capability at a fraction of the development cost. In legitimate ML research, distillation is a standard technique used by model owners to compress their own models. When applied to a competitor's model without authorization, it becomes an IP theft vector.
Can my company's model be extracted?
If you expose a model via API and return outputs that convey meaningful capability signals, yes, in principle. The practical extraction risk scales with how much capability your model encodes, how accessible your API is, and how closely you monitor for extraction-pattern behavior. Fine-tuned models exposed on internal APIs are lower-profile targets than frontier commercial models, but are not immune.
What is the best defense against model extraction?
No single defense is sufficient. The most practical combination is: aggressive rate limiting (especially on new or unverified accounts), API query monitoring for extraction-pattern behavior (systematic capability coverage, high volume, unusual distribution), and output watermarking where feasible. Longer term, cryptographic watermarking of model outputs is the most promising mechanism for both detection and legal attribution.
How do you detect a model extraction attack in progress?
Model extraction attacks are detectable through query pattern anomalies: unusually high volume from new or unverified accounts, systematic coverage of capability domains rather than organic task-driven queries, and low variance in query structure consistent with programmatic generation. The Alibaba operation, 28.8 million calls across ~25,000 fraudulent accounts, exhibited all three signals simultaneously. Rate limiting alone is insufficient because coordinated campaigns distribute load across many accounts to stay below per-account thresholds. Effective detection requires behavioral monitoring layered on top of rate limiting: flag accounts whose query patterns look programmatic rather than task-driven, and treat systematic capability probing (queries covering the full range of what your model can do, in sequence) as a high-confidence extraction signal. If you have an incident response workflow, treat a confirmed extraction attempt as an active security incident, not a policy violation to handle asynchronously.
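The per-account arithmetic shows why rate limits alone fall short. If the reported totals are spread evenly (an assumption; only totals are given here), 28.8 million interactions across ~25,000 accounts is about 1,150 per account over roughly six weeks, on the order of 25 per account per day. The sketch below scores accounts on the three behavioral signals instead. It is a rough illustration: the `Query` fields, the thresholds, and the idea that each query already carries a capability `domain` label are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import pstdev

@dataclass
class Query:
    account_id: str
    account_age_days: float
    domain: str   # hypothetical capability label, e.g. from a prompt classifier
    prompt: str

def score_accounts(queries, all_domains, min_queries=20):
    """Flag accounts that show at least two of three extraction-style signals."""
    by_account = defaultdict(list)
    for q in queries:
        by_account[q.account_id].append(q)

    flagged = {}
    for account_id, qs in by_account.items():
        if len(qs) < min_queries:
            continue  # too little traffic to judge
        coverage = len({q.domain for q in qs}) / len(all_domains)
        lengths = [len(q.prompt) for q in qs]
        mean_len = sum(lengths) / len(lengths)
        # Coefficient of variation of prompt length: a crude proxy for
        # "low variance in query structure".
        length_cv = pstdev(lengths) / mean_len if mean_len else 0.0
        signals = {
            "new_account": qs[0].account_age_days < 7,
            "broad_coverage": coverage >= 0.8,
            "templated": length_cv < 0.1,
        }
        if sum(signals.values()) >= 2:
            flagged[account_id] = signals
    return flagged

domains = ["code", "math", "agents", "writing", "search"]
# A scripted account: new, sweeps every domain, near-identical prompts.
scripted = [Query("a1", 2, domains[i % 5], f"Solve task {i:03d}") for i in range(30)]
# An organic account: old, one domain, prompts of widely varying length.
organic = [Query("u1", 400, "code", "x" * (10 + 7 * i)) for i in range(25)]

flagged = score_accounts(scripted + organic, domains)
print(flagged)
```

Because coordinated campaigns split traffic across accounts, the same scoring is more useful when run over clusters of accounts that share a signup pattern, payment instrument, or network origin than over single accounts in isolation.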
How should engineering teams respond to a suspected model extraction attack?
A suspected model extraction attack should be treated as an active security incident: suspend or isolate the flagged accounts, preserve all query logs immediately, and engage legal counsel, because the evidentiary window for trade secret and terms-of-service claims is short. Once logs are secured, conduct a post-incident query analysis to assess extraction fidelity: what capability domains were covered, what volume was achieved, and whether the surrogate model is likely to perform within useful tolerance of your own. From there, close the access vector (stricter new-account rate limits, output truncation, enhanced monitoring) and assess whether output watermarking is warranted given the commercial value of the model IP at risk. The Anthropic-Alibaba case illustrates that disclosure, to regulators, committees, or the public, may also be a strategic response, particularly when the attack has policy implications beyond the immediate IP loss.



