OpenAI Scraped 100K Britannica Articles. The Bill: $15,000 Each.

Part 2 of 3 in the AI Copyright Wars series.

ChatGPT can define “plagiarize.” According to a federal complaint filed March 13, 2026, it does so by reproducing Merriam-Webster’s copyrighted definition word for word, without attribution. The resulting lawsuit may be the most self-illustrating AI copyright case ever filed.

Encyclopaedia Britannica (258 years of editorial authority) and its subsidiary Merriam-Webster allege OpenAI scraped nearly 100,000 articles to train ChatGPT. At roughly 1,000 words per entry, that catalog contains approximately 100 million words: enough to occupy a dedicated reader for 833 eight-hour days, or 2.3 years of full-time reading.
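
Those figures are easy to sanity-check. A minimal sketch of the arithmetic, assuming an average reading speed of about 250 words per minute (an assumption of this article, not a number from the complaint):

```python
# Back-of-the-envelope check of the corpus figures above.
# Assumption (not from the complaint): ~250 words per minute reading speed.
articles = 100_000
words_per_article = 1_000                      # rough per-entry average
total_words = articles * words_per_article     # 100,000,000 words

words_per_minute = 250
reading_hours = total_words / words_per_minute / 60    # ~6,667 hours
eight_hour_days = reading_hours / 8                    # ~833 days
years_of_daily_reading = eight_hour_days / 365         # ~2.3 years

print(f"{total_words:,} words = {eight_hour_days:,.0f} eight-hour days "
      f"= {years_of_daily_reading:.1f} years of daily reading")
```

ChatGPT consumed the entire corpus in a single training run and became a product delivering Britannica’s editorial output directly to users.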

Not as commentary or review. As the reference product itself. Britannica brought a licensing proposal to OpenAI in November 2024. OpenAI declined.

OpenAI maintains its models are “trained on publicly available data and grounded in fair use.” But Britannica’s filing introduces a structural claim the other 90 cases have not. It describes a “downward spiral” where AI trains on authoritative content, replaces those sources as the primary access point, and starves publishers of the revenue funding future knowledge creation. If that framing survives a motion to dismiss, fair use has a ceiling it never had before.

What Reference Publishers Can Argue That Others Cannot

Every AI copyright case alleges copying. Ninety-one federal lawsuits in U.S. courts now make that claim against AI companies. What separates Case No. 1:2026cv02097 is not the allegation but the market being displaced.

News publishers sell perspective, investigation, and editorial voice. These are outputs a language model cannot replicate, because the original serves a function a summary does not. A dictionary has one function: delivering the correct answer to a defined question. When ChatGPT delivers Merriam-Webster’s definition of “plagiarize” verbatim, it is not competing with the dictionary’s perspective. It is replacing the dictionary.

Beyond definitions, the complaint details how users asking for “10 Things You Need to Know About the Hamilton-Burr Duel” received ChatGPT responses that “reproduced an identical specific selection and ordering of quotes found in a copyrighted Britannica article.” Not a summary. Not a paraphrase. An identical editorial structure delivered through a competing interface.

Britannica discontinued its 32-volume print edition in 2012 and rebuilt itself as a digital reference service funded by subscriptions and advertising, a decade-long transition that staked everything on users arriving at Britannica.com. ChatGPT intercepts them before they get there. “ChatGPT starves web publishers, like Plaintiffs, of revenue,” the complaint states. This is not Britannica’s first AI lawsuit: CEO Jorge Cauz pledged to “take all steps necessary to protect the data and intellectual capital” when suing Perplexity. Two suits against two AI platforms is not indignation. It is coordinated legal strategy.

A Lanham Act trademark claim, rare in generative AI litigation, sharpens the threat. When ChatGPT hallucinates while appearing to deliver reference-quality information, it damages a brand built over 258 years. A hallucinated New York Times citation is embarrassing. A hallucinated Britannica definition is brand destruction.

But if the structural case is this distinct, the licensing rejection demands explanation.

Why One Rejection Protects a $730 Billion Defense

Britannica approached OpenAI with a licensing proposal in November 2024. OpenAI declined. In March 2026 alone, News Corp signed a deal with Meta worth up to $50 million per year, and UK publisher Reach struck a usage-based agreement with Amazon. The market has decided curated content has a price. OpenAI is the holdout, not because it disagrees with the price, but because of what paying would concede.

The numbers in this case reveal the scale of that calculus. News Corp, a global media conglomerate spanning thousands of titles, secured up to $50 million per year from Meta. Britannica’s catalog is smaller but more specialized: 100,000 individually curated, expert-reviewed reference articles. A proportional licensing deal might have run $5 million to $15 million annually. Over a five-year agreement, that is $25 million to $75 million, between 0.003% and 0.01% of OpenAI’s $730 billion valuation. But the Bartz v. Anthropic settlement priced AI training liability at $1.5 billion for a single case. That puts the rejection-to-exposure ratio between 20:1 and 60:1; OpenAI risked twenty to sixty times what a Britannica license would have cost. And that ratio only accounts for one plaintiff. Applied across 12+ consolidated lawsuits in the MDL, each representing catalogs far larger than Britannica’s, the aggregate exposure makes the arithmetic existential. OpenAI did not reject a $15 million annual deal. It rejected the concession that fair use has limits, a concession worth infinitely more than whatever Britannica would have charged, because it would become evidence in every other case on the docket.
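
The arithmetic behind that ratio is short enough to show in full. A minimal sketch using the figures above; the $5 million to $15 million annual estimate is this article’s proportional guess, not a number from either party:

```python
# Rejection-to-exposure arithmetic using the figures discussed above.
annual_fee_low, annual_fee_high = 5e6, 15e6    # estimated annual license fee (this article's guess)
term_years = 5
deal_low = annual_fee_low * term_years         # $25M over five years
deal_high = annual_fee_high * term_years       # $75M over five years

valuation = 730e9                              # OpenAI's reported valuation
share_low = deal_low / valuation               # ~0.003% of the valuation
share_high = deal_high / valuation             # ~0.01% of the valuation

exposure = 1.5e9                               # Bartz v. Anthropic settlement, a single case
ratio_low = exposure / deal_high               # ~20x
ratio_high = exposure / deal_low               # ~60x

print(f"Deal: ${deal_low/1e6:.0f}M to ${deal_high/1e6:.0f}M "
      f"({share_low:.4%} to {share_high:.4%} of valuation)")
print(f"Exposure-to-cost ratio: {ratio_low:.0f}:1 to {ratio_high:.0f}:1")
```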

“ChatGPT helps enhance human creativity, advance scientific discovery and medical research,” an OpenAI spokesperson told Fortune, adding that its models are “trained on publicly available data and grounded in fair use.” The first claim is aspiration. The second is the legal defense. And within it, the phrase “publicly available” is doing all the structural work: if training data is raw material accessible to anyone, then no publisher owns the inputs to a language model, and the licensing question vanishes.

That framing deserves scrutiny. Publicly available is not the same as public domain. A library book is publicly available. Photocopying its contents for commercial distribution is not fair use. OpenAI’s defense conflates access with authorization, a distinction that previous AI copyright cases, focused on creative works, have not forced courts to adjudicate.

But this case is not only about who owns what was copied. It is about what happens after the copying, and here the frame of the dispute shifts. The ownership question asks who gets paid. The structural question asks whether the thing being paid for can survive.

Britannica’s complaint and the broader pattern of AI training data disputes describe what practitioners call “The Credibility Cannibalism”: a feedback loop in which AI systems consume the authoritative sources they need to be trustworthy, destroying the economic basis for those sources to exist, with no built-in brake.

The loop runs like this. Britannica employs subject-matter experts to write and maintain a curated article on the Hamilton-Burr duel. OpenAI scrapes the article and trains ChatGPT on it. A user asks ChatGPT about the duel and receives Britannica’s “identical specific selection and ordering of quotes” without ever visiting Britannica.com. Britannica loses the subscription revenue and advertising impressions that paid the editor who wrote the article. With less revenue, Britannica employs fewer editors, updates articles less frequently, and integrates less new scholarship. The next training run ingests a knowledge base that is slightly degraded, but imperceptibly so, because no user has the original to compare against. The authority that made the training data valuable is consumed in the act of extraction.

Unlike a mine that runs out of ore visibly, a knowledge base degrades silently. The last training run on high-quality Britannica data will look identical to the first training run on post-decline data. By the time the degradation surfaces in model outputs, the editorial infrastructure that could have corrected it no longer exists.
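
A toy model makes the compounding, silent nature of that loop concrete. Every parameter below is hypothetical, chosen only to illustrate the shape of the decline the complaint alleges; none comes from any filing:

```python
# Hypothetical illustration of the alleged "downward spiral" feedback loop.
# All parameters are invented for illustration; none appear in the complaint.
revenue = 100.0     # publisher revenue, indexed to 100 at the start
quality = 100.0     # editorial quality of the knowledge base, indexed to 100
diversion = 0.15    # assumed share of reference traffic intercepted by the AI each year

for year in range(1, 11):
    revenue *= 1 - diversion                  # intercepted traffic means lost subscriptions and ads
    quality *= 0.9 + 0.1 * (revenue / 100)    # fewer editors -> slower updates -> gradual decay
    print(f"year {year:2d}: revenue index {revenue:5.1f}, quality index {quality:5.1f}")

# Each training run ingests whatever quality index exists at that step, so no single
# snapshot looks degraded; only the trend across runs reveals the decline.
```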

Britannica’s “downward spiral” framing is strategically compelling but overlooks counterevidence in its own litigation market. The News Corp–Meta and Reach–Amazon deals show that publishers with sufficiently valuable catalogs can negotiate licensing revenue rather than simply losing it, suggesting the spiral has market-based exits that a complaint designed for maximum judicial sympathy has reason to omit. Yet those deals also prove the underlying premise: content licensing exists precisely because the companies writing the checks acknowledge they are extracting value that someone else created. OpenAI is the only major AI company refusing to write that check, and as the 20:1 rejection-to-exposure ratio shows, it is not a financial decision. It is a legal one.

The Credibility Cannibalism has a price. No one is paying it.

The Copyright Math Nobody Published

Only one AI training dispute has produced a dollar figure. Bartz v. Anthropic settled for $1.5 billion, the largest copyright settlement to date, covering pirated books used as training data. Applied to Britannica’s catalog of approximately 100,000 articles, the arithmetic yields a figure neither side has published: $1.5 billion / 100,000 articles = $15,000 per article.

That number is both too high and too low. Too high because Anthropic’s settlement involved pirated material, a stronger factual basis than unauthorized scraping. Too low because Britannica’s articles are individually curated reference works requiring subject-matter expertise and editorial review, not mass-market paperbacks. Either way, the figure benchmarks something no court has priced: the per-unit value of authoritative knowledge consumed by a language model.

For reference publishers, the cost of inaction is asymmetric. If the MDL produces a favorable fair use ruling before publishers have asserted their claims, curated knowledge gets priced at zero, not as a negotiating position but as binding precedent. Publishers who do not join the MDL before precedent is set forfeit the only window in which their catalogs have leverage.

For CTOs building on generative AI APIs, the ruling has supply-chain consequences: an adverse fair use precedent does not just affect Britannica. It reprices every training dataset in every model behind every API call.

Regulatory fragmentation compounds the uncertainty. Article 50 of the EU AI Act requires transparency about AI-generated content. The UK reversed its copyright position entirely under £146 billion in industry lobbying. The U.S. has no framework at all. Three jurisdictions, three answers, and a structural incentive for AI companies to train wherever the law bends farthest.

One limitation bears stating directly: this analysis draws primarily on allegations in Britannica’s complaint, a document drafted to maximize legal advantage, not to present balanced facts. Independent verification of the 100,000-article scraping claim would require access to OpenAI’s training data manifests, which remain undisclosed.

OpenAI’s fair use defense is not frivolous. Transformative use doctrine has expanded with each major revision since the Copyright Act of 1976. “A lot of lawyers for the AI industry strongly believe that the fair use doctrine coming out of [Authors Guild v. Google] is going to protect the training purposes that are behind the ML model,” said Shyamkrishna Balganesh, professor at Columbia Law School. “Myself, I’m not sure that that’s an open-and-shut case.” But fair use was designed for a world where the derivative served a different audience or purpose. When the derivative delivers the original’s exact function, a dictionary definition or an encyclopedia entry, to the original’s exact audience, “transformative” describes something the doctrine was never built to accommodate. ChatGPT has already been sued for practicing law. Whether it can also be held liable for practicing lexicography is a question no prior case has raised.

By March 2027, Judge Stein’s consolidated ruling will likely have landed. If fair use prevails, the licensing market for AI training data collapses, and with it, publishers’ leverage to price their catalogs. If it does not, every AI company faces retroactive exposure across its entire training corpus. The 20:1 rejection ratio will look either like the shrewdest legal strategy in tech history or the most expensive refusal to sign a check.

Merriam-Webster has refined its definition of “plagiarize” across editions for over a century. Whether anyone will be paid to write the next revision depends on a question no court has answered: when the machine that defines the word commits the act, and the institution that wrote the definition cannot survive the economics, who edits the next dictionary? Case No. 1:2026cv02097 will not merely resolve a copyright dispute. It will determine whether authoritative knowledge has a market price, or an expiration date.
