Leading publishers, including Encyclopedia Britannica (owner of Merriam-Webster), have launched a lawsuit against OpenAI, alleging systematic and widespread copyright violations. The core claim is that OpenAI illegally scraped and used nearly 100,000 copyrighted articles to train its large language models (LLMs) without permission.

The Case: How OpenAI Allegedly Infringed on Copyright

Britannica argues that OpenAI’s actions go beyond simple data collection. The lawsuit specifically accuses the AI giant of two key violations:

  1. Direct Reproduction: OpenAI’s models allegedly generate outputs that contain verbatim copies of Britannica’s content.
  2. Retrieval-Augmented Generation (RAG) Abuse: OpenAI’s retrieval features, which enhance ChatGPT’s responses with real-time web data, allegedly incorporate Britannica’s articles without authorization. In effect, the complaint argues, OpenAI profits from Britannica’s work while undermining its revenue streams.
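For readers unfamiliar with the technique at issue, retrieval-augmented generation can be sketched in a few lines: retrieve documents relevant to a query, then prepend them to the prompt sent to a language model. The sketch below is purely illustrative (a toy keyword-overlap retriever over an in-memory corpus); it is not OpenAI's implementation, and all names in it are hypothetical.

```python
# Minimal RAG sketch (illustrative only, not any vendor's actual pipeline):
# 1) retrieve the most relevant documents for a query,
# 2) build an augmented prompt that a language model would then answer.

def retrieve(query, corpus, top_k=2):
    """Naive retrieval: rank documents by keyword overlap with the query."""
    q_words = set(query.lower().split())
    return sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )[:top_k]

def build_prompt(query, corpus):
    """Assemble the context-augmented prompt the model would receive."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Toy corpus standing in for retrieved web content.
corpus = [
    "The encyclopedia article describes photosynthesis in plants.",
    "A dictionary entry defines the word photosynthesis.",
    "An unrelated article about maritime law.",
]
print(build_prompt("photosynthesis in plants", corpus))
```

The legal dispute is precisely about step 1: if the retrieved context consists of copyrighted articles, the model's answer is built directly on that content, which is why publishers argue RAG goes beyond training-time use.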

The complaint also states that OpenAI violates trademark law through false attribution. ChatGPT is accused of generating “hallucinations” (fabricated information) and presenting them as if sourced from Britannica, damaging the publisher’s credibility. Britannica contends that this practice not only harms its bottom line but also erodes public trust in reliable online sources.

A Growing Trend: Publishers vs. AI

Britannica is not alone in this legal battle. The New York Times, Ziff Davis (parent company of Mashable, CNET, and others), and over a dozen newspapers across North America have already filed similar suits against OpenAI. A separate lawsuit against Perplexity, another AI company, remains unresolved.

The central question driving these cases is whether training an LLM on copyrighted material constitutes fair use. While there is no firm legal precedent, Anthropic previously argued in court that such use is “transformative,” and the judge largely accepted that argument with respect to training itself. However, the same judge found that illegally downloading content (rather than licensing it) was a clear violation, leading to a $1.5 billion settlement.

Why This Matters

These lawsuits are significant because they challenge the fundamental business model of many AI companies. LLMs rely on massive datasets, often including copyrighted material, to function. If courts rule consistently in favor of publishers, AI developers may need to rethink their data acquisition strategies or face crippling legal costs. The outcome will shape how AI systems are trained and used, potentially forcing a shift toward licensed content and stricter data controls.

OpenAI has yet to respond to the allegations, but the legal pressure is mounting. The future of AI training may depend on how these cases unfold.