Policy | 6/27/2025

Pulitzer Winner Takes on Microsoft Over AI's Use of Pirated Books

A group of authors, including Pulitzer Prize winner Kai Bird, is suing Microsoft, claiming the tech giant trained its AI on nearly 200,000 pirated books. This lawsuit raises big questions about copyright and the ethics of using protected works in AI development.

So, here’s the scoop: Microsoft is in hot water with a bunch of authors who are saying the tech giant used their books without asking. Yup, you heard that right! They filed a lawsuit in New York, claiming that Microsoft trained its AI on a dataset that included nearly 200,000 pirated books. Can you imagine? That’s a lot of reading material!

The lawsuit is led by Pulitzer Prize winner Kai Bird, along with other notable writers like Jia Tolentino and Daniel Okrent. They’re saying Microsoft went ahead and used a dataset called "Books3" to train its AI model, known as the Megatron-Turing Natural Language Generation model (MT-NLG). This model is supposed to generate text that sounds super human-like, but the authors argue it’s built on their hard work without any permission or payment. Talk about a plot twist!

The authors are claiming that Microsoft’s actions are a clear case of copyright infringement. They allege that this dataset, which was taken down after a complaint from a Danish anti-piracy group, contained around 196,640 pirated e-books. By using their work to train the AI, they argue, Microsoft has created a system that can mimic their unique writing styles and themes. Basically, it’s like making knock-off versions of their books without giving them a dime.

Now, they’re not just looking for a slap on the wrist. They’re asking for statutory damages of up to $150,000 for each book that was infringed upon, plus a court order to stop Microsoft from using their material. That’s some serious cash!

But wait, there’s more! The MT-NLG model is a big deal in the AI world. Developed with NVIDIA, it’s one of the largest language models out there, boasting 530 billion parameters. That scale is a big part of why it handles language in a pretty nuanced way and does so well at tasks like text prediction and reading comprehension. But the authors are saying that this impressive capability is built on a shaky foundation of illegally copied works.

Microsoft and NVIDIA have acknowledged that their model can pick up biases from its training data, but the lawsuit is really digging into whether it was even legal to use that data in the first place. The authors believe Microsoft chose to use pirated content to avoid paying licensing fees, which is a pretty bold move if you ask me.

This lawsuit isn’t just a one-off; it’s part of a larger battle between authors and tech companies over how copyrighted material is used in AI training. Companies like Meta, Anthropic, and OpenAI have faced similar lawsuits. Their defense? They often cite the "fair use" doctrine, claiming their use of copyrighted works is transformative and necessary for innovation. But the authors are pushing back, saying this kind of use undermines their ability to profit from their creations.

The legal landscape is super murky right now. In one recent case against Anthropic, a federal judge in California said that while training AI on copyrighted books could be fair use, the company could still be liable for using pirated versions. This distinction is crucial for the case against Microsoft, as the authors are specifically alleging the use of pirated content.

So, what’s at stake here? If the authors win, it could force AI developers to rethink how they source and license their training data, which might slow down development and increase costs. On the flip side, if Microsoft wins, it could set a precedent that using publicly available data is fair game, speeding up AI development but potentially trampling on creators’ rights.

As these legal battles unfold, they’re gonna shape the future of AI and how we think about creativity and ownership in the digital age. It’s a wild ride, and I can’t wait to see how it all plays out!