AI Research | 6/6/2025

The Common Pile: A New Era for Ethical AI Training

The Common Pile, an 8TB dataset composed of openly licensed and public domain texts, offers a legally sound alternative for training large language models. Developed by a consortium of researchers, it addresses copyright concerns and promotes transparency in AI development.

Introduction of The Common Pile

The launch of The Common Pile, an extensive eight-terabyte text dataset, marks a significant advance in artificial intelligence development. The dataset is constructed entirely from openly licensed and public domain sources. The initiative is a collaborative effort involving researchers from EleutherAI, the University of Toronto, the Vector Institute, Hugging Face, the Allen Institute for Artificial Intelligence, and other institutions. It aims to provide a transparent and legally sound alternative to the copyright-encumbered web data currently used to train large language models (LLMs).
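Because the corpus is distributed openly, it can be explored with standard tooling. The sketch below shows how one might stream a slice of it using the Hugging Face datasets library; the repository ID and record fields are placeholders for illustration, since the collection is split across per-source datasets whose exact names should be checked on the Hub.

```python
# A minimal sketch of streaming Common Pile text from the Hugging Face Hub.
# The dataset ID below is a placeholder -- consult the common-pile
# organization on the Hub for the actual per-source repository names.
from datasets import load_dataset

# Streaming avoids downloading the full multi-terabyte corpus up front.
dataset = load_dataset(
    "common-pile/example_source",  # hypothetical ID; substitute a real one
    split="train",
    streaming=True,
)

# Inspect a few records; openly licensed corpora typically carry the text
# plus provenance metadata alongside each document.
for i, record in enumerate(dataset):
    print(record.get("text", "")[:200])
    if i >= 2:
        break
```

Streaming keeps memory use flat, which matters when the underlying corpus runs to terabytes.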

Addressing Industry Challenges

The introduction of The Common Pile comes at a crucial time for the AI industry, which is under increasing scrutiny and facing legal challenges over data acquisition methods. This project represents a significant step towards fostering more ethical and open practices in AI system development. The dataset's creation was a meticulous two-year process, driven by the need for high-quality, large-scale datasets free from legal ambiguities.

Evolution from "The Pile"

Four and a half years before this release, EleutherAI introduced "The Pile," an 800GB dataset that was groundbreaking for its time but still drew on copyrighted web text, leaving its legal status ambiguous. The Common Pile v0.1 expands significantly on that earlier work, not just in scale but in its strict adherence to open licensing and public domain content. The dataset draws on 30 diverse sources, including research papers, open-source code, government documents, and historical books.

Ethical and Legal Considerations

The development of The Common Pile challenges the widespread industry practice of training LLMs on unlicensed text, a method that has led to numerous lawsuits and reduced transparency from AI developers. Many companies have historically scraped data from the internet without permission, drawing accusations of intellectual property infringement. The Common Pile mitigates these risks by ensuring all data is permissively licensed or in the public domain, promoting the data transparency essential for scientific research.
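One practical consequence of this approach is that provenance can be audited programmatically rather than taken on faith. The following sketch illustrates the idea with a hypothetical license allow-list; the record schema and field names are assumptions for illustration, not the dataset's actual format.

```python
# A sketch of the kind of license auditing that open metadata makes possible.
# Field names ("text", "license") are assumptions for illustration; actual
# Common Pile records may label provenance differently.
ALLOWED_LICENSES = {
    "CC-BY-4.0",
    "CC-BY-SA-4.0",
    "CC0-1.0",
    "Public Domain",
    "MIT",
}

def is_permissive(record: dict) -> bool:
    """Return True if a document's declared license is on the allow-list."""
    return record.get("license") in ALLOWED_LICENSES

documents = [
    {"text": "An openly licensed paper...", "license": "CC-BY-4.0"},
    {"text": "A scraped novel...", "license": "All Rights Reserved"},
]

audited = [doc for doc in documents if is_permissive(doc)]
print(f"{len(audited)} of {len(documents)} documents pass the license audit")
```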

Performance and Future Prospects

A common concern is whether models trained on openly licensed text can match the performance of those trained on unlicensed datasets. To address this, the creators of The Common Pile trained two 7-billion-parameter LLMs, Comma v0.1-1T and Comma v0.1-2T, named for the one trillion and two trillion tokens of Common Pile text they were trained on. These models performed comparably to leading models trained on unlicensed data, demonstrating that adherence to open licensing need not compromise model quality. The public release of both the models and their training data allows for independent verification and further research.
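Readers who want to try the released models can load them with the usual Hugging Face transformers pattern, sketched below. The repository ID is an assumption and should be verified against the official release before use.

```python
# A minimal sketch of loading one of the released Comma models for inference.
# The repository ID is an assumed Hub name -- verify it against the release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "common-pile/comma-v0.1-2t"  # assumed Hub ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate a short continuation to sanity-check the model.
inputs = tokenizer("Open data lets researchers", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```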

Conclusion

The Common Pile represents a landmark achievement in building a transparent, ethical, and legally robust foundation for AI research and development. By curating eight terabytes of text from openly licensed and public domain sources, its creators address copyright concerns and transparency deficits troubling the AI field. The initiative sets a new standard for future dataset creation, encouraging collaboration and a responsible approach to AI development.