Books by famous authors such as J.K. Rowling, Amitav Ghosh, Rupi Kaur, and Neil Gaiman are part of a dataset of pirated books known as Books3, which has been used by large companies to train their generative AI models, according to a report by The Atlantic.
The news publication made available a searchable database of the Books3 dataset, which lets people look up author names to see whether their works were included. Of the 191,000 titles in total, the article noted that 183,000 had author information.
The Books3 dataset is said to have been used without permission by companies like Meta and Bloomberg to train their generative AI systems.
A search for J.K. Rowling returned multiple entries because the database included not just her English-language ‘Harry Potter’ books but also foreign-language translations. For other authors, only a few of their published novels were listed.
Many authors on X (formerly Twitter) expressed outrage and shared screenshots showing that their copyrighted novels were part of the dataset. Others suggested coming together for a class action lawsuit against the named companies.
The Books3 dataset has been mentioned in court filings by writers who sued Meta and OpenAI, alleging their copyrighted works were pirated and then scraped for AI training. Plaintiffs have also sued Google on the same grounds.
OpenAI has in the past defended the use of copyrighted media for AI training, claiming that the fair use doctrine protects such innovation.
Meanwhile, when Google released the Bard extensions for Google apps, the company said that information entered via Gmail and Docs would not be used for machine learning.
The article also warned that false positives could appear in the database.