Does Katniss Everdeen, the narrator and hero of the bestselling Hunger Games trilogy, ever have children?
Citing the epilogue of the final book, ChatGPT correctly replied that Katniss has a daughter and a son. Though it invented names for the children, who went unnamed in the book, the AI chatbot’s detailed responses to such questions about novels are proof enough for a number of authors, who claim that their copyrighted books were part of the data sets on which popular chatbots were trained.
How could ChatGPT, Bard, and other AI models and chatbots access thousands of books for their training? According to authors, one major source of this data was book piracy, or free copies of e-books on ‘shadow libraries.’
On June 30, The Authors Guild, an advocacy group for writers, shared an open letter signed by over 5,000 individuals and addressed to the leaders of several companies involved in AI development: OpenAI CEO Sam Altman, Alphabet CEO Sundar Pichai, Meta CEO Mark Zuckerberg, Stability AI CEO Emad Mostaque, IBM CEO Arvind Krishna, and Microsoft CEO Satya Nadella.
In the letter, authors alleged that the tech companies exploited their work without “consent, credit, or compensation” in order to train their large language models. The letter claimed that generative AI risked damaging their profession by flooding the market with mediocre and imitative work. The Authors Guild called for tech companies to obtain permission before using copyrighted work for AI training, and stressed the need to compensate writers for past and future use of their work. The signatories included The Hunger Games trilogy author Suzanne Collins, My Sister’s Keeper author Jodi Picoult, A Series of Unfortunate Events author Daniel Handler (writing as Lemony Snicket), and The Handmaid’s Tale author Margaret Atwood.
The letter also called attention to websites holding pirated copies of books.
“We understand that many of the books used to develop AI systems originated from notorious piracy websites,” said the letter, adding that a court would not “excuse copying illegally sourced works as fair use.”
The Authors Guild is not the only group to accuse book piracy websites of enabling data scraping to train AI models. In a lawsuit brought against Google, eight plaintiffs accused the search engine giant of stealing their digital data for years to train Bard and other AI products. They accused Google-parent Alphabet, Google DeepMind, and Google LLC of obtaining copyrighted, creative, and even paid data without compensating users.
“As part of its theft of personal data, Google illegally accessed restricted, subscription-based websites to take the content of millions without permission and infringed at least 200 million materials explicitly protected by copyright, including previously stolen property from websites known for pirated collections of books and other creative works,” said the court filing dated July 11, claiming that Bard summarised one plaintiff’s published book chapter-by-chapter and could regenerate its text “verbatim.” The filing claimed Bard was trained on a “stolen PDF” of the book.
In their own lawsuit against ChatGPT-maker OpenAI, dated June 28, authors Paul Tremblay and Mona Awad claimed that the books used by OpenAI to train its AI models included their copyrighted works, along with texts from public domain sources such as Project Gutenberg and from illegal shadow libraries containing pirated copies of e-books.
Speaking about one data set of books used by OpenAI, the authors’ filing alleged, “ [...] the OpenAI Books2 dataset can be estimated to contain about 294,000 titles. The only “internet-based books corpora” that have ever offered that much material are notorious “shadow library” websites like Library Genesis (aka LibGen), Z-Library (aka Bok), Sci-Hub, and Bibliotik. The books aggregated by these websites have also been available in bulk via torrent systems.”
Comedian Sarah Silverman and authors Christopher Golden and Richard Kadrey also sued Meta and OpenAI, claiming that the companies infringed their copyrights. Their filings, dated July 7, alleged a potential link between AI model training data sets and shadow libraries, based on the size of the data sets used by the companies.
Old laws versus new tech
Alexandra Elbakyan, a programmer and advocate for open knowledge, founded the Sci-Hub platform in 2011 to let internet users access paywalled academic texts and subscription-based papers for free. Elbakyan has been sued multiple times, and Sci-Hub has been reported to authorities worldwide by academic publishers, who accuse it of enabling piracy and violating copyrights.
While Elbakyan said she didn’t know about novels being used to train AI models, she stated that Sci-Hub’s databases have been mined over the course of more than a decade for purposes which might include the training of large language models (LLMs).
“Because people wrote to me and requested [me] to provide data for their research project in AI and data mining. And they continue to do that,” she told The Hindu by email. She explained that Sci-Hub’s databases are open, so anyone can mine them, but she sometimes helps out users with specific extraction requests.
As for training AI products, Elbakyan pointed to the release of Meta’s Galactica model in November 2022. The model was reportedly trained on around 48 million scientific and academic texts, but was withdrawn after a few days because of the incorrect information it generated.
“The authors do not mention Sci-Hub, but where would they get all these papers from?” Elbakyan wondered.
While Sci-Hub mostly provides academic texts to users, other platforms such as LibGen and Z-Library hold millions of books and other copyrighted or paywalled creative works. Elbakyan explained that training advanced tech tools on existing databases of knowledge is an old idea, but one that might lead to breakthroughs.
“Copyright shouldn’t be an obstacle in the way of progress,” she said, noting that she also raised funds in order to bring AI to Sci-Hub.
The Hindu asked Google’s Bard whether The Hunger Games’ hero Katniss ever had children. Like ChatGPT, Bard cited the epilogue of the third book to confirm that Katniss did indeed have two children. However, Bard credited its answer to a legal online source: the fandom wiki for The Hunger Games, which carries in-depth information about the books, characters, and their story arcs.
In other words: not a pirated books platform.
At the same time, when asked to generate the epilogue of The Hunger Games verbatim, Bard produced a result which was almost identical to the book’s ending, though in the third-person narrative voice.
The authors’ lawsuits will help answer whether authors can defend their copyrighted work and to what degree tech firms can use such content. Tech firms may also be forced to make AI data sets more transparent, so the public can better understand the role played by shadow libraries in enabling the development of AI models.
That being said, the relationship between AI tech companies and writers is only set to become thornier in the near future.
This article was updated with an additional comment from Alexandra Elbakyan.