Books by J.K. Rowling, Amitav Ghosh part of 183,000-book dataset used for AI training: Report

The Books3 dataset is said to have been used without permission by companies like Meta and Bloomberg to train their generative AI systems

September 26, 2023 02:55 pm | Updated 05:05 pm IST

The Atlantic made available a searchable database of the dataset called ‘Books3’ [File] | Photo Credit: REUTERS

Books by famous authors such as J.K. Rowling, Amitav Ghosh, Rupi Kaur, and Neil Gaiman are part of a dataset of pirated books known as Books3, which has been used by large companies to train their generative AI models, according to a report by The Atlantic.

The news publication made available a searchable database of the dataset, called ‘Books3,’ which lets people look up author names to see whether their works are included. Of the 191,000 titles in total, the article noted, 183,000 had author information.

The Books3 dataset is said to have been used without permission by companies like Meta and Bloomberg to train their generative AI systems.

A search for J.K. Rowling returned multiple entries because the database included not only her English-language ‘Harry Potter’ books but also foreign-language translations. For other authors, only a few of their published novels were listed.

Many authors on X (formerly Twitter) expressed outrage and shared screenshots showing that their copyrighted novels were on the list. Others suggested banding together to file a class-action lawsuit against the named companies.

The Books3 dataset has been mentioned in court filings by writers who sued Meta and OpenAI, alleging their copyrighted works were pirated and then scraped for AI training. Google has also been sued over similar allegations.

OpenAI has in the past defended the use of copyrighted media for AI training, claiming that the fair use doctrine protects such innovation.

Meanwhile, when Google released the Bard extensions for Google apps, the company said that information entered via Gmail and Docs would not be used for machine learning.

The article also warned that searches of the database could return false positives.
