If data scraping is no secret, why does it rattle Elon Musk? 

Twitter owner Elon Musk is concerned by data scraping for the purpose of training AI models, but is there cause for alarm?

July 07, 2023 04:18 pm | Updated July 11, 2023 12:00 pm IST

Elon Musk, pictured here, has imposed limits on Twitter usage, citing data scraping [File] | Photo Credit: Reuters

How much do your personal opinions and insights mean to you? If you asked any tech company building an AI model, the answer would be a lot.

For most of these companies, scraping data from sites like Twitter, Reddit and Wikipedia has become one of the primary ways to collect data, largely because it is so easy. They build a specialised bot that cherry-picks the relevant information on a page and converts it into a structured format, such as a spreadsheet, that is easy to consume.
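To make the idea concrete, here is a minimal sketch of such a bot in Python, using only the standard library. It is an illustration under stated assumptions, not a description of how any particular company scrapes: the sample page, its `post`, `author` and `text` class names, and the `PostScraper` and `scrape_to_csv` helpers are all invented for this example, and the page is an inline string rather than a live website.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page markup, invented purely for illustration.
SAMPLE_PAGE = """
<html><body>
  <div class="post"><span class="author">alice</span><p class="text">AI is overhyped.</p></div>
  <div class="ad">Buy now!</div>
  <div class="post"><span class="author">bob</span><p class="text">I disagree, it is useful.</p></div>
</body></html>
"""


class PostScraper(HTMLParser):
    """Collects author/text pairs from elements carrying the assumed class names."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished records, one dict per post
        self._field = None    # field the parser is currently inside, if any
        self._current = {}    # record being assembled

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in ("author", "text"):
            self._field = css_class

    def handle_data(self, data):
        # Text outside the fields of interest (ads, whitespace) is ignored.
        if self._field is not None:
            self._current[self._field] = self._current.get(self._field, "") + data

    def handle_endtag(self, tag):
        if self._field is None:
            return
        self._field = None
        # Once both fields are filled, the record is complete.
        if "author" in self._current and "text" in self._current:
            self.rows.append({k: v.strip() for k, v in self._current.items()})
            self._current = {}


def scrape_to_csv(html):
    """Parse the page and return the extracted posts as CSV text."""
    parser = PostScraper()
    parser.feed(html)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["author", "text"])
    writer.writeheader()
    writer.writerows(parser.rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(scrape_to_csv(SAMPLE_PAGE))
```

The sketch keeps only the fields it was told to look for and skips everything else on the page, which is the "cherry-picking" step; the CSV output stands in for the spreadsheet. A real scraper would also have to fetch pages over the network and cope with messier markup.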

But in the recent past, ethical questions around the technique have come to light. On July 1, Twitter owner Elon Musk said he was limiting the number of tweets that could be read per day for various accounts, blaming the decision on “extreme levels” of data scraping and system manipulation.

Musk has been miffed with AI companies like ChatGPT-maker OpenAI for a while now. A New York Times report stated that in December, just weeks after the release of the viral chatbot, Musk shut down OpenAI’s access to the platform. Two sources said Musk felt that the $2 million OpenAI was paying every year to license Twitter’s data wasn’t enough.

He has often tweeted his frustration. In a tweet posted a few days ago, Musk stated, “Per my earlier post, drastic & immediate action was necessary due to EXTREME levels of data scraping. Almost every company doing AI, from startups to some of the biggest corporations on Earth, was scraping vast amounts of data. It is rather galling to have to bring large numbers of servers online on an emergency basis just to facilitate some AI startup’s outrageous valuation.”

Musk isn’t the only one irked by data scraping. On April 18, the wormhole-like social news website Reddit announced it wanted to be paid for access to its API (application programming interface), the interface that allows outside developers to download its troves of discussions.

Many of these questions around the murky area of data privacy have landed at OpenAI’s door, even as the EU is still drafting rules for its proposed AI Act to regulate technologies like ChatGPT. On June 28, a class action lawsuit was filed against the Sam Altman-led company and its partner Microsoft, alleging that it is “harvesting massive amounts of personal data from the internet” like private conversations, medical data, and more, without requesting user permission. Calling data scraping “illegal,” the plaintiff asked that OpenAI give users the option to opt out of data collection.

Meanwhile, on July 3, Google quietly updated its privacy policy to include data scraping as an admissible method of collecting data.

In the part of the policy discussing “publicly accessible sources,” the search giant stated it “uses information to improve our services and to develop new products, features, and technologies that benefit our users and the public,” going on to clarify that it would use data scraping.

“For example, we use publicly available information to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities,” it said.

Notably, India’s upcoming Digital Personal Data Protection Bill, which rests on the principle of consent, leaves data scraping out of its purview.

Kazim Rizvi, founder of the public policy think tank The Dialogue, explains why the practice is a problem.

“This could also cause a fall through the cracks as the determining legitimacy of consent is nebulous in data scraping exercises. There is a lack of clarity regarding how the data scraped will be safeguarded at the level of processing, data storage, and whether it will be deleted at the end of processing. There is a lack of clarity regarding the data chain - who will have access to the collected data and whether it will be shared with a third party, etc. 

Data collected through the means of data scraping are not safeguarded against misuse like profiling individuals and communities. This in itself may not be onerous, but considering inference generated from profiling as a single source of truth without any alternative could bring out adverse implications, including exclusion,” he added.

(This article has been updated)
