Why Reddit licensing deal offers Google a data mine to push its luck

Social media platform Reddit on Thursday struck a licensing deal with Google, allowing the search giant to access Reddit users’ posts to train the company’s artificial intelligence (AI) engine. As part of the deal, Google will pay the social news aggregation site $60 million annually to access user-generated content from the platform.

This deal couldn’t have come at a better time for the two companies. Reddit wants cash and investor love ahead of its planned initial public offering (IPO). And Google is looking to save face from its AI misadventures.

Reddit generates revenue, but the company is not profitable. Its IPO document, filed with the U.S. stock market regulator, reveals revenue of $804 million in 2023, most of it from advertisers. But the platform posted a net loss of $90.8 million.

Google’s annual payment will give Reddit cash to help move the company toward profitability. Plus, a data partnership with one of the biggest players in AI can boost Reddit’s stature before its IPO, helping investors see value in the platform at a time when AI chatbots are disrupting monolithic social platforms.

The licensing deal hands the Mountain View, California-based company a data mine to salvage itself from the AI wreck it’s in now.

What ails Google?

Google’s sporadic attempts to break OpenAI’s dominance in AI have left the search giant badly bruised. The company’s maiden AI chatbot Bard, launched as a rival to OpenAI’s ChatGPT, was faulty. It had factual errors in its first demo video; subsequent iterations weren’t academically gifted either.

Most recently, the company’s Gemini chatbot overcompensated for the lack of diversity by throwing up irrelevant images for queries. The company’s AI-based image generator showed a picture of a Black woman when queried ‘Who is the United States’ founding father?’ In another instance, it showed Asian persons as Nazi-era German soldiers. Such unintelligent responses have caused quite a stir.

Those blunders led Prabhakar Raghavan, the senior executive overseeing the company’s search business, to apologise and acknowledge that the product “missed the mark”.

While these issues are tied to its large language model (LLM) and the weights attached to tokens, the other challenge Google faces is raw data: LLMs are data-hungry algorithms, and the quality of the information flowing into them matters a lot.

To be good at producing accurate text, Generative AI (GenAI) models first need to read copious amounts of it. So far, tech firms have had a free ride, scraping the web for text and using open-source crawling tools to pull data from websites.

This modus operandi is being challenged as users and publishers push back against AI companies scraping data from the web indiscriminately. In a proposed class action lawsuit filed in July 2023, Google was accused of misusing a large amount of web users’ personal information to train its AI models.

Separately, in December, news publisher The New York Times sued OpenAI and Microsoft for copyright infringement. The lawsuit claims that the AI firms used millions of its news articles to train OpenAI’s AI chatbot, ChatGPT.

Such complaints from individuals and corporations are making lawmakers sit up and formulate policies on the ethical use of information available on the web.

Lawmakers in the U.S. introduced a bill, the AI Foundational Model Transparency Act, that would require the Federal Trade Commission (FTC) and the National Institute of Standards and Technology (NIST) to frame rules for reporting training-data transparency in AI models. This would in turn require builders of foundational AI models to disclose their sources of training data.

If such a law is passed, AI companies may have to compensate content owners for using their data to train models. Consequently, the cost of building AI models will go up. To pre-empt such a law, large tech firms are sealing licensing deals with news publishers and other content sources. OpenAI’s deal with news agency Associated Press is a case in point.

Other news organisations, including Gannett (the largest U.S. newspaper company) and News Corp (the owner of The Wall Street Journal), have been in talks with OpenAI, per media reports. Publications that have struck deals with AI companies will get a fee based on how frequently their content is used.

How different is this deal?

It is against this backdrop that Google is making a deal with Reddit. But, unlike other platforms, Reddit works as a social news website, where content is socially curated and promoted. The platform is composed of sub-communities, known as subreddits, where members submit content, which is then up- or down-voted by other members.

Under this deal, Google will have access to Reddit’s Data API, which will provide the search giant with real-time, unique content from a large and dynamic platform. This will help the company’s AI model access behavioural and trending information. Apart from this, Google will continue to access information from the web using crawlers.

But there is one catch with Reddit. In July 2023, when Reddit decided to introduce a new policy that charged some third-party apps for accessing data on its platform, concerns over content moderation and accessibility arose. Several groups protested the changes proposed by Reddit, and over 8,000 subreddits went dark. Those subreddit groups said at the time that the changes threatened to end a key way in which users had historically customised the platform.

To avoid such a conflict this time around, Reddit is giving an unspecified number of its top users, including moderators and those with high karma scores, the chance to buy shares in its IPO, according to a report by The Verge.

Reddit plans to do this through a tier-based allocation system. Tier one will comprise certain users and moderators identified as having meaningfully contributed to Reddit community programmes. The second tier will be made up of people with a karma score of at least 2,000, a score that shows how much a user contributes to the Reddit community, and those who have performed at least 5,000 moderator actions.

This is an unusual move, as this privilege is usually reserved for professional investors who want to buy stock at a theoretically lower price before the stock is listed on an exchange. Reddit currently has some 267.5 million active weekly users, more than 100,000 active communities, and one billion total posts, according to its SEC filing.

Have other platforms used user data to train AI models?

Unlike Reddit, few platforms have been forthcoming on whether the public information of users is used to train AI models. X, formerly Twitter, in September, said it would use users’ posts to train AI models for the purposes outlined in its policy. The policy did not specify the AI model it referred to.

Meta said user data from its applications, including Facebook, Instagram, and Threads, would be used to train its AI chatbot. While TikTok and Snapchat have both launched AI chatbots, neither has mentioned using users’ posts to train AI models.

The practice of using user data to train algorithms is not new in the world of tech. Most platforms’ recommender engines use a person’s usage data to suggest videos, articles and movies. But using that information to train AI models is new, and it calls for caution given these chatbots’ propensity to regurgitate personal information when they respond to prompts.

A case in point is Samsung, which banned the use of AI chatbots in its offices after finding that company secrets had leaked when employees used such an application.
