The story so far: If, in response to a prompt, ChatGPT produces text that is near-verbatim from a New York Times article, is that plagiarism? Does it amount to “theft” if OpenAI and Microsoft rake in billions of dollars using its creative reporting and journalism, without offering fair compensation? Battle lines over generative AI’s use of copyrighted work have been drawn again, this time by The New York Times. On December 28, the news platform filed a lawsuit against OpenAI and Microsoft, creators of ChatGPT and other generative AI products, for unlawful use of its work. “There is nothing ‘transformative’ about using The Times’s content without payment to create products that substitute for The Times and steal audiences away from it,” their complaint reads.
The complaint is the first AI copyright lawsuit within the news ecosystem, arguing that the generative AI models threaten the publication’s business model and compromise the credibility of its “massive investment in its journalism.” Authors, visual artists and composers have previously hit the two companies with copyright class action lawsuits alleging “rampant theft.”
The Hindu speaks to Cecilia Ziniti, a California-based tech lawyer with a specialisation in AI and business, to decode why NYT’s lawsuit is the “best case yet alleging that generative AI is copyright infringement,” and why it could be a “watershed moment for AI and copyright”.
Breaking down the legal arguments
In a 70-page complaint filed in a Manhattan federal court, The Times has alleged that OpenAI is engaging in forms of unauthorised use of copyrighted material, and making “money off the publication’s work and name,” explains Ms. Ziniti.
Take, for example, an excerpt from The Times’ Pulitzer Prize-winning 2019 series on exploitative lending in New York City’s taxi industry, reproduced in the complaint. With “minimal prompting,” ChatGPT recited the text near-verbatim: its only contributions were to switch a few words (“medallions” for “cabs,” “key initiatives” for “priorities”), add a word, and remove six others.
This is called “memorisation,” where models regurgitate portions of the material they were trained on. The lawsuit, in Exhibit J, presents 100 examples of ChatGPT producing verbatim articles. ChatGPT is not merely scraping data from NYT articles or matching its voice, but generating “output that recites Times content verbatim, closely summarizes it, and mimics its expressive style,” The Times has alleged.
OpenAI and Microsoft use copies of NYT’s articles to train their large language models (LLMs), including ChatGPT and Copilot, encoding its copyrighted material for the LLMs to learn from. Moreover, the AI firms’ products reproduce articles by bypassing paywalls using a browsing plugin [in August, NYT and other media houses blocked OpenAI’s web crawler]. The lawsuit estimates the companies owe “billions of dollars in statutory and actual damages.” OpenAI projects $1 billion in revenue this year, making ChatGPT a “certified cash cow,” as one article put it.
“Defendants seek to free-ride on The Times’s massive investment in its journalism,” the complaint says. The 2019 Pulitzer Prize-winning series was the product of an 18-month-long investigation, 600 interviews, extensive data analysis, and more than 100 record requests. “OpenAI had no role in the creation of this content.” The unauthorised use “undermine[s] and damage[s]” The Times’ relationship with readers and deprives the publication of “subscription, licensing, advertising, and affiliate revenue,” as it reduces readers’ likelihood of visiting the website.
The “unlawful use” of the paper’s “copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more” to create AI products “threatens The Times’s ability to provide that service.” - From the complaint
Moreover, the publication claims that the chatbots present unfair competition under trademark law. While it is legal to publish a newsletter that summarises other publications’ dispatches, those publications would suffer losses if the summariser establishes itself as a competitor and discourages users from clicking on the original articles. Ms. Ziniti, however, clarifies that while “issues of power and shifting revenue in the industry are at play — that’s not directly what NYT is suing over...Copyright law provides the clearest means to address what NYT is concerned about here.”
The lawsuit is unique in its mention of LLMs’ tendency to “hallucinate,” where the bots respond with false information, wrongly attributing it to a source. When prompted with the question “What does the New York Times say are the 15 most heart-healthy foods to eat?” alongside a link to a specific New York Times article, Bing Chat listed 15 heart-healthy foods “[a]ccording to the article you provided,” including “red wine (in moderation).” The original article did not mention 12 of the 15 foods listed. A GPT model also fabricated an article The Times never published. “In AI parlance, this is called a ‘hallucination’. In plain English, it’s misinformation,” The Times stated in its lawsuit, adding that the inaccurate information damages the company’s credibility. It also poses a threat to high-quality journalism by eroding people’s ability to sort fact from fiction, the lawsuit notes.
OpenAI said it was “surprised and disappointed” by the lawsuit, given that it had been in talks with the organisation over a commercial arrangement. Per a Times report, the publication had approached Microsoft to work out an “amicable solution” that could put in place “technological guardrails” for LLMs using copyrighted content. The talks did not yield an agreement.
Are other publications using ChatGPT?
The news ecosystem is humming with OpenAI and Microsoft-powered licensing deals, though conditions vary over price and terms of use. The Associated Press in July allowed OpenAI access to its archive of news articles. Last month, Politico and Business Insider inked a deal to share their content with OpenAI, allowing the models to pull data from original stories and provide links. The publications, in turn, would get a performance fee based on how frequently OpenAI uses their content. Other news organisations, including Gannett (the largest U.S. newspaper company) and News Corp (the owner of The Wall Street Journal), have been in talks with OpenAI, according to The Times. Others are building their own bots and training them on their own datasets.
Why did NYT’s “commercial arrangement” founder? Ms. Ziniti says money may have been a big factor. “It’s not about the money, it’s about all the money,” she says, meaning that if a rightsholder owns a piece of content, they have the final say on how, and in what ways, it is monetised. “A movie can have separate theatre release, release for aeroplane showings, release for DVD, release for Netflix — that’s up to the rights owner. But it’s also true that user experience and the reader experience matter for NYT,” explains Ms. Ziniti. “They have a reputation and that reputation is worth money too. NYT would say that credibility in journalism has monetary value.”
How is this case different from others?
Three things distinguish NYT’s lawsuit from similar class action lawsuits. One, the scale and might of the publication: it gets 50-100 million views on its digital content per week, and has a bigger and “arguably higher quality archive than any one author.” The Times has claimed that its archive, with articles dating back to 1851, was the single biggest proprietary data set used in Common Crawl, a copy of the internet that OpenAI uses to train its models. Moreover, NYT has done this before, going all the way to the Supreme Court to tackle copyright issues, and has the monetary and legal resources to see it through, according to Ms. Ziniti.
Two, the news industry has exhibited a willingness to negotiate with OpenAI. The Times itself has made deals with other tech companies for using its content. “There’s a market for training data, and OpenAI hasn’t paid the NYT in that market yet,” says Ms. Ziniti.
Three, and perhaps the strongest, is NYT’s Exhibit J of GPT generating verbatim articles. “Copyright law looks at the substantiality and amount of the copying for infringement, so the Exhibit is exceptionally strong,” she says. OpenAI has previously said that the memorisation issue was fixed, but commentators were still generating verbatim outputs as recently as this weekend, says Ms. Ziniti. “Even if OpenAI has fixed it, they haven’t done it yet in a way that would get rid of the case.”
What also makes NYT the “perfect plaintiff” is the contrast between the public good of journalism and the “profit-driven” working premise of OpenAI, explains Ms. Ziniti. The complaint mentions the controversy over Sam Altman’s dismissal and concerns about the “safety and ethics issues related to the launches of ChatGPT and GPT-4, including regarding copyright issues.” Even the hallucination claims are, in material terms, a “side-show” — they are not as strong or clear as a copyright claim — but they are “a good legal strategy because they get people thinking generally that OpenAI is bad and scary,” she says.
What is OpenAI’s legal stance?
Experts say OpenAI and Microsoft will likely argue that using copyrighted works to train AI products amounts to “fair use,” a doctrine that permits certain unlicensed uses of copyrighted material. Ms. Ziniti explains that ‘fair use’ depends on four factors: the purpose and character of the use; the nature of the copyrighted work; the amount and substantiality of the portion used; and the effect of the use on the potential market. The last three fall in The Times’s favour. “NYT’s works are highly creative (not just facts, like phone numbers), OpenAI uses the entirety of NYT’s works, and NYT points to lost revenue,” she says.
Matters hinge on the first point — the purpose and character of OpenAI’s use — which turns on whether that use is “transformative.” The U.S. Copyright Office notes on its website that “transformative” uses add “something new, with a further purpose or character” and are “more likely to be considered fair.” In the landmark Feist Publications case, the U.S. Supreme Court found telephone books were not copyrightable because “information alone without a minimum of original creativity cannot be protected by copyright.” Put differently, copyright shields creative expression, not facts or the effort behind compiling them. Since OpenAI is using articles to train and develop an LLM, the argument goes, “that is a new and different purpose of use than using the articles just to read them or have a subscription news product,” Ms. Ziniti adds.
Moreover, a more “intellectual puzzle,” as tech analyst Benedict Evans put it in his blog, is that “on one hand all headlines are somewhere in the training data, and on the other, they’re not in the model.” The model may have crawled chunks of NYT’s journalism, but it is not a database — LLMs infer patterns in language from vast quantities of text, but do not store the original data. “ChatGPT might have looked at a thousand stories from the New York Times, but it hasn’t kept them,” Mr. Evans noted. Moreover, NYT’s articles may be a tiny fraction of the LLMs’ training data, and the models could arguably still function if any one company’s content were removed.
A judge in November dismissed a case by comedian Sarah Silverman and other authors who sued OpenAI and Meta Platforms for having “ingested” their works. There was no evidence it “could be understood as recasting, transforming, or adapting the plaintiffs’ books,” the court said.
In his blog, Benedict Evans argued that the tussle over generative AI and intellectual property “is a completely new problem that we’ve been arguing about for 500 years.” LLMs are not unique in their appropriation of creative labour: artists take inspiration from prior works; people read 50 articles to write one. The difference, however, comes from AI’s ability to make an otherwise ordinary creative process possible at a massive scale, in an automated way, and by companies with billion-fold the collective power of writers, journalists, and coders. “This might be the difference between the police carrying wanted pictures in their pockets and the police putting face recognition cameras on every street corner — a difference in scale can be a difference in principle,” Mr. Evans wrote.
IP and generative AI
The lawsuit is the latest to articulate how generative AI appropriates, and also impacts, the creative process. It sets off allied concerns about how copyright laws define an author, whether companies need to limit the scraping of online content, whether said scraping should fairly compensate original creators, and what guardrails are needed to protect high-quality journalism.
The case can travel down one of two legal roads, said Ms. Ziniti. The Times could opt for a settlement if OpenAI agrees to an acceptable system for crediting publishers, with NYT sitting on an advisory board as keeper of the said guardrails. Or the case could reach the Supreme Court, where the arrival of a new tech behemoth with billions of dollars may highlight the need for new rules and safeguards. Moreover, Ms. Ziniti notes that the ‘fair use’ argument may not hold weight against the Exhibit J evidence, since verbatim outputs produced through memorisation can hardly be called “transformed.”
“I expect the court will split up the analysis, and that OpenAI will try to fix the output issues with filtering or engineering efforts,” she says. Filtering may involve chatbots refusing, or altering, the original request.
Some have argued that existing legal frameworks will fall short when it comes to addressing the copyright questions raised by the disruptive force of generative AI. Copyright laws such as India’s Copyright Act, 1957, define the author as the person “who causes the work to be created”; for literary work generated by a computer, the Act deems the author to be the person who caused the work to be generated. “This provision can be broadly interpreted to include individuals who provide the necessary data or instructions to an AI system, resulting in the creation of computer-generated work,” consulting firm IIPRD wrote in a blog. The law, however, is not designed to account for the fact that generative AI does not create information from scratch, but is trained on datasets built using copyrighted work. The 161st Parliamentary Standing Committee Report in 2021 concluded that the Act is “not well equipped to facilitate authorship and ownership by Artificial Intelligence”.
Challenging copyright infringement, then, concurrently questions what it means to create, and to create art, and how best to value intellectual labour that becomes the scaffolding of generative AI, allowing companies to pocket billions of dollars. Any verdict over the IP question of generative AI, Mr. Evans pointed out, will be split between “two ideas of authenticity and two ideas of art.”