Would We Have Taylor Swift Without The Beatles? The Battle Over AI Training

February 22, 2024 Robert Rosenberg

The 2022 launch of ChatGPT introduced artificial intelligence (AI) to the general public and raised a number of novel issues in the area of copyright. Artists learned that the vast data sets used by developers to train large language models contained copyrighted works, most of which had been scraped from the internet.

Imitation Is The Sincerest Form Of Flattery

As a bit of context, let’s remember that most great creative works are inspired (or at least influenced) by works that came before them. During the Renaissance, journeyman painters learned their craft by studying the works of the great masters. Jump ahead by a few centuries and we know that Elvis Presley was influenced by the gospel music he grew up on. His music, in turn, influenced such artists as the Beatles. And artists like Taylor Swift, Lady Gaga and Billy Joel count the Beatles (in some cases, specifically Paul McCartney) among the musicians who most influenced their music. I’m sure we can find the same connections for modern authors, fine artists, photographers, and the like. All artists are the product of their life experiences, including (and probably most importantly) all of the art they encountered throughout their lives.

In some cases, subsequent works are completely different from the works that influenced them. In others, they hew more closely to what came before so that the influence is more immediately observable. When works get too close to the original, copyright lawsuits arise. In these cases, courts generally look at two factors: whether the author of the new work had access to the older work, and whether there is a substantial similarity between the new work and the older work. While there are several defenses and exceptions to copyright infringement, these cases are usually very fact-specific, with courts carefully comparing the potentially infringing work to the original.

It is with this background that I’ll turn our conversation to the current debate over whether developers of AI systems should be permitted to use copyrighted works in training large language models.

Here Come The Lawsuits

Artists and publishers argue that in scraping the internet for data to train AI systems, developers made copies of their works without permission or a license. Because copying is one of the exclusive rights held by a copyright owner, they contend that these copies are infringements. In two prominent lawsuits against OpenAI, one brought by Sarah Silverman and other authors, and the other brought by the Authors Guild on behalf of itself and a group of authors that includes John Grisham, Jodi Picoult and George R.R. Martin, the plaintiffs allege that the unlicensed, uncredited and uncompensated use of their copyrighted works in training ChatGPT violates the authors’ rights and constitutes an unfair business practice that competes with the original works.

AI developers argue that their use of copyrighted works in the training of AI models is protected under the fair use exception. In the same way that modern artists are influenced by those who came before them, they say, AI systems are simply learning about all of the kinds of art that exist and are creating new works based on an amalgamation of what they have learned. According to Sam Altman, the co-founder and CEO of OpenAI, “training data for large language models is not used to teach these systems to copy the data, but to learn the fundamentals of language – including vocabulary, grammar, sentence structure, and even basic logic. Ultimately, these AI systems are not search engines or databases, and are not designed to repeat or even store the content on which they are trained.”

In the more recent case against OpenAI brought by the New York Times, the Times added some novel arguments. There, the Times said that its content is offered behind a paywall and is thus accessible only to paying subscribers. By scraping this content and making it available for ChatGPT to cite in answers to users’ questions, the Times argues, OpenAI is making paid content available for free to its users and therefore costing the Times potential paying customers. In addition, unlike a Google search, which produces links to the Times’ website in response to user queries, ChatGPT synthesizes the information in Times articles and provides it to users in a manner that removes the need for users to actually visit the Times website, thereby depriving the Times of advertising revenue it could derive from traffic to its site. Lastly, the Times argues that because ChatGPT is prone to mistakes (also known as “hallucinations”), when ChatGPT provides an incorrect answer and cites a Times article, ChatGPT harms the reputation of the Times.

Many of these cases are still making their way through the courts. On February 12th, the Northern District of California judge in the Sarah Silverman case dismissed most of the claims. The judge left some room for the plaintiffs to amend their complaint and, most significantly, left the question of direct copyright infringement for the court to consider further. Setting aside the additional issues raised in the Times case, we will likely see some cases result in verdicts this year that will inform how artists and developers move forward in this area.

Do We Really Need To Reinvent The Wheel?

Unlike those who argue that we need new laws to address the issues raised by AI, I think courts can settle these matters using the same criteria they have used for decades. First and foremost, the focus should be on the output provided to users by AI systems. Does the output compete with the original work (or the author of the original)? If so, that leans toward a finding of infringement. If the output permits OpenAI to build a business on the labor of authors, then those authors should be compensated. However, if the output of an AI system is novel and substantially different from any copyrighted work used in its training, then how are copyright owners harmed? In fact, in such cases the copyright owners of the original works wouldn’t even know that their works were used to train the AI systems if they weren’t told.

Copyright law originated as a way for Congress to “promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries.” For artists to have an incentive to continue creating and inventing, they need a way to monetize their works. In each of these cases, courts will balance this need with the goal of also promoting the development of new and useful technologies. Stay tuned as these cases make their way through the courts.
