Generative artificial intelligence (GenAI) company Anthropic has claimed to a US court that using copyrighted content in the training data of large language models (LLMs) counts as “fair use”, and that “today’s general-purpose AI tools simply could not exist” if AI companies had to pay for licenses to use the material.
Under US law, “fair use” permits the limited use of copyrighted material without permission, for purposes such as criticism, news reporting, teaching, and research.
In October 2023, a host of music publishers including Concord, Universal Music Group and ABKCO initiated legal action against the Amazon- and Google-backed generative AI firm Anthropic, demanding potentially millions in damages for what they allege is the “systematic and widespread infringement of their copyrighted song lyrics”.
The filing, submitted to a Tennessee District Court, alleged that Anthropic, in building and operating its AI models, “unlawfully copies and disseminates vast amounts of copyrighted works – including the lyrics to myriad musical compositions owned or controlled by publishers”.
It added that while the AI technology may be complex and cutting edge, the legal issues around the use of copyrighted material are “straightforward and long-standing”.
“A defendant cannot reproduce, distribute, and display someone else’s copyrighted works to build its own business unless it secures permission from the rightsholder,” it said. “That principle does not fall away simply because a company adorns its infringement with the words ‘AI’.”
The filing further claimed that Anthropic’s failure to secure copyright permissions is “depriving publishers and their songwriters of control over their copyrighted works and the hard-earned benefits of their creative endeavors”.
To alleviate the issue, the music publishers are calling on the court to make Anthropic pay damages; provide an accounting of its training data and methods; and destroy all “infringing copies” of work within the company’s possession.
However, in a submission to the US Copyright Office on 30 October (which was completely separate from the case), Anthropic said that the training of its AI model Claude “qualifies as a quintessentially lawful use of materials”, arguing that, “to the extent copyrighted works are used in training data, it is for analysis (of statistical relationships between words and concepts) that is unrelated to any expressive purpose of the work”.
It added: “Using works to train Claude is fair as it does not prevent the sale of the original works, and, even where commercial, is still sufficiently transformative.”
On the potential for a licensing regime covering LLMs’ ingestion of copyrighted content, Anthropic argued that always requiring licenses would be inappropriate, as it would lock up access to the vast majority of works and benefit “only the most highly resourced entities” that are able to pay their way into compliance.
“Requiring a license for non-expressive use of copyrighted works to train LLMs effectively means impeding use of ideas, facts, and other non-copyrightable material,” it said. “Even assuming that aspects of the dataset may provide greater ‘weight’ to a particular output than others, the model is more than the sum of its parts.
“Thus, it will be difficult to set a royalty rate that is meaningful to individual creators without making it uneconomical to develop generative AI models in the first place.”
In a 40-page document submitted to the court on 16 January 2024 (responding specifically to a ‘preliminary injunction request’ filed by the music publishers in November), Anthropic took the same argument further, claiming “it would not be possible to amass sufficient content to train an LLM like Claude in arm’s-length licensing transactions, at any price”.
It added that Anthropic is not alone in using data “broadly assembled from the publicly available internet”, and that “in practice, there is no other way to amass a training corpus with the scale and diversity necessary to train a complex LLM with a broad understanding of human language and the world in general”.
“Any inclusion of Plaintiffs’ song lyrics – or other content reflected in those datasets – would simply be a byproduct of the only viable approach to solving that technical challenge,” it said.
It further claimed that the scale of the datasets required to train LLMs is simply too large for an effective licensing regime to operate: “One could not enter licensing transactions with enough rights owners to cover the billions of texts necessary to yield the trillions of tokens that general-purpose LLMs require for proper training. If licenses were required to train LLMs on copyrighted content, today’s general-purpose AI tools simply could not exist.”
While the music publishers have claimed in their suit that Anthropic could easily exclude their copyrighted material from its training corpus, the company said it has already implemented a “broad array of safeguards to prevent that sort of reproduction from occurring”, including placing unspecified limits on what the model can reproduce and training the model to recognise copyrighted material, among “other approaches”.
It added that although these measures are generally effective, they are not perfect: “It is true that, particularly for a user who has set out to deliberately misuse Claude to get it to output material portions of copyrighted works, some shorter texts may slip through the multi-pronged defenses Anthropic has put in place.
“With respect to the particular songs that are the subject of this lawsuit, Plaintiffs cite no evidence that any, let alone all, had ever been output to any user other than Plaintiffs or their agents.”
Similar copyright cases have been brought against other firms for their use of generative AI, including OpenAI and Stability AI, as well as tech giants Microsoft, Google and Meta. No decisions have been made by any courts as of publication, but the eventual outcomes will start to set precedents for the future of the technology.
In its remarks to the US Copyright Office (again, completely separate to the case now being brought against Anthropic and other tech firms), the American Society of Composers, Authors, and Publishers (ASCAP) said that: “Based on our current understanding of how generative AI models are trained and deployed, we do not believe there is any realistic scenario under which the unauthorized and non-personal use of copyrighted works to train generative AI models would constitute fair use, and therefore, consent by the copyright holders is required.”
In complete contrast to Anthropic, it further claimed that: “The use of copyrighted materials for the development [of] generative AI models is not transformative. Each unauthorized use of the copyrighted material during the training process is done in furtherance of a commercial purpose.”
In September 2023, just a month before the music publishers filed their legal complaint, Anthropic announced that e-commerce giant Amazon would invest up to $4bn in the company, as well as take a minority stake in it. In February 2023, Google invested around $300m in the company, and took a 10% stake. FTX founder Sam Bankman-Fried also put $500m into Anthropic in April 2022, before his company filed for bankruptcy in November that year.