Meta employees discuss using copyrighted content for AI training (TechCrunch)

Meta, the parent company of Facebook, has been embroiled in a legal battle over the use of copyrighted materials to train its AI models. Court documents unsealed recently shed light on internal discussions among Meta employees regarding the use of copyrighted works, obtained through questionable means, for AI training purposes.

The lawsuit, Kadrey v. Meta, is one of many AI copyright disputes making their way through the U.S. court system. The plaintiffs, who include prominent authors such as Sarah Silverman and Ta-Nehisi Coates, are challenging Meta’s assertion that training its AI models on IP-protected works falls under the doctrine of “fair use.”

The new filings in the case include a series of internal work chats between Meta staffers that provide insight into how the company may have used copyrighted data, particularly books, to train its AI models. In one chat, Xavier Martinet, a Meta research engineer, suggested buying e-books at retail prices to build a training set rather than negotiating licensing deals with individual publishers. Despite concerns raised about the legality of using unauthorized copyrighted materials, Martinet argued that many startups were likely already using pirated books for training.

Melanie Kambadur, a senior manager for Meta’s Llama model research team, cautioned about the need for approvals when using publicly available data for training models. However, she noted that Meta’s legal team was becoming less conservative in granting such approvals, citing the company’s increased resources and ability to fast-track the approval process.

Discussions of Libgen, a controversial “links aggregator” that provides access to copyrighted works, also surfaced in the filings. Some within Meta believed that using Libgen for model training was essential to maintaining competitiveness in the AI space. However, efforts were made to mitigate legal risks, such as removing clearly pirated data and refraining from publicly acknowledging the use of Libgen datasets.

Further revelations in the filings suggest that Meta may have scraped Reddit data for model training and considered using Quora content, despite previous decisions against it. The company’s leadership expressed a need for more training data beyond first-party sources like Facebook and Instagram posts, highlighting the challenges in obtaining sufficient data for AI model development.

The ongoing legal battle between the plaintiffs and Meta has seen multiple amendments to the complaint, indicating the high stakes involved for both parties. Meta’s decision to bring in Supreme Court litigators from the law firm Paul Weiss underscores the importance of the case for the company.

As the case continues to unfold, the implications of Meta’s practices in using copyrighted materials for AI training remain a point of contention. The outcome of Kadrey v. Meta could have far-reaching consequences for the tech industry as a whole, shaping the boundaries of fair use and intellectual property in the age of artificial intelligence.

Meta did not respond to a request for comment.