Asking Nicely Might Solve Generative AI’s Copyright Problem
Not only does licensing data for AI development provide the obvious benefit of avoiding protracted (and public) battles on the world stage, but it also provides many convenient side benefits.
If you live in a part of the world where tech and AI are hot topics, you have likely heard about the New York Times lawsuit against OpenAI for copyright infringement. The complaint alleged both that copyrighted material owned by the NYT was used to train AI (making an illegal copy) and that the verbatim regurgitation of copyrighted material in OpenAI’s output infringed the NYT’s copyright.
While in the latest court proceedings OpenAI argues that it does not seek to replace the NY Times as a source of news and that the verbatim copying in its output was a bug, not a feature, those on the sidelines will likely learn that the easiest course (and perhaps the most conservative risk position) might just be to license copyrighted material from the owner before using it to train AI.
I’d argue that not only does this solution provide the obvious benefit of avoiding protracted (and public) battles on the world stage, but it also has many convenient side benefits. This lawyer is certainly not suggesting the doctrine of fair use (the exception used by OpenAI, Google and others for scraping) be rewritten or revised, but merely that, as a practical matter, licensing agreements might solve this icky, tricky issue without anyone getting hurt and perhaps with everyone getting rewarded.
Companies such as Apple, Adobe and even OpenAI have already signed some licensing deals for data, perhaps having predicted such tricky outcomes. While the royalty and licensing fees in these deals are likely steep, when weighed against the expense of lengthy litigation, they might seem a wise and worthy investment.
Some of us music fans might also remember the Drake and The Weeknd sound-alike tune “Heart on My Sleeve” that was generated by AI and caused an uproar within the creative community. What is interesting to note is that even though takedown notices under the Digital Millennium Copyright Act are meant for (as the name of the law suggests) ‘copyright’ claims, many studies have shown that they have been used widely to request the takedown of content that may be offensive but is not necessarily copyright infringing. In this case, the AI-generated song was quickly removed from platforms such as YouTube and Spotify (though it may be back up at the time of this writing), not as a result of a deep analysis of copyright law, but likely because the platforms felt that taking down the song was the right thing to do. There are tons of ethical arguments to be made in favor of removing songs, art, or creative works that are copied from original owners without attribution or compensation, and many others have made these arguments better than I can.
It becomes likely, therefore, that new business models might emerge where creatives are given some level of control over what to share with AI vendors. The good news here is that when incentivized to share data that is intended to be used for AI, creators might actually help out by ensuring that the data that is shared is of adequate quality, is actually usable for AI, and comes with a handy guide for data provenance (a somewhat painful problem for those dealing with AI development).
Regulations on AI coming out of the EU, such as the EU AI Act, mandate that companies creating foundation models share a summary of the copyrighted material used to train the general-purpose AI model. Having certainty about the source of such data, through clear contractual licensing commitments, makes it easier for companies to comply with such laws.
Such data licensing agreements are also likely to include provisions that require creators (licensors) to continually update the data and ‘keep it fresh’, another hugely helpful plus for data scientists working to maintain and measure datasets for AI. Creators might even have the obligation to ensure that their data is properly obtained and does not itself infringe anyone else’s rights in the first place. A licensing deal might even ask the creator to ensure that some data is tested for bias. Ethicists rejoice!
Lastly, those building AI agents for specific products and industries, such as healthcare or fintech, and those dealing with sensitive information might benefit from knowing that their training data is trusted, reliable, and obtained with all necessary legal consents and permissions.