Bill Forces Firms to Reveal Copyrighted AI Training Data

Lawmakers are pushing for greater transparency around the training data used to create powerful AI models. A newly proposed bill aims to compel tech companies to disclose any copyrighted materials incorporated into the datasets that underpin their artificial intelligence systems.

The Generative AI Copyright Disclosure Act

Introduced by Congressman Adam Schiff of California, the Generative AI Copyright Disclosure Act would mandate that any entity building a training dataset for AI models submit detailed reports on the copyrighted content contained within that data. These disclosures would need to summarize the protected works included and provide public access links if the datasets are openly available online. The requirement would also extend to any updates or modifications made to the training data over time.

AI Training Data Legislation Would Be Retroactive

Under the bill’s provisions, companies would have to file these transparency reports at least 30 days before releasing an AI model trained on the disclosed dataset. What’s more, the legislation would apply retroactively: it would require disclosures even for AI systems already deployed, like OpenAI’s GPT-4 and Google’s Gemini models.

AI Training Material and Copyright Infringement

The issue of intellectual property rights has become a major flashpoint as generative AI has exploded in popularity. Content creators across industries have sounded alarms that their copyrighted works may be getting swept up in AI training without approval or compensation. The legal waters remain murky around whether this constitutes fair use under existing copyright law.

AI developers have contended their models simply learn patterns from broad, publicly available data sources, without necessarily replicating full copyrighted works. However, the staggering scale of some training datasets makes vetting the sources extremely challenging. Several major creative industry groups have endorsed Schiff’s transparency push, though the influential Motion Picture Association has not taken a public stance yet.

Meanwhile, independent initiatives like Fairly Trained have also proposed ways to certify when AI companies obtain permissions and licenses for their training data. As generative AI capabilities advance, pressure is mounting for guardrails to uphold intellectual property protections.

The Verge – Emilia David – April 10, 2024