Raleigh News Today


The Atlantic created a searchable database of the music used to train AI

Jun 22, 2026  Twila Rosenbaum

The Atlantic has introduced a searchable database that allows the public to explore the music used to train artificial intelligence models. Compiled by reporter Alex Reisner, the database catalogs four datasets containing millions of tracks, many of which are copyrighted and used without explicit permission. This initiative sheds light on the opaque practices of AI training and the ethical dilemmas surrounding the use of creative works.

The largest of these datasets comprises 12 million tracks, with another containing 9 million. Two smaller sets each hold over 100,000 songs. According to Reisner, these datasets have been downloaded thousands of times, and companies such as Google and Stability AI have confirmed their use in research papers. However, the legality of using these tracks for AI training remains murky, as many sources, like the Free Music Archive, only allow free streaming for personal use, not commercial applications.

The process of collecting this music is not straightforward. Three of the datasets are distributed as lists of links to songs on YouTube or Spotify. Developers then use automated tools to download the actual audio, often bypassing logins and advertisements. These tools violate the terms of service of the platforms, raising further legal questions about the sourcing method. The database includes works from a wide range of artists, from pop stars like Lady Gaga and Fred Again.. to iconic acts like Radiohead, Aphex Twin, Wu-Tang Clan, and Bruce Springsteen, as well as experimental composer Hainbach.

Background on AI Music Training

The use of copyrighted music in AI training has become a contentious issue in the tech and creative industries. AI models, such as those used for music generation or recommendation algorithms, require vast amounts of data to learn patterns and produce coherent output. However, the datasets often include material scraped from the internet without clear rights clearance. This has led to lawsuits from artists and publishers who argue that their work is being used without compensation or permission.

The datasets documented by The Atlantic are not new; they have been circulating among researchers and companies for years. But the public now has a tool to see which songs are being used. For instance, a search for a specific artist or track reveals whether it appears in one or more of the datasets. This transparency is a step toward accountability, though it does not resolve the underlying copyright disputes.

AI music generation tools like Suno and Udio have also come under scrutiny for their reliance on such datasets. Earlier this year, investors poured $400 million into Suno, highlighting the industry's growth despite legal challenges. The Grammys have grappled with how to handle AI-generated music, while artists like SZA have publicly condemned the use of AI in music creation.

Key Facts from the Database

  • Four datasets were uncovered: two massive ones (12 million and 9 million tracks) and two smaller ones (over 100,000 tracks each).
  • Google and Stability AI have acknowledged using these datasets in research.
  • Three of the datasets are distributed as lists of links to songs on YouTube or Spotify, with automated tools used to download the audio in violation of platform terms.
  • Artists appearing include Lady Gaga, Fred Again.., Radiohead, Aphex Twin, Wu-Tang Clan, Bruce Springsteen, and Hainbach.
  • The Atlantic's AI Watchdog site allows users to search for songs, books, and other media used in AI training.

Implications for Copyright and Ethics

The release of this database reignites the debate over fair use in the age of AI. Proponents of AI argue that training on public data constitutes fair use, while creators contend that it is a form of theft. The fact that many of these songs are copyrighted and used without permission adds fuel to the fire. Some sources, like the Free Music Archive, allow free streaming for personal use but not commercial applications, yet AI companies often ignore these restrictions.

Moreover, the method of obtaining audio via YouTube and Spotify violates the terms of service of those platforms. This not only risks legal action from the platforms but also undermines the revenue streams of artists who rely on streaming royalties. The Atlantic database makes it possible to see exactly which creators are affected, potentially empowering them to pursue legal claims or demand compensation.

The issue extends beyond music. Similar datasets exist for books, images, and videos, all used to train the next generation of AI. The lack of transparency has prompted calls for regulation, with some countries exploring mandatory disclosure of training data. The European Union's AI Act, for example, includes provisions for transparency regarding copyrighted material.

In the music industry, the response has been mixed. Some artists see AI as a tool for innovation, while others view it as an existential threat. The Atlantic database does not take a side, but it provides the evidence needed to inform the debate. As AI continues to evolve, the question of how to balance technological progress with creators' rights remains unanswered.

This story is part of a larger series on AI and music, covering everything from investment trends to Grammy policies. The database is a living resource, updated as new datasets are discovered. For now, it serves as a crucial window into the inner workings of AI development and its reliance on the creative output of millions of artists.

As of June 2026, the database is fully searchable on The Atlantic's website. Users can explore by artist, song, or dataset, and see how their favorite music is being used to train the machines that may one day replace human musicians. Whether that future is welcomed or feared, the database ensures that the conversation is grounded in facts.


Source: The Verge News

