Earlier this week, The Wall Street Journal reported that AI companies were hitting a wall in collecting high-quality training data. Today, The New York Times detailed some of the ways companies have dealt with this. Unsurprisingly, that includes activities that fall into the hazy gray area of AI copyright law.
The story begins with OpenAI. Desperate for training data, the company reportedly developed its Whisper speech transcription model to get over that hurdle, transcribing more than a million hours of YouTube videos to train GPT-4, its most advanced large language model. That is according to The New York Times, which reports that the company knew this was legally questionable but believed it to be fair use. OpenAI President Greg Brockman was personally involved in collecting the videos that were used, the Times writes.
OpenAI spokesperson Lindsey Held told The Verge in an email that the company curates “unique” datasets for each of its models to “help us understand the world” and to maintain its global research competitiveness. Held added that the company uses “numerous sources, including public and private data partnerships,” and that it is looking into generating its own synthetic data.
According to the Times article, the company exhausted its supply of useful data in 2021 and, after working through other resources, discussed transcribing YouTube videos, podcasts, and audiobooks. By then, it had trained its models on data that included computer code from GitHub, databases of chess moves, and schoolwork content from Quizlet.
Google spokesperson Matt Bryant told The Verge in an email that the company had “seen unconfirmed reports” of OpenAI's activity, adding that “Both our robots.txt file and our terms of service prohibit unauthorized scraping or downloading of YouTube content,” echoing the company's terms of service. YouTube CEO Neal Mohan made a similar statement this week about the possibility that OpenAI used YouTube to train its Sora video generation model. Bryant said Google takes “technical and legal measures” to prevent such abuse “where there is a clear legal or technical basis.”
Google also collected transcripts from YouTube, according to the Times' sources. Bryant said the company has trained its models “on some YouTube content in accordance with our agreements with YouTube creators.”
The Times writes that Google's legal department asked the company's privacy team to adjust its policy language to expand what it could do with consumer data, including data from office tools like Google Docs. The new policy was reportedly released on July 1 on purpose, to take advantage of the distraction of the Independence Day holiday weekend.
Meta similarly ran into the limits of good training data availability. According to records cited by the Times, the company's AI team discussed its unauthorized use of copyrighted works while working to catch up with OpenAI. After working through nearly every available English-language book, essay, poem, and news article on the internet, the company apparently considered steps such as paying for book licenses or even acquiring a major publisher outright. Privacy-focused changes the company made in the wake of the Cambridge Analytica scandal also apparently limited how it could use consumer data.
Google, OpenAI, and the broader AI training world are grappling with rapidly evaporating training data for their models, which get better the more data they absorb. The Journal wrote this week that companies could outpace the supply of new content by 2028.
Possible solutions to that problem, mentioned by the Journal on Monday, include training models on “synthetic” data created by their own models, or so-called “curriculum learning,” which involves feeding models high-quality data in an ordered manner in the hope that they can learn from far less information. Neither approach is proven yet. The other option for companies is to use whatever they can find, with or without permission, and given the several lawsuits filed over the past year, that approach is, to put it mildly, no less risky.