- OpenAI, Anthropic, and other AI companies are running short of high-quality data to train their models.
- This could hinder AI development as companies race to develop the best products in a fast-growing field.
- The Wall Street Journal reports that companies are now exploring other ways to train AI, such as using synthetic data.
Companies like OpenAI and Anthropic are scrambling to secure one of AI's most valuable resources: reliable data. A shortage of it could hinder the development of the large language models that power chatbots, just as the race to build the best products in a growing field intensifies.
OpenAI's ChatGPT and its chatbot competitors are typically trained on reams of information scraped from the web, such as scientific papers, news articles, and Wikipedia entries, to generate human-like responses. In theory, the higher the quality and reliability of the data these models use, the better they are at producing accurate, desired outputs.
A shortage could therefore make it difficult for companies to make their AI products smarter. Pablo Villalobos, an AI expert at research firm Epoch, told the Wall Street Journal that by 2028, the demand for high-quality data could outstrip the supply of available training materials by more than 50%.
So why are tech companies struggling to find reliable data?
First, only a fraction of online data is generally suitable for AI training. That's because most public information on the web contains sentence fragments and other textual flaws that can prevent AI from generating conversational responses. The shortage is exacerbated by the large amount of AI-generated text already on the internet, which can contaminate models and lead them to produce nonsense. Experts call this process “model collapse.”
In addition, major news organizations, social media platforms, and other public sources are restricting access to their content for AI training due to concerns about copyright, privacy, and fair compensation. People also don't seem keen on handing over their iMessage conversations and other private text data for training purposes.
As a result, companies are scrambling to discover new data sources to power their tools. For example, sources told the Journal that OpenAI is discussing training its latest model, GPT-5, with YouTube video transcripts.
OpenAI is also discussing creating a data marketplace where providers could be compensated for content the company deems valuable for model training, sources familiar with the matter told the Journal. Google is reportedly considering a similar approach, according to the Journal, but researchers have not yet built a system to properly implement it.
Other companies are experimenting with so-called synthetic data to further develop their models. Jared Kaplan, chief scientist at Anthropic, said in an October 2023 interview with Bloomberg that Anthropic has generated data internally and fed it into Claude, its family of AI chatbots. OpenAI, which developed ChatGPT, is also considering that tactic, a spokesperson told the Journal.
Concerns about a lack of data come as users complain about the quality of AI chatbots.
Some users of GPT-4, OpenAI's most advanced model behind ChatGPT, say the bot has trouble following instructions and responding to queries. Google suspended the AI image generation feature of its Gemini model after users complained that it generated historically inaccurate images of US presidents. More generally, AI models tend to hallucinate false information that appears to be accurate.
While companies are looking for ways to continue training their models, some are willing to limit the size of their AI for the time being.
“I think we're at the end of the era where we have these giant mega-models,” OpenAI CEO Sam Altman said at an MIT event in 2023. “We will continue to make them better in other ways.”
OpenAI and Google did not respond to requests for comment from Business Insider prior to publication. Anthropic declined to comment.
Axel Springer, Business Insider's parent company, has a global deal that allows OpenAI to train models based on its media brands' reporting.