March 22, 2024 5:59pm | Updated at 5:59pm (IST)
Taking a cue from the $60 million deal between Reddit and Google, this could be the beginning of a new era of data licensing for content platforms, one that offers few legal recourses to the individuals who leave their data on these sites.
Data scraping by big tech companies like OpenAI and Google to train AI models has already sparked a wave of copyright infringement lawsuits. Artists and writers with notable bodies of work were always expected to be the more litigious group: theft of their work is easier to spot and to contest.
Separately, deep-pocketed tech companies are signing deals with platforms to use their proprietary data for training AI systems. Google's $60 million licensing deal with Reddit heralds a boom in AI licensing for content platforms.
However, most content platforms have broad, sweeping data-use agreements with their users, and those agreements leave users with few legal recourses.
What is driving these AI licensing deals?
The data-scraping problem is crippling content platforms such as social media companies and news publishers that are already struggling to generate revenue. Reddit, the wormhole of the internet, has been trying to cut costs because it doesn't earn as much advertising revenue as other social media platforms.
CEO Steve Huffman was concerned that Reddit was spending tens of millions of dollars to support companies that wanted to take its data for free and then charge for the products they built with it.
Last June, the company decided for the first time to charge for access to its application programming interfaces (APIs), the systems outside software developers use to pull content from the platform. (X, the platform owned by Elon Musk, made a similar call and cut off free access to its API last April.)
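For a sense of what that access looks like in practice, here is a minimal sketch using Reddit's public JSON listing endpoints and the Python requests library; the subreddit, parameters, and fields are illustrative, and production access now goes through Reddit's authenticated, paid Data API.

```python
# A minimal sketch of the kind of API access at stake: fetching
# public posts from a subreddit's JSON listing endpoint.
import requests

resp = requests.get(
    "https://www.reddit.com/r/technology/top.json",
    params={"limit": 5, "t": "day"},                # top 5 posts of the day
    headers={"User-Agent": "demo-script/0.1"},      # Reddit rejects blank agents
    timeout=10,
)
resp.raise_for_status()

# Listings nest posts under data -> children -> data
for child in resp.json()["data"]["children"]:
    post = child["data"]
    print(post["title"], "-", post["score"], "points")
```

Scrapers and model builders run requests like this at enormous scale, which is exactly the traffic Reddit now wants to meter and bill.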
The deal with Google is a blessing for Reddit, whose impending IPO has been hanging in the balance. Huffman's company will now reap $60 million a year and take advantage of the AI boom through revenue generated from API fees.
Other deals between AI companies and content platforms of various kinds, hungry for new sources of funding, are also on the horizon. A report by 404 Media revealed that Automattic, the parent company of Tumblr and WordPress, is in talks to strike deals with OpenAI and AI image tool maker Midjourney to provide training data drawn from user posts.
Last September, Meta admitted that the “vast majority” of the training data for its AI assistant consisted of public posts on Facebook and Instagram. Nick Clegg, the company's president of global affairs, told Reuters that Meta had avoided datasets heavy in personal information, such as posts from LinkedIn.
Is social media content safe?
But that doesn't translate into tangible relief for social media users. “Anyone who posts publicly on social media platforms should be aware that their content may be used to train AI models, whether they use a large social platform or a smaller one like Reddit. Social platforms are rich in human-generated information in the form of posts, links, photos, and videos, and AI models require human oversight and validation to provide factually correct information,” said technology analyst Debra Aho Williamson.
These partnerships show that AI companies recognize the value of human-generated content for training and improving their models. But that is no reason to worry less. “Meta, for example, has built a huge advertising business selling targeted ads, and its success rests on a world-class database of information about its users. Users already find that creepy, and using the same information to improve AI models, even if it's public information, can feel just as invasive,” she pointed out.
Williamson said that if TikTok's ban in the U.S. weren't looming, it would have been the social media platform most likely to sign a deal with an AI company. “TikTok videos contain a wealth of information, and much of that content is public. LinkedIn, too, holds data that could be useful to AI companies,” she said. LinkedIn, famously, is owned by Microsoft, which has deep AI partnerships with companies like OpenAI and Mistral.
Very little is off-limits. “Content creators of all kinds, including social media users, journalists, designers, and filmmakers, have to worry about how AI companies will use their content, and about the possibility that AI companies will completely disintermediate their businesses,” she explained.
How can you prevent tools from misusing your data?
The terms of service of most content platforms are usually an extensive, barely decipherable word salad. Grammarly's user agreement, for example, explicitly states that it does not sell your data to third parties; but the company does store and archive your data, something the agreement never spells out in so many words:
“You grant us a license to your User Content for the limited purposes of: operating, providing, improving, troubleshooting, and debugging our products (for example, your acceptance or rejection of suggestions may help train our suggestion engine); protecting our products (for example, analyzing usage patterns to prevent abuse); customizing our products (for example, providing you with personalized offers); developing new products or features (for example, creating a tone detector); and using information you upload or provide to us (such as your name) to encourage other people in your organization to join your Grammarly Business or Grammarly for Education team account.”
Once these platforms reach a certain level of utility, it becomes difficult for users to avoid them. “The sad truth is that unless you're a Luddite and refuse to use the internet (or any technology, really), there's not much you can do about it. It's like asking what you can do to protect yourself when your device collects diagnostic data that is used to train and improve its ‘internal’ systems. If you use a Microsoft device, tomorrow OpenAI's products could be considered ‘internal systems’ by Microsoft,” said Subha Chugh, a lawyer and AI advisor.
Her advice is to always assume that your data will eventually be shared with multiple parties. “Use an end-to-end encrypted service to share sensitive information, and do not upload sensitive data to third-party tools or language models online,” she said.
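To illustrate Chugh's point, here is a minimal sketch, assuming Python's cryptography package, of encrypting a document locally before it ever leaves your machine; the file name is hypothetical, and any tool or platform that receives only the ciphertext cannot mine its contents for training.

```python
# A minimal sketch: encrypt a file locally before sharing it,
# so third-party services only ever see ciphertext.
# Assumes the 'cryptography' package (pip install cryptography).
from cryptography.fernet import Fernet

key = Fernet.generate_key()       # keep this key somewhere safe; it decrypts everything
fernet = Fernet(key)

with open("contract.pdf", "rb") as f:             # hypothetical file name
    ciphertext = fernet.encrypt(f.read())

with open("contract.pdf.enc", "wb") as f:
    f.write(ciphertext)           # this encrypted blob is what you upload or share

# A recipient who receives the key out of band can reverse the process:
plaintext = fernet.decrypt(ciphertext)
```

The design choice matters: because encryption happens on your device and the key travels separately, no intermediary, however permissive its terms of service, can read or repurpose the data.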
Three weeks ago, e-signature company Docusign published an FAQ quietly admitting that it had used customer data, which typically includes business-sensitive and personal information, to train its AI. “Although it states that customers ‘contractually consent’ to such use, I cannot find any mention in the terms of customers consenting to these activities. And if customers did ‘contractually consent’, surely they can withdraw that consent without penalty?” Alexander Hanff, a data protection and privacy legal consultant, wrote on LinkedIn.
Hanff explained that “this should have been enough to sink the company, but the problem is that people use these tools out of convenience” and rarely give such loopholes serious thought.
Chugh also sees signs that people are becoming desensitized to privacy invasions. For most users, privacy-preserving alternatives to freely available tools come at a significant cost. “For many companies, data sharing is a direct or indirect source of revenue that allows them to keep the price of their services low. For many users, sharing data in exchange for access to a ‘free’ service is worth it,” she reasoned.
Given how widely AI is expected to be adopted, calls are growing for regulation that not only protects data privacy but also prevents misuse.
Ask OpenAI if it's possible to train an AI model without using copyrighted data, and the answer is a resounding no. In a written submission to an inquiry by the House of Lords' Communications and Digital Select Committee, Sam Altman's company said it would be “impossible” to train today's leading AI models without using copyrighted material.
However, a recent report in Wired shows that this isn't strictly true. Fairly Trained, a non-profit founded in January this year by Ed Newton-Rex, has awarded its first certification to a large language model called KL3M that was built without infringing copyright. Newton-Rex had resigned from his executive position at the image-generation startup Stability AI in protest against its data-scraping practices.
KL3M's dataset was developed by Chicago-based legal tech consulting startup 273 Ventures from carefully selected legal, financial, and regulatory documents.
Separately, a group of French government-backed researchers working with startups including Allen AI, Nomic AI, and EleutherAI has released what it claims is the largest AI training dataset composed entirely of public-domain text. Even at this small scale, such efforts suggest a hopeful future in which ethical datasets are possible, even if building them takes time.