For more than 20 years, Kit Loffstadt has written fan fiction exploring alternate universes for the heroes of “Star Wars” and the villains of “Buffy the Vampire Slayer,” sharing her stories free online.
But in May, Ms. Loffstadt stopped posting her creations after learning that a data company had copied her stories and fed them into the AI technology underlying ChatGPT, the viral chatbot. Dismayed, she hid her writing behind a locked account.
Ms. Loffstadt also helped organize an act of rebellion last month against AI systems. Along with dozens of other fan fiction writers, she posted a deluge of irreverent stories online to overwhelm and confuse the data-gathering services that feed writers’ work into AI technology.
“Each of us has to do everything we can to show them that the result of our creativity is not for the machines to harvest the way they like,” said Ms. Loffstadt, a 42-year-old voice actress from South Yorkshire in Britain.
Fan fiction writers are just one group now staging revolts against artificial intelligence systems as AI fever has gripped Silicon Valley and the world. In recent months, social media companies like Reddit and Twitter, news organizations like The New York Times and NBC News, authors like Paul Tremblay and the actress Sarah Silverman have spoken out against AI siphoning their data without permission.
Their protests have taken different forms. Writers and artists are locking their files to protect their work or boycotting certain websites that post AI-generated content, while companies like Reddit want to charge for access to their data. This year, at least 10 lawsuits have been filed against AI companies, accusing them of training their systems on the creative work of artists without consent. Last week, Ms. Silverman and the authors Christopher Golden and Richard Kadrey sued OpenAI, the creator of ChatGPT, and others over AI’s use of their work.
At the heart of the rebellions is a new understanding that information online—stories, artwork, news articles, message board posts, and photos—can hold significant untapped value.
Known as “generative AI” for the text, images and other content it generates, the new wave of AI builds on complex systems, such as large language models, that are capable of producing human-like prose. These models are trained on aggregations of all kinds of data so they can answer people’s questions, mimic writing styles, or produce comedy and poetry.
That has sparked a search by tech companies for even more data to feed their AI systems. Google, Meta, and OpenAI have essentially used information from all over the Internet, including vast databases of fan fiction, troves of news articles, and collections of books, many of which were freely available online. In tech industry jargon, this is known as “scraping” the Internet.
OpenAI’s GPT-3, an artificial intelligence system released in 2020, was trained on 500 billion “tokens,” pieces of words drawn mostly from the internet. Some AI models are trained on more than a trillion tokens.
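The “tokens” mentioned above can be illustrated with a toy example. Real models use learned subword vocabularies (such as byte-pair encoding) rather than the simple scheme below; the greedy splitter and the tiny vocabulary here are invented purely for illustration.

```python
# Toy illustration of tokenization: real models learn subword
# vocabularies (e.g. byte-pair encoding); this is a simplified sketch.
def toy_tokenize(text, vocab):
    """Greedily split text into the longest known vocabulary pieces,
    falling back to single characters when nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try longest match first
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

vocab = {"fan", " fic", "tion", " writ", "ers"}
print(toy_tokenize("fan fiction writers", vocab))
# ['fan', ' fic', 'tion', ' writ', 'ers']
```

A sentence thus becomes a handful of word fragments, which is why a 500-billion-token corpus corresponds to an enormous, but smaller, number of words.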
The practice of Internet scraping is longstanding and was largely publicized by the companies and non-profit organizations that did it. But it was not well understood or seen as especially problematic by the companies that own the data. That changed after ChatGPT debuted in November and the public learned more about the underlying AI models that power chatbots.
“What is happening here is a fundamental realignment of the value of data,” said Brandon Duderstadt, founder and CEO of Nomic, an AI company. “Previously, the idea was to get value from data by opening it up to everyone and running ads. Now, the idea is that you lock your data, because you can extract much more value when you use it as input for your AI.”
Data protests may have little long-term effect. Big-money tech giants like Google and Microsoft already sit on mountains of proprietary information and have the resources to license more. But as the era of easy-to-scrape content comes to a close, smaller AI startups and nonprofits hoping to compete with the big companies may not be able to get enough content to train their systems.
In a statement, OpenAI said that ChatGPT was trained on “licensed content, publicly available content, and content created by human AI trainers.” It added: “We respect the rights of creators and authors, and look forward to continuing to work with them to protect their interests.”
Google said in a statement that it was involved in discussions about how publishers might manage their content in the future. “We believe that everyone benefits from a vibrant content ecosystem,” the company said. Microsoft did not respond to a request for comment.
Data protests broke out last year after ChatGPT became a worldwide phenomenon. In November, a group of programmers filed a proposed class action lawsuit against Microsoft and OpenAI, claiming the companies had infringed their copyrights after their code was used to train an AI-powered programming assistant.
In January, Getty Images, which provides stock photos and video, sued Stability AI, an artificial intelligence company that creates images from text descriptions, alleging that the start-up had used copyrighted photos to train its systems.
Then, in June, Clarkson, a law firm in Los Angeles, filed a 151-page class action lawsuit against OpenAI and Microsoft, describing how OpenAI had collected data from minors and saying that web scraping violated copyright law and constituted “theft.” On Tuesday, the firm filed a similar lawsuit against Google.
“The data rebellion we’re seeing across the country is society’s way of rejecting this idea that Big Tech simply has the right to take all information from any source and make it their own,” said Ryan Clarkson, the founder of Clarkson.
Eric Goldman, a professor at Santa Clara University School of Law, said the arguments in the lawsuit were broad and the court was unlikely to accept them. But the wave of litigation is just beginning, he said, with a “second and third wave” that will define the future of AI.
Bigger companies are also pushing back against AI scrapers. In April, Reddit said it wanted to charge for access to its application programming interface, or API, the method by which third parties can download and analyze the social network’s vast database of person-to-person conversations.
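Charging for API access typically means gating requests behind credentials such as keys or tokens. The sketch below only illustrates that general pattern: the endpoint URL and key are hypothetical, not Reddit’s actual paid-API contract.

```python
import urllib.request

# Illustrative sketch: the endpoint and key below are invented.
# Paid APIs generally require each request to carry a credential,
# letting the provider identify, meter, and bill the caller.
def build_api_request(endpoint, api_key):
    """Build an authenticated HTTP request for a hypothetical paid API."""
    req = urllib.request.Request(endpoint)
    req.add_header("Authorization", f"Bearer {api_key}")
    req.add_header("User-Agent", "example-client/0.1")
    return req

req = build_api_request("https://api.example.com/v1/comments", "MY_KEY")
print(req.get_header("Authorization"))  # Bearer MY_KEY
```

Requests without a valid key can then be rejected or rate-limited, which is what makes bulk scraping through the official interface something a platform can meter and charge for.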
Steve Huffman, Reddit’s CEO, said at the time that his company didn’t “need to give all that value to some of the biggest companies in the world for free.”
That same month, Stack Overflow, a question-and-answer site for computer programmers, said it would also ask artificial intelligence companies to pay for its data. The site has almost 60 million questions and answers. Its move was previously reported by Wired.
News organizations are also resisting AI systems. In an internal memo on the use of generative AI in June, The Times said AI companies should “respect our intellectual property.” A Times spokesman declined to provide further details.
For individual artists and writers, fighting AI systems has meant rethinking where they publish.
Nicholas Kole, 35, an illustrator from Vancouver, British Columbia, was alarmed at how an AI system could replicate his distinctive artistic style and suspected that the technology had scraped his work. He plans to continue posting his creations on Instagram, Twitter and other social media sites to attract customers, but has stopped posting on sites like ArtStation that display AI-generated content alongside human-generated content.
“It just feels like pointless theft of me and other artists,” Mr. Kole said. “It puts a pit of existential dread in my stomach.”
At the Archive of Our Own, a fan fiction database with more than 11 million stories, writers have increasingly pushed the site to ban data scraping and AI-generated stories.
In May, when some Twitter accounts shared examples of ChatGPT mimicking the style of popular fan fiction posted on the Archive of Our Own, dozens of writers took up arms. They locked their stories and wrote subversive content to mislead the AI scrapers. They also lobbied Archive of Our Own leaders to stop allowing AI-generated content.
Betsy Rosenblatt, who provides legal advice to Archive of Our Own and is a professor at the University of Tulsa Law School, said the site had a “maximum inclusion” policy and did not want to be in the position of discerning which stories were written with AI.
For Ms. Loffstadt, the fan fiction writer, the fight against AI crystallized while she was writing a story about “Horizon Zero Dawn,” a video game in which humans battle AI-powered robots in a post-apocalyptic world. In the game, she said, some of the robots were good and some were bad.
But in the real world, she said, “thanks to corporate arrogance and greed, they are being distorted into doing bad things.”