When it comes to AI chatbots, bigger is usually better.
Large language models like ChatGPT and Bard, which generate conversational prose, improve as they receive more data. Every day, bloggers take to the internet to explain how the latest developments, from an app that summarizes articles to AI-generated podcasts to a fine-tuned model that can answer any question about professional basketball, will “change everything.”
But making AI bigger and more capable requires processing power that few companies possess, and there is growing concern that a small group, including Google, Meta, OpenAI, and Microsoft, wields near-total control over the technology.
Also, larger language models are more difficult to understand. They are often described as “black boxes,” even by the people who design them, and leading figures in the field have expressed concern that AI’s goals may not ultimately align with our own. If bigger is better, it is also more opaque and more exclusive.
In January, a group of young academics working on natural language processing, the branch of AI focused on linguistic understanding, launched a challenge to try to change this paradigm. The group called for teams to create functional language models using data sets less than one-ten-thousandth the size of those used by the most advanced large language models. A successful mini-model would be nearly as capable as the high-end models, but much smaller, more accessible, and more human-friendly. The project is called the BabyLM Challenge.
“We are challenging people to think small and focus more on building efficient systems that more people can use,” said Aaron Mueller, a Johns Hopkins University computer scientist and BabyLM organizer.
Alex Warstadt, a computer scientist at ETH Zurich and another organizer of the project, added: “The challenge puts questions about human language learning, rather than ‘How big can we make our models?’, at the center of the conversation.”
Large language models are neural networks designed to predict the next word in a given sentence or phrase. They are trained for this task on a corpus of words collected from transcripts, websites, novels, and newspapers. A model guesses its way through example sentences and then adjusts its internal parameters according to how close each guess came to the correct answer.
By repeating this process over and over, a model builds maps of how words relate to one another. In general, the more words a model is trained on, the better it becomes; each sentence gives the model context, and more context translates into a more detailed impression of what each word means. OpenAI’s GPT-3, released in 2020, was trained on 200 billion words; DeepMind’s Chinchilla, released in 2022, was trained on a trillion.
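The guess-and-adjust loop is easier to see in miniature. The sketch below, in Python with PyTorch, trains a deliberately tiny next-word predictor on a toy corpus; the corpus, the bigram model, and the training settings are illustrative assumptions, showing the objective at small scale rather than how GPT-3 or Chinchilla were actually built.

```python
# A minimal sketch of next-word prediction: guess, measure the error,
# nudge the parameters, repeat. Toy corpus and model sizes are illustrative.
import torch
import torch.nn as nn

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
stoi = {w: i for i, w in enumerate(vocab)}  # word -> integer id

# Training pairs: each word is asked to predict the word that follows it.
xs = torch.tensor([stoi[w] for w in corpus[:-1]])
ys = torch.tensor([stoi[w] for w in corpus[1:]])

class BigramModel(nn.Module):
    def __init__(self, vocab_size, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # word -> vector
        self.out = nn.Linear(dim, vocab_size)       # vector -> score per word

    def forward(self, idx):
        return self.out(self.embed(idx))

model = BigramModel(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    logits = model(xs)           # one guess per training word
    loss = loss_fn(logits, ys)   # how far off were the guesses?
    opt.zero_grad()
    loss.backward()              # compute the adjustments
    opt.step()                   # nudge the parameters

# After training, ask the model what tends to follow "sat".
probs = torch.softmax(model(torch.tensor([stoi["sat"]])), dim=-1)
print(vocab[probs.argmax().item()])  # prints "on", the word that always follows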
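```

Even at this scale, the mechanics are the same as in a production model: each pass sharpens the model’s map of which words follow which; the difference is measured in words of training data and in parameters.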
For Ethan Wilcox, a linguist at ETH Zurich, the fact that something non-human can generate language presents an exciting opportunity: Could AI language models be used to study how humans learn language?
For example, nativism, an influential theory dating back to the early work of Noam Chomsky, claims that humans learn language quickly and efficiently because they have an innate understanding of how language works. But language models also learn language quickly, and apparently without an innate understanding of how language works, so perhaps nativism doesn’t hold water.
The problem is that language models learn very differently from humans. Humans have bodies, social lives, and rich sensations. We can smell mulch, feel the vanes of feathers, bump into doors, and taste mints. At first, we are exposed to simple spoken words and syntax that are often not represented in writing. So, Dr. Wilcox concluded, a computer that produces language after being trained on billions of written words can tell us only so much about how we learn language ourselves.
But if a language model were exposed only to the words a young human encounters, it could interact with language in ways that might address questions we have about our own abilities.
So, along with half a dozen colleagues, Dr. Wilcox, Dr. Mueller, and Dr. Warstadt conceived of the BabyLM Challenge, to try to bring language models a little closer to human understanding. In January, they sent out a call for teams to train language models on roughly the number of words a 13-year-old human has encountered: about 100 million. Candidate models would be tested on how well they generated language and captured its nuances, and a winner would be declared.
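The article does not specify the challenge’s test suite, but one common way to check whether a model has captured a grammatical nuance is minimal-pair scoring, as in benchmarks such as BLiMP: the model should assign higher probability to the acceptable sentence of a pair. Here is a hedged sketch of that idea, using the public GPT-2 checkpoint from the Hugging Face transformers library as a stand-in for a candidate model; the sentence pair is an invented example.

```python
# Minimal-pair scoring: a model that has learned subject-verb agreement
# should find the grammatical sentence more probable than its twin.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(text):
    """Total log-probability the model assigns to a sentence."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token;
    # multiply back by the number of predictions to get a total.
    return -out.loss.item() * (ids.shape[1] - 1)

grammatical = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."
print(sentence_log_prob(grammatical) > sentence_log_prob(ungrammatical))
```

A model trained on a child-sized diet of 100 million words can be probed with exactly the same comparison, which is what makes this kind of test attractive for small models.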
Eva Portelance, a linguist at McGill University, came across the challenge the day it was announced. Her research straddles the often blurred line between computer science and linguistics. The first forays into AI, in the 1950s, were driven by a desire to model human cognitive abilities on computers; the basic unit of information processing in AI is the “neuron”, and the first language models of the 1980s and 1990s were directly inspired by the human brain.
But as processors became more powerful and companies began working on marketable products, computer scientists realized that it was often easier to train language models on huge amounts of data than to force them into psychologically informed structures. As a result, Dr. Portelance said, “they give us text that is humanlike, but there is no connection between us and how they function.”
For scientists interested in understanding how the human mind works, these large models offer limited insight. And because they require tremendous processing power, few researchers can access them. “Only a small number of industry labs with enormous resources can afford to train models with billions of parameters on trillions of words,” Dr. Wilcox said.
“Or even to load them,” added Dr. Mueller. “This has made research in the field feel a little less democratic of late.”
The BabyLM Challenge, Dr. Portelance said, could be seen as a step away from the arms race for larger language models and a step toward more accessible and intuitive AI.
Large industry laboratories have not ignored the potential of such a research program. Sam Altman, CEO of OpenAI, said recently that increasing the size of language models would not lead to the same kind of improvements seen in recent years. And companies like Google and Meta have also been investing in research into more efficient language models, informed by human cognitive structures. After all, a model that can generate language when trained on less data could also be scaled up.
Whatever gains a successful BabyLM might yield, for those behind the challenge the goals are more academic and abstract. Even the prize subverts the practical. “Just pride,” Dr. Wilcox said.