

Artificial Language Models Teach Us Nothing About Language

Neuroscientific research with artificial language models gives no new insights.

Key points

  • Large language models (LLMs) are capable of processing large quantities of data quickly and accurately.
  • However, this alone may not be enough to understand how the human brain responds to language.

A wave of neuroscientific research has attempted to exploit the sophisticated statistical power of large language models (LLMs) to explore how the human brain responds to language. Yet one issue that some linguists feel has not been well addressed is whether this kind of research exposes any new facts about language; it may, in fact, pose an obstacle to genuine scientific insight.

As the accuracy of natural language processing (NLP) models increased over the 2010s, there was a clear sacrifice in interpretability. Effectively, the more powerful and accurate language models have become, the less cognitively plausible they seem to be. Even though bigger language models “learn” aspects of human language, such as syntactic relations, more accurately than smaller models, their need to learn syntax at all decreases for most of the tasks we actually use them for.

Other concerns have arisen. Humans parse sentences hierarchically, yet LLMs appear to have strong linear biases. Some researchers discuss this issue in a tone that downplays linguistic creativity and generativity, the hallmark of human language. Idan Blank noted recently that “language processing is arguably more than just prediction”, much like visual attention is “arguably” more than just photoreception.

Models of Language

To brush over a rich and controversial history: an important theme in recent theoretical linguistics is that many properties of linguistic theory initially carried over from the formal systems and mathematical models of the 1950s-70s are not appropriate for characterizing human psychology. Several of these themes carry important implications for how we make appropriate use of LLMs.

Certain linguistic theories tend to be tied more intimately to research using LLMs because of their interest in the explanatory power of domain-general reasoning. For example, the framework of construction grammar is built on the assumption that humans memorize a large number of multi-unit constructions and then manipulate these memorized objects. Yet even with memorized constructions, we still need some kind of generative system to modify them, or to form them in the first place.

Frameworks such as construction grammar confuse the artifacts of the language system (the outputs of language, like constructions) with the object of investigation itself. “Constructions” are a result of language; they do not constitute it. They are not a plausible object of psycholinguistic investigation: far too many independent factors conspire to produce any given construction.

These objections are important for our understanding of artificial models. We cannot take a large number of constructions (i.e., linearized outputs of the underlying generative computational procedure of syntax) and expect to explain human language. By focusing on constructions and their distributional statistics, we will assuredly get statistically significant approximations of parsing data, and even of neural responses, but such objects are much too coarse-grained to serve as objects of linguistic theory.

Neurobiology

Last month, a paper from the lab of Evelina Fedorenko at MIT, published in Neurobiology of Language, argued that “lexical semantic content, not syntactic structure, is the main contributor to ANN-brain similarity of fMRI responses in the language network.” Yet simply because artificial neural networks (ANNs) align best with fMRI blood oxygenation level-dependent (BOLD) responses via lexico-semantics, it does not follow that syntactic information is not neurally represented.

The effects documented in the Fedorenko Lab paper are heavily driven by content words, which carry clear conceptual semantic content, whereas we know from behavioral research that function words incur very little processing cost. However, we also know that functional grammatical structure is essential for delivering syntactic information, and some linguists have gone as far as to argue that cross-linguistic diversity emerges exclusively from functional grammatical information (as opposed to content words like nouns and verbs).

Fedorenko and colleagues conclude with this: “The critical result—that lexical-semantic content is the main contributor to the similarity between ANN representations and neural ones—aligns with the idea that the goal of the human language system is to extract meaning from linguistic strings.” What goes unnoticed is that “lexico-semantic content” also delivers modifications to syntactic information.

One would be hard-pressed to find anybody who would disagree with the idea that humans use language to extract meaning. This is not a scientific discovery. To my knowledge, there are no predictions from within theoretical linguistics concerning the scale of neural activity or complexity at which syntax is encoded. If it is not to be found in the BOLD signal, then so much the worse for fMRI.

Another Fedorenko Lab paper from August used GPT-based encoding models and fMRI data to accurately predict neural responses to language. The authors conclude: “A systematic analysis of the model-selected sentences reveals that surprisal and well-formedness of linguistic input are key determinants of response strength in the language network.”
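For readers curious about what a GPT-based “encoding model” involves in practice, here is a minimal sketch of the general recipe: sentence features are extracted from a language model, and a regularized regression is fit to predict brain responses. The model name (“gpt2”), the placeholder stimuli, and the simulated voxel data below are my own assumptions for illustration; this is not the authors’ actual pipeline.

```python
# A minimal sketch of an encoding model: language-model features -> ridge
# regression -> predicted voxel responses. Stimuli and voxel data are
# placeholders, not the published materials.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

def sentence_features(sentences):
    """Mean-pool the final-layer hidden states of each sentence."""
    feats = []
    for s in sentences:
        inputs = tokenizer(s, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, n_tokens, 768)
        feats.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.stack(feats)

# Hypothetical stimuli and simulated responses for 100 "voxels".
sentences = [f"Sentence number {i} about an everyday event." for i in range(60)]
X = sentence_features(sentences)
y = np.random.randn(len(sentences), 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
encoder = RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X_train, y_train)
print("held-out R^2:", encoder.score(X_test, y_test))
```

With real fMRI data in place of the random matrix, the held-out fit of such a regression is what is meant by the model “accurately predicting” neural responses.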

There are no reasons from within linguistic theory to doubt the processing importance of surprisal. Fedorenko and colleagues test whether their language model predicts responses to sentences expected to trigger minimal language network activity (“We were sitting on the couch”) versus sentences meant to trigger maximal activity (“People on Insta be like ‘Gross’”; “Jiffy Lube of therapies”; “Notice how you reacted to WTF”). These were often collated from social media.
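Since “surprisal” carries so much weight in this work, it is worth being concrete about what it is: the negative log-probability a language model assigns to each word given its preceding context. The sketch below, assuming GPT-2 and two of the sentences quoted above, shows how such a measure is typically computed; it is an illustration, not the authors’ exact procedure.

```python
# A hedged illustration of computing per-sentence surprisal with a causal
# language model: average negative log-probability (in bits) per token.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def mean_surprisal(sentence):
    """Average surprisal per token, skipping the first token (no context)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids  # (1, n_tokens)
    with torch.no_grad():
        logits = lm(ids).logits                               # (1, n_tokens, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)     # position t predicts token t+1
    n = ids.shape[1] - 1
    token_log_probs = log_probs[torch.arange(n), ids[0, 1:]]
    return (-token_log_probs.mean() / torch.log(torch.tensor(2.0))).item()

for s in ["We were sitting on the couch.", "Jiffy Lube of therapies."]:
    print(f"{s!r}: {mean_surprisal(s):.2f} bits per token")
```

On this kind of measure, a sentence like “Jiffy Lube of therapies” should come out as markedly less predictable than “We were sitting on the couch,” which is precisely what makes it a candidate for driving strong language network responses.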

Yet notions like “lexical access” and “semantic integration” (which Fedorenko and colleagues dismiss as outdated) crucially form part of theories. “Surprisal” measures are not theories. What is even more surprising is that Fedorenko and colleagues end up showing that measures of semantic plausibility and grammaticality both explain variance beyond surprisal. Yet they report these results without offering any theoretical account of them.

Thus, while the authors begin their paper by claiming that models of the brain built around linguistic theory are problematic and old-fashioned, they ultimately end up supporting these traditional concepts.

Like many others, I remain unconvinced that the best way to build a science of language processing in the brain is to use fMRI and expose participants to sentences like “People on Insta be like ‘Gross’” and then measure how surprised the language network is and see if this aligns with an encoding model.

Celebration of these results has been widespread both on social media and in some more established science publications. Neuroscience driven by novel exploitations of statistical patterns in neural and language data is part of the reason why The Guardian asked in 2022, “Are we witnessing the dawn of post-theory science?”

If we are, there is little to celebrate.
