Nobody Knows Anything About These Models
We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence. - Noam Shazeer, GLU Variants Improve Transformer, Feb 2020
We badly need a better theoretical understanding of LLM learning and capabilities. It has become clear that LLMs can learn something we could call "generalized intelligence" — they are not merely memorizing a massive corpus, but are able to apply that corpus to tasks that are in some non-trivial sense outside of their training data. But what is lacking is any theoretical understanding of why they can do this, what this generalized intelligence is, and what its limitations are.
The Current State of Research
Obviously this has been the subject of much recent research, which can be broadly broken down into two categories:
- Experimental research that attempts to understand how existing trained models work
- Theoretical research that attempts to characterize what kinds of problems models should be able to solve based on their size, structure and (less often) their training data or objectives
Experimental Research
So far, ~all of the results of interest to people building or working with the models have been experimental, and most of them are based on analogies to human thought such as "attention" or "concepts". The leader in this space by far is Anthropic, who recently released a pair of blockbuster papers:
- Circuit Tracing: Revealing Computational Graphs in Language Models
- On the Biology of a Large Language Model
These results are able to at least partially[0] explain how the LLMs under analysis answer certain questions or perform tasks such as simple addition. For example, for the prompt "the capital of the state containing Dallas is " they find the next token "Austin" is primarily influenced by a pair of implications encoded in the relationships between nodes — first one between "state", "Dallas" and "Texas" that encodes the fact that Dallas is in Texas, then one between "capital", "Texas" and "Austin" that encodes the fact that Austin is the capital of Texas. In fact, all of the attribution graphs they produce can be characterized as following a chain of implications, which the model learned during training and is able to activate based on the input tokens and the intermediate nodes activated by prior implications.
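As a toy illustration only (the dictionaries and function below are invented stand-ins for learned features, not Anthropic's actual methodology), the two-hop structure behind the Dallas example can be sketched as a pair of chained lookups:

```python
# Toy illustration of the two-hop "chain of implications" structure that the
# attribution graphs reveal. The dictionaries are stand-ins for relationships
# the model memorized during training; they are not Anthropic's actual features.

CITY_TO_STATE = {"Dallas": "Texas", "Chicago": "Illinois"}          # "Dallas is in Texas"
STATE_TO_CAPITAL = {"Texas": "Austin", "Illinois": "Springfield"}   # "the capital of Texas is Austin"

def complete_capital_prompt(city: str) -> str:
    """Mimic completing 'the capital of the state containing <city> is ' as two chained lookups."""
    state = CITY_TO_STATE[city]        # first implication: city -> state
    return STATE_TO_CAPITAL[state]     # second implication: state -> capital

print(complete_capital_prompt("Dallas"))  # -> "Austin"
```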
Limitations
This is able to work for a large class of interesting problems, including those which might at first glance seem implausible such as addition and translation, because the models have memorized a staggering number of implications — such as the approximate sums of all pairs of integers up to some size (which are then translated into exact sums by combining them with a lookup table of least-significant-digit sums). However, it does not seem plausible that the same kind of process can explain how a model does something like writing a sophisticated algorithm. Attempting to apply the same kind of attribution graph analysis to more complex problems runs into several issues.
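Before turning to those issues, here is a schematic sketch of the addition decomposition just described. It is a simplification of my own (the "coarse part" below is exact rather than the fuzzy magnitude features the papers describe), intended only to show how a coarse magnitude component can be combined with a memorized table of last-digit sums and carries to produce an exact answer:

```python
# Schematic sketch of the addition decomposition: a coarse magnitude component
# plus a memorized table of least-significant-digit sums (with carries). This
# illustrates the idea, not the literal circuit recovered in the papers.

# Memorized table: for each pair of final digits, the final digit of the sum
# and whether it produces a carry.
LAST_DIGIT_TABLE = {(i, j): ((i + j) % 10, (i + j) >= 10)
                    for i in range(10) for j in range(10)}

def coarse_part(a: int, b: int) -> int:
    """Stand-in for the coarse 'magnitude' component: the sum with the final digits dropped."""
    return (a // 10 + b // 10) * 10

def add(a: int, b: int) -> int:
    last, carry = LAST_DIGIT_TABLE[(a % 10, b % 10)]
    return coarse_part(a, b) + (10 if carry else 0) + last

assert add(36, 59) == 95
assert all(add(a, b) == a + b for a in range(200) for b in range(200))
```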
First, as the graph becomes deeper, the number of potential intermediate nodes explodes, and it becomes more important to understand how the model knows to weight particular intermediate nodes before the entire graph is realized. It is one thing to know that "Texas" is going to be relevant to completing "the capital of the state containing Dallas is ", quite another to know what lemmas to prove when proving a theorem.
But perhaps a more fundamental limitation is that the intermediate nodes are labeled by observing their activations across various completions and identifying chains of reasoning that are intelligible to the labeler. This means labeling is only possible when the algorithm being used is understood by the labeler. But the "generalized intelligence" used by humans to e.g. write a novel program is not understood by any human — in fact this is what makes LLMs so interesting in the first place! And even if it were, we have no idea how closely analogous it is to what LLMs are doing when they appear to exhibit similar capabilities.
Theoretical Research
Theoretical research tends to focus either on the training datasets for models or on the model topology. Neither approach has had any success at analyzing models or datasets of anywhere near the complexity of frontier LLMs.
Our theoretical understanding of datasets is generally based on one of two approaches:
- Information-theoretic arguments that treat datasets as random samples from some distribution
- Analytic arguments that treat datasets as noisy clusters around linear subspaces or manifolds
The information-theoretic arguments tend to produce very loose bounds that provide no insight into real use cases, because practical datasets such as text corpora are far from any standard statistical distribution in ways we are not currently able to capture. The analytic arguments have not been successfully applied to these datasets at all, because we have not found a better way to characterize a manifold approximation of them besides empirically deriving one from a model trained on them, which rarely tells us anything generalizable beyond that specific trained model.
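To make the "noisy clusters around linear subspaces" picture concrete, here is a toy example on synthetic data (my construction, not a real corpus): points scattered around a low-dimensional subspace of a high-dimensional space, whose dimension can be read directly off the singular value spectrum. Nothing remotely this clean is known to hold for text corpora, which is precisely the problem:

```python
# Toy illustration of the "noisy cluster around a linear subspace" picture:
# points near a 3-dimensional subspace of a 50-dimensional space, with the
# subspace dimension recovered from the spectrum of the data matrix.
import numpy as np

rng = np.random.default_rng(1)
n, ambient_dim, intrinsic_dim, noise = 1000, 50, 3, 0.01

basis = rng.normal(size=(intrinsic_dim, ambient_dim))   # spans the subspace
coords = rng.normal(size=(n, intrinsic_dim))            # positions on the subspace
data = coords @ basis + noise * rng.normal(size=(n, ambient_dim))

singular_values = np.linalg.svd(data - data.mean(axis=0), compute_uv=False)
estimated_dim = int(np.sum(singular_values > 10 * noise * np.sqrt(n)))
print(estimated_dim)  # 3: the data is well described by a 3-dimensional subspace
```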
The situation on the model side is not much better. We have some basic results:
- Under plausible assumptions, any sufficiently large LLM (regardless of exact architecture[1]) will eventually learn the smallest-norm solution that reproduces the training data within some error $\epsilon$ (see the toy linear sketch after this list).
- There are certain tasks for which we know they cannot learn algorithms that generalize, e.g. recognizing the language consisting of all strings of the form $a^n b^n$.
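As a toy illustration of what "the smallest-norm solution that reproduces the training data" looks like, here is the overparameterized linear regression case (my setup; the actual results concern far more general architectures), where the minimum-norm interpolating solution is given by the pseudoinverse:

```python
# Linear-regression toy: with more parameters than data points there are
# infinitely many weight vectors that fit the data exactly (epsilon = 0 here),
# and the pseudoinverse gives the minimum-norm one. Any other interpolating
# solution differs by a null-space component and has strictly larger norm.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))   # 20 examples, 100 parameters: overparameterized
y = rng.normal(size=20)

w_min_norm = np.linalg.pinv(X) @ y                # minimum-norm solution of X w = y

null_component = rng.normal(size=100)
null_component -= np.linalg.pinv(X) @ (X @ null_component)   # project out the row space
w_other = w_min_norm + null_component             # another exact interpolator

assert np.allclose(X @ w_min_norm, y)
assert np.allclose(X @ w_other, y)
assert np.linalg.norm(w_min_norm) < np.linalg.norm(w_other)
```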
Limitations
However, the limitations of the current theoretical results are numerous:
- Most results concern very simple functions, which generalized intelligence clearly is not.
- Most results are only able to handle single-layer models, while practical models generally employ dozens of layers.
- Most LLMs are trained on data that we do not want them to reproduce (almost) exactly, e.g. if there is only one sentence in their training data that starts like "My favorite in this whole wide world is " we do not want them to assign 99.99% probability to it ending with "apples".
- Results about what kinds of algorithms an LLM can execute do not translate at all to results about what kinds of algorithms they can write.
These limitations would seem to suggest that LLMs cannot work nearly as well as they do. But in practice we have found that:
- With carefully tuned regularization weights and limited training to avoid over-fitting, models tend to learn much more generalizable parameters that perform well outside of the training data.
- Rather than trying to have the model natively recognize a tricky formal language class, giving it access to e.g. a Python interpreter and letting it write programs which the interpreter then executes works quite well (a minimal sketch of this pattern follows this list).
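Here is a minimal sketch of the interpreter pattern from the second bullet. The generate_program function below is a stand-in for an LLM call and simply returns, verbatim, the kind of short recognizer a model typically writes for the $a^n b^n$ task; the point is that the program is executed by an interpreter rather than having the model classify strings directly:

```python
# Minimal sketch of the "give the model an interpreter" pattern. In a real
# system generate_program would be an LLM call; here it is a stub returning a
# hand-written recognizer of the kind a model typically produces.
import re

def generate_program(task_description: str) -> str:
    # Stand-in for "ask the model to write a recognizer for this language".
    return (
        "def recognize(s):\n"
        "    m = re.fullmatch(r'(a*)(b*)', s)\n"
        "    return bool(m) and len(m.group(1)) == len(m.group(2))\n"
    )

# Execute the model-written program with an interpreter instead of asking the
# model to classify strings itself.
namespace = {"re": re}
exec(generate_program("recognize the language a^n b^n"), namespace)
recognize = namespace["recognize"]

assert recognize("aaabbb") is True
assert recognize("aaabb") is False
assert recognize("ababab") is False   # not of the form a^n b^n
```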
Unfortunately, both of these techniques place LLMs well outside the reach of current formal analysis techniques. We have been completely unable to bridge the gap between the sort of programs the model itself can encode and the sort of programs it can learn to write.
Is Understanding Possible?
Given the fundamental limitations of the experimental approaches, what would a theoretical analysis have to look like in order to apply to LLMs insofar as they appear to exhibit some sort of generalized intelligence that goes beyond reproducing their training data?
First, let's try to formalize the problem statement. For any set of training data $D$ and model $M$ with bounded parameters and bounded precision, there is a minimum error $\epsilon_{\min}$ with which $M$ can learn $D$, and some minimum parameter norm $w_{\min}$ at which this error can be achieved (in many cases $w_{\min}$ will be the maximum possible parameter norm). In order for $M$ to be able to learn the desired parameter set $\theta^*$ which produces generalized intelligence rather than exact memorization of $D$, one of the following must hold:
- $\theta^*$ achieves the minimum error $\epsilon_{\min}$, which would imply $\lVert\theta^*\rVert \ge w_{\min}$
- $\lVert\theta^*\rVert < w_{\min}$ and the (regularized) training error is locally minimized at $\theta^*$[2]
Intuitively these correspond to two very different situations:
- The model is too small to approximately memorize the data and can only approximate it using generalizable methods
- The model is overparameterized and we are relying on regularization to favor generalizable methods over memorization (written out explicitly in the sketch after this list)
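In the notation above (this is a sketch of the argument, not a result from the literature), the two cases can be written out explicitly; the second requires the generalizing parameters $\theta^*$ to be a stationary point of the regularized objective while having smaller norm than any memorizing solution:

```latex
% Training (approximately) minimizes a regularized objective over parameters theta:
\[
  \hat{\theta} \;=\; \arg\min_{\theta}\; L(\theta; D) \;+\; \lambda \lVert \theta \rVert^{2}
\]
% Case 1 (too small to memorize): the generalizing parameters already attain the
% best achievable error, so the minimum-norm, minimum-error solution generalizes:
\[
  L(\theta^{*}; D) \;=\; \epsilon_{\min}
\]
% Case 2 (overparameterized): theta^* fits worse than memorization but has smaller
% norm, and for a suitable lambda it is a local minimum of the regularized objective:
\[
  \lVert \theta^{*} \rVert \;<\; w_{\min},
  \qquad
  \nabla_{\theta}\!\left[\, L(\theta; D) + \lambda \lVert \theta \rVert^{2} \,\right]_{\theta = \theta^{*}} = 0
\]
```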
Current frontier models are likely in the second case with respect to their training data, which means that the outcome of training (in the best case[3]) is determined by two factors:
- What kinds of functions can capture the dataset at various error rates
- What kinds of functions can be implemented in the chosen model architecture with small norm
Characterizing the first factor for frontier LLM training datasets is essentially asking what kind of function generalized intelligence is, so insofar as LLMs are the best implementation we can find of this function it may be circular to try to study it in order to understand LLMs. But this is not certain! Studying human cognition in more formal detail could be helpful here, although it is possible human intelligence is a distinct class of intelligence with only surface similarities to LLM "intelligence".
The second factor is difficult to study without specifying the class of function more clearly, which introduces the same problem. However, we may be able to at least learn something about what generalized intelligence is and isn't based on the limitations of what functions can be implemented on model architectures that seem to possess it.
The unfortunate possibility is that intelligence may be irreducibly complex — an emergent behavior that cannot be fully broken down into components that we can understand.
[0] The Anthropic papers do not capture inference paths that cause changes in attention, and the results they show are (by their own admission) somewhat cherry-picked for presenting clearer interpretations.
[1] Much discussion of LLM reasoning capabilities looks at specific architectural details, such as whether the model is decoder-only. However, the theoretical results we have thus far are mostly independent of these details[4], and we are reduced to hand-wavy arguments about why one architecture or another should learn faster or overfit less easily.
[2] Technically $\theta^*$ does not need to be a local minimum if we rely on early stopping during training, but this is unstable and does not seem to be the way frontier models are trained.
[3] Here I'm assuming some choice of regularization parameters allows separating $\theta^*$ from boundary solutions with norm $w_{\min}$, without penalizing it so much that it is no longer a local minimum.
[4] Different architectures sometimes imply using training data structured in a different way, which may end up having more impact on the actual performance than theoretical results about the architectures themselves might imply.