Nobody Knows Anything About These Models
We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence. - Noam Shazeer, GLU Variants Improve Transformer, Feb 2020
We badly need a better theoretical understanding of LLM learning and capabilities.
Before I begin, you should know several key facts:
- Under plausible assumptions, any sufficiently large LLM (regardless of exact architecture[0]) will eventually learn the smallest-norm solution that reproduces the training data to within some error tolerance $\epsilon$ (see the sketch after this list).
- There are certain tasks for which we know they cannot learn algorithms that generalize, e.g. recognizing the language consisting of all strings of the form $a^n b^n c^n$.
- Most LLMs are trained on data that we do not want them to reproduce (almost) exactly, e.g. if there is only one sentence in their training data that starts with "My favorite thing in this whole wide world is " we do not want them to assign 99.99% probability to it ending with "apples".
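The first fact is easiest to see in a toy setting. Below is a minimal sketch (my own illustration, not taken from any particular paper): plain gradient descent from a zero initialization on an overparameterized linear model converges to the minimum-norm solution that exactly reproduces the training targets, the linear analogue of the smallest-norm behaviour claimed above.

```python
# Toy illustration: gradient descent from zero on an underdetermined least-squares
# problem converges to the minimum-norm interpolating solution.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_params = 20, 100          # overparameterized: more weights than examples
X = rng.normal(size=(n_samples, n_params))
y = rng.normal(size=n_samples)

# Plain gradient descent on mean squared error, initialized at zero.
w = np.zeros(n_params)
lr = 1e-2
for _ in range(50_000):
    grad = X.T @ (X @ w - y) / n_samples
    w -= lr * grad

# Closed-form minimum-norm interpolating solution (Moore-Penrose pseudoinverse).
w_min_norm = np.linalg.pinv(X) @ y

print("max train residual:", np.max(np.abs(X @ w - y)))                  # ~0
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))  # ~0
```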
This seems to suggest that LLMs cannot work nearly as well as they do. But in practice we have found that:
- With carefully tuned regularization weights and limited training to avoid overfitting, models tend to learn much more generalizable parameters that perform well outside of the training data.
- Rather than trying to have the model natively recognize a tricky formal language class, giving it access to e.g. a Python interpreter and letting it write programs which the interpreter then executes works quite well (a minimal sketch of this pattern follows below).
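To make the second point concrete, here is a minimal sketch of the write-a-program pattern. `llm_write_program` is a hypothetical stand-in for an actual model call; the point is that a short, exactly correct recognizer for strings of the form $a^n b^n c^n$ is trivial to execute once written, even though learning to recognize that language in-weights is out of reach.

```python
# Sketch of "let the model write the program". The model call is stubbed out;
# the host simply executes whatever source the model produces.

def llm_write_program(task: str) -> str:
    # Hypothetical stand-in for a real model call; returns source a model might emit.
    return '''
import re

def recognize(s: str) -> bool:
    """Return True iff s consists of n a's, then n b's, then n c's."""
    m = re.fullmatch(r"(a*)(b*)(c*)", s)
    return m is not None and len(m.group(1)) == len(m.group(2)) == len(m.group(3))
'''

namespace = {}
exec(llm_write_program("recognize a^n b^n c^n"), namespace)
recognize = namespace["recognize"]

print(recognize("aaabbbccc"))  # True
print(recognize("aabbbcc"))    # False
```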
Unfortunately, both of these techniques place LLMs well outside the reach of current formal analysis techniques. What would formal analysis have to look like for it to apply to LLMs as they are actually trained and used? First, we should ask what exactly we are trying to do. For frontier LLMs, the goal is to train them to acquire some sort of generalizable intelligence which goes well beyond the training data. But what does this look like in more formal terms?
For any set of training data $D$ and model $M$ with bounded parameters and bounded precision, there is a minimum error $\epsilon_{\min}$ with which $M$ can learn $D$, and some minimum parameter norm $N_{\min}$ at which this error can be achieved (in many cases this will be the maximum possible value of $\lVert\theta\rVert$ given the parameter bounds). In order to be able to learn the desired parameter set $\theta^*$ which produces generalizable intelligence rather than exact memorization of $D$, one of the following must hold:
- $\theta^*$ achieves error $\epsilon_{\min}$, which would imply $\lVert\theta^*\rVert = N_{\min}$
- $\lVert\theta^*\rVert < N_{\min}$ and error is locally minimized at $\theta^*$[1]
Intuitively these correspond to two very different situations:
- The model is too small to approximately memorize the data and can only approximate it using generalizable methods
- The model is overparameterized and we are relying on regularization to favor generalizable methods over memorization
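Written out a bit more explicitly (this is my notation, filling in details the argument above leaves implicit), the quantities and the two cases are roughly:

```latex
% My notation, not necessarily the original: L_D(theta) is training error on D,
% lambda is a regularization weight, theta* is the desired generalizing parameter set.
\epsilon_{\min} \;=\; \min_{\theta}\, \mathcal{L}_D(\theta),
\qquad
N_{\min} \;=\; \min\bigl\{\, \lVert\theta\rVert \;:\; \mathcal{L}_D(\theta) = \epsilon_{\min} \,\bigr\}

% Case 1: the model is too small to memorize, so the generalizing solution is the best available.
\mathcal{L}_D(\theta^*) = \epsilon_{\min}
\quad\text{and}\quad
\lVert\theta^*\rVert = N_{\min}

% Case 2: the model is overparameterized; regularization has to keep training at theta*.
\lVert\theta^*\rVert < N_{\min}
\quad\text{and}\quad
\theta^* \text{ locally minimizes } \mathcal{L}_D(\theta) + \lambda\,\lVert\theta\rVert^{2}
```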
Current frontier models are likely in the second case with respect to their training data, which means that the outcome of training (in the best case[2]) is determined by two factors:
- What kinds of functions can capture the dataset at various error rates
- What kinds of functions can be implemented in the chosen model architecture with small norm
Both of these questions are well beyond the capabilities of current techniques to analyze for the datasets and model topologies used in practice.
Our theoretical understanding of datasets is generally based on one of two approaches:
- Information-theoretic arguments that treat datasets as random samples from some distribution
- Analytic arguments that treat datasets as noisy clusters around linear subspaces or manifolds
The information-theoretic arguments tend to produce very loose bounds that provide no insight into real use cases, because practical datasets such as text corpora are so far from any standard statistical distribution in ways we are not currently able to capture. The analytic arguments have not been successfully applied to these datasets at all, because we have not found a better way to characterize a manifold approximation of them than empirically deriving one from a model trained on them, which rarely tells us anything generalizable beyond that specific trained model.
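To show what the analytic picture amounts to in practice, here is a small sketch: approximate a dataset as noise around a low-dimensional linear subspace and ask how much of its variance that subspace explains. The embeddings here are synthetic stand-ins for features of real text (which would themselves come from a trained model, exactly the circularity complained about above).

```python
# Sketch of the "noisy cluster around a linear subspace" view of a dataset.
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in: 5,000 points near a 10-dimensional subspace of R^512.
basis = rng.normal(size=(10, 512))
embeddings = rng.normal(size=(5000, 10)) @ basis + 0.1 * rng.normal(size=(5000, 512))

centered = embeddings - embeddings.mean(axis=0)
_, singular_values, _ = np.linalg.svd(centered, full_matrices=False)
explained = np.cumsum(singular_values**2) / np.sum(singular_values**2)

print("variance explained by the top 10 directions:", round(float(explained[9]), 3))
```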
The situation on the model side is not much better:
- Most theoretical results concern very simple functions, which generalizable intelligence clearly is not.
- Most theoretical results are only able to handle single-layer models, while practical models generally employ dozens of layers.
- All guarantees go out the window once we allow the model to write programs which are executed by an interpreter. We have been completely unable to bridge the gap between the programs the model itself can encode and the programs it can learn to write.
This is a deeply uncomfortable situation, and one it seems unlikely we will be able to rectify. But we should probably try harder.
- ^
Much discussion of LLM reasoning capabilities looks at specific architectural details, such as whether the model is decoder-only. However, the theoretical results we have thus far are mostly independent of these details[3], and we are reduced to hand-wavy arguments about why one architecture or another should learn faster or overfit less easily.
- ^
Technically we don't need $\theta^*$ to be a local minimum if we rely on early stopping during training, but this is unstable and does not seem to be the way frontier models are trained.
- ^
Here I'm assuming some choice of regularization parameters allows separating $\theta^*$ from boundary solutions with norm $N_{\min}$, without penalizing it so much that it is no longer a local minimum.
- ^
Different architectures sometimes imply using training data structured in a different way, which may end up having more impact on the actual performance than theoretical results about the architectures themselves might imply.