Some of my thoughts, filtered slightly for public consumption.

Nobody Knows Anything About These Models

"We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence." - Noam Shazeer, "GLU Variants Improve Transformer", Feb 2020

We badly need a better theoretical understanding of LLM learning and capabilities.

Before I begin, you should know several key facts:

This seems to suggest that LLMs cannot work nearly as well as they do. But in practice we have found that:

Unfortunately, both of these techniques place LLMs well outside the reach of current formal analysis. What would such analysis have to look like for it to apply to LLMs as they are actually trained and used? First, we should ask what exactly we are trying to do. For frontier LLMs, the goal is to train them to acquire some sort of generalizable intelligence that goes well beyond the training data. But what does this look like in more formal terms?

For any set of training data $D$ and model $M$ with bounded parameters and bounded precision, there is a minimum error $\epsilon \ge 0$ with which $M$ can learn $D$, and some minimum parameter norm $N$ at which this error can be achieved (in many cases this will simply be the maximum norm allowed). In order to be able to learn the desired parameter set $P$, the one which produces generalizable intelligence rather than exact memorization of $D$, one of the following must hold:
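(Side note: one way to make $\epsilon$ and $N$ precise is to fix a training loss $L$ and the bounded, finite-precision parameter space $\Theta$, both just shorthand for the setup above rather than anything the argument depends on, and write $\epsilon = \min_{\theta \in \Theta} L(\theta; D)$ and $N = \min\{\|\theta\| : \theta \in \Theta,\ L(\theta; D) = \epsilon\}$.)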

Intuitively these correspond to two very different situations:

Current frontier models are likely in the second case with respect to their training data, which means that the outcome of training (in the best case[2]) is determined by two factors:

Both of these questions are well beyond the capabilities of current techniques to analyze for the datasets and model topologies used in practice.
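To give a flavor of what an answer would even need to look like, consider the simplest regularized objective, $L(\theta; D) + \lambda\|\theta\|^2$ for some weight decay $\lambda$ (a sketch in my notation, not a claim about how any particular model is trained). Separating $P$ from the boundary solutions of norm $N$, in the sense of footnote [3], then amounts to finding a $\lambda$ with

$$L(P; D) + \lambda\|P\|^2 \;<\; \epsilon + \lambda N^2$$

while $P$, or a point near it, remains a local minimum of the regularized objective. Even stating when such a $\lambda$ exists requires knowing things about $D$ and the loss landscape of $M$ that we currently have no way to compute.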

Our theoretical understanding of datasets is generally based on one of two approaches:

The information-theoretic arguments tend to produce very loose bounds that provide no insight into real use cases, because practical datasets such as text corpora are far from any standard statistical distribution, in ways we are not currently able to capture. The analytic arguments have not been successfully applied to these datasets at all, because we have no better way to characterize a manifold approximation of such a dataset than empirically deriving one from a model trained on it, and that rarely tells us anything that generalizes beyond the specific trained model.
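As a concrete example of what "empirically deriving a manifold approximation" usually means in practice, here is a minimal sketch of the TwoNN intrinsic-dimension estimator (Facco et al., 2017) applied to embedding vectors. Everything here is illustrative: the random matrix stands in for hidden states you would actually collect from a trained model on samples of $D$.

```python
import numpy as np

def twonn_intrinsic_dimension(X: np.ndarray) -> float:
    """Estimate intrinsic dimension with the TwoNN method (Facco et al., 2017),
    which fits the ratio of each point's second- to first-nearest-neighbor distance."""
    n = X.shape[0]
    # Squared pairwise distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)              # clip small negatives from round-off
    np.fill_diagonal(d2, np.inf)          # exclude each point's distance to itself
    dists = np.sqrt(np.sort(d2, axis=1))  # per-point neighbor distances, ascending
    r1, r2 = dists[:, 0], dists[:, 1]
    mu = r2 / np.maximum(r1, 1e-12)       # guard against duplicate points
    # Under the TwoNN model, mu follows a Pareto(d) law; the MLE of d is n / sum(log mu).
    return n / float(np.sum(np.log(mu)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder "embeddings": a 10-dimensional Gaussian linearly embedded in 768
    # dimensions, standing in for hidden states from a trained model; the estimate
    # should come out near 10.
    latent = rng.normal(size=(2000, 10))
    X = latent @ rng.normal(size=(10, 768))
    print(f"estimated intrinsic dimension: {twonn_intrinsic_dimension(X):.1f}")
```

Even when such an estimate behaves well, it characterizes the representation learned by one particular trained model, which is exactly the limitation described above.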

The situation on the model side is not much better:

This is a deeply uncomfortable situation, and one it seems unlikely we will be able to rectify. But we should probably try harder.


  1. ^

    Much discussion of LLM reasoning capabilities looks at specific architectural details, such as whether the model is decoder-only. However, the theoretical results we have thus far are mostly independent of these details[4], and we are reduced to hand-wavy arguments about why one architecture or another should learn faster or overfit less easily.

  2. ^

    Technically we don't need $P$ to be a local minimum if we rely on early stopping during training, but this is unstable and does not seem to be the way frontier models are trained. (A minimal sketch of what I mean by early stopping appears after these footnotes.)

  3. ^

    Here I'm assuming some choice of regularization parameters allows separating $P$ from boundary solutions with norm $N$, without penalizing it so much that it is no longer a local minimum.

  4. ^

    Different architectures sometimes call for training data structured in a different way, which may end up having more impact on actual performance than theoretical results about the architectures themselves would suggest.
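To make footnote 2 concrete, here is a minimal sketch of early stopping on a toy over-parameterized regression problem. The model, learning rate, and patience value are arbitrary choices for illustration, not a description of how any real model is trained.

```python
import numpy as np

def train_with_early_stopping(X, y, X_val, y_val, lr=0.01, patience=20, max_steps=5000):
    """Gradient descent on squared error, stopping when validation loss has not
    improved for `patience` consecutive steps and restoring the best parameters
    seen, rather than running all the way to a local minimum."""
    w = np.zeros(X.shape[1])
    best_w, best_val, since_best = w.copy(), np.inf, 0
    for step in range(max_steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
        val_loss = np.mean((X_val @ w - y_val) ** 2)
        if val_loss < best_val:
            best_w, best_val, since_best = w.copy(), val_loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # stop early: the returned parameters need not be a local minimum
    return best_w, best_val

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy over-parameterized regression: 200 features, 100 training points, so the
    # training loss can be driven to ~0 (memorization) but validation loss cannot.
    X, X_val = rng.normal(size=(100, 200)), rng.normal(size=(100, 200))
    true_w = np.zeros(200)
    true_w[:5] = 1.0
    y = X @ true_w + 0.1 * rng.normal(size=100)
    y_val = X_val @ true_w + 0.1 * rng.normal(size=100)
    w, val = train_with_early_stopping(X, y, X_val, y_val)
    print(f"best validation loss: {val:.3f}")
```

The point of the sketch is only that the returned parameters are whatever looked best on held-out data at some intermediate step, not a local minimum of the training loss.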