Some of my personal thoughts. The views expressed here have not been reviewed or approved by Anthropic or any of my previous employers, and do not necessarily reflect their views.

Inventing Transformers

What does the tech tree for transformers look like? The full history is long and complicated, with ideas branching and converging, fading and reemerging. But in a narrow sense each element of the architecture can be traced back to a relatively quick sequence of innovations from 2012 to 2017. Here I build up the decoder-only transformer architecture as each piece was invented.

2012 - The Prequel: AlexNet

AlexNet first showed that deep learning could outperform the then-dominant ML methods, and introduced the basic recipe for training such models. Most of its technical novelties were not carried over into the transformer architecture, with one exception: it was the first to split layers across multiple GPUs, a practice that became necessary for all sizeable transformer models. It also popularized ReLU activations and dropout, neither of which it invented, and the transformer would later adopt both.

2013 - Word2vec

Word2vec did not invent the concept of mapping words to continuous vector embeddings, but before it the standard was still to use sparse n-grams. By making the model extremely simple, the authors were able to train it on far more data than previous methods. It performed so well that it made embeddings the default for new methods.

2014 - Seq2seq

Seq2seq made neural networks the premier method for NLP tasks like translation. It used an existing architecture—the Long Short-Term Memory network—to allow information to flow from earlier tokens, which overcame the limitations of vanilla RNNs but still broke down after a few sentences.

2014 - Adam

Training deep neural networks using the optimization techniques of the time was notoriously unstable and required careful bespoke tuning of hyperparameters with every architecture, data or algorithm change. Adam improved on Stochastic Gradient Descent with momentum in two ways. First, it was more stable, allowing researchers to train on larger and more diverse datasets. Second, by decoupling step size from gradient scale, it let researchers easily train novel models and distinguish fundamental improvements from lucky choices of hyperparameters.
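The second point can be checked numerically. Below is a minimal sketch, assuming the standard Adam update with the paper's default hyperparameters; the function names and the made-up gradients are mine, purely for illustration. Multiplying every gradient by 1000 (as a rescaled loss would) makes SGD-with-momentum steps 1000 times larger, while Adam's steps barely change.

```python
import numpy as np

def sgd_momentum_steps(grads, lr=0.01, momentum=0.9):
    """Return the update applied at each step by SGD with momentum."""
    velocity = np.zeros_like(grads[0])
    updates = []
    for g in grads:
        velocity = momentum * velocity + g
        updates.append(-lr * velocity)
    return np.array(updates)

def adam_steps(grads, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the update applied at each step by Adam (with bias correction)."""
    m = np.zeros_like(grads[0])  # running mean of gradients
    v = np.zeros_like(grads[0])  # running mean of squared gradients
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Dividing by sqrt(v_hat) cancels the overall gradient scale.
        updates.append(-lr * m_hat / (np.sqrt(v_hat) + eps))
    return np.array(updates)

rng = np.random.default_rng(0)
grads = rng.normal(size=(50, 3))  # 50 steps of made-up gradients, 3 parameters
scaled = 1000.0 * grads           # same gradients from a loss scaled by 1000

def mean_step(updates):
    return np.abs(updates).mean()

sgd_ratio = mean_step(sgd_momentum_steps(scaled)) / mean_step(sgd_momentum_steps(grads))
adam_ratio = mean_step(adam_steps(scaled)) / mean_step(adam_steps(grads))
print(f"SGD step growth: {sgd_ratio:.1f}x, Adam step growth: {adam_ratio:.4f}x")
```

With SGD the learning rate has to be retuned whenever the gradient scale changes; with Adam the step size stays close to `lr` regardless.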

2014 - Attention

Letting every token interact with every other through a fully connected network would cost O(n²d²) per layer—quadratic in both the sequence length n and the model dimension d. Seq2seq instead reduced this to a cost linear in sequence length, by compressing all input information into a single fixed-size vector. This rapidly broke down as sequences became longer than a few sentences. Attention provided a middle ground: it preserves the all-to-all information flow of a fully connected network, but by consulting O(n) key and value vectors for each input (or previous output) token it does so at only O(n²d) per layer—still quadratic in sequence length, but linear in the model dimension, a factor of d cheaper. Subsequent deep learning models started incorporating attention modules with positive results.
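To make the cost comparison concrete, here is a rough sketch of a single attention step. One loud assumption: the 2014 attention mechanism scored token pairs with a small learned network, whereas this sketch uses the dot-product scoring of the later transformer because it is shorter to write. The counting helpers are mine; they drop constant factors and learned projections and only count the token-mixing work.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Dot-product attention over n tokens of dimension d."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)  # (n, n): n*n dot products of length d
    weights = softmax(scores, axis=-1)      # each row sums to 1
    return weights @ values, weights        # weighted sum: another n*n*d multiply-adds

def mult_adds_fully_connected(n, d):
    # One dense layer mapping all n*d inputs to all n*d outputs.
    return (n * d) ** 2

def mult_adds_attention(n, d):
    # Score matrix plus weighted sum of values.
    return 2 * n * n * d

n, d = 6, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))
out, weights = attention(x, x, x)

ratio = mult_adds_fully_connected(512, 64) // mult_adds_attention(512, 64)
print(out.shape, weights.shape, ratio)
```

Every token still sees every other token through the n-by-n `weights` matrix, but each interaction costs on the order of d operations rather than d².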

2015 - ResNet

All of these models were relatively shallow: seq2seq used 4 layers for each of the encoder and decoder. Deeper networks would often fail to learn at all because their initialization left them too far off the useful distribution, or would exhibit poor stability. This was because it was hard to effectively pass information through many layers, as each tended to distort and drop information from the previous ones. It is possible in principle for an MLP module with ReLU activation to express the identity function over a reasonable range of inputs, by applying a large positive shift to each component before the activation function and reversing it with a shift afterwards. However, this means large, awkward gradients from the start and adds another implicit constraint that the optimization must try to preserve.
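The shift trick is easy to see in a few lines. This is a toy sketch with hypothetical names, not anything a real network is written to do explicitly: it shows that the identity only holds for inputs above the negative of the shift.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def shifted_relu_identity(x, shift):
    """Shift up, apply ReLU, shift back down. Identity only where x > -shift."""
    return relu(x + shift) - shift

x = np.array([-3.0, -0.5, 0.0, 2.0, 7.0])

inside = shifted_relu_identity(x, shift=5.0)   # every input is above -5
outside = shifted_relu_identity(x, shift=1.0)  # -3.0 is below -1, so it gets clipped

print(inside)
print(outside)
```

The larger the range that must pass through unchanged, the larger the shift the network has to learn and then keep in place.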

ResNet reframed the role of the MLP module as applying a residual, so the output of each layer is the previous layer's output plus the MLP module's output. Thus an MLP initialized to 0 makes the layer act as the identity function, and since it is an architectural constant there is no more pressure to maintain this pass-through behavior during optimization. This enabled training models dozens or hundreds of layers deep.
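Here is a minimal sketch of that idea, assuming a two-layer ReLU MLP as the module. The names and sizes are mine, and zeroing only the MLP's output layer is my choice of how to get a zero output at initialization. A stack of 100 such residual layers passes its input through unchanged, while the same stack without the skip connection does not.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def residual_layer(x, params):
    # Output is the previous layer's output plus the MLP's correction.
    return x + mlp(x, *params)

def init_params(d, hidden, rng, zero_output=True):
    w1 = rng.normal(scale=d ** -0.5, size=(d, hidden))
    b1 = np.zeros(hidden)
    if zero_output:
        w2 = np.zeros((hidden, d))  # MLP outputs exactly 0 at initialization
    else:
        w2 = rng.normal(scale=hidden ** -0.5, size=(hidden, d))
    b2 = np.zeros(d)
    return (w1, b1, w2, b2)

rng = np.random.default_rng(0)
d, hidden, depth = 8, 32, 100
x = rng.normal(size=(4, d))

h_res = x
h_plain = x
for _ in range(depth):
    h_res = residual_layer(h_res, init_params(d, hidden, rng))
    h_plain = mlp(h_plain, *init_params(d, hidden, rng, zero_output=False))

print(np.array_equal(h_res, x), np.allclose(h_plain, x))
```

The identity here comes from the `x +` in `residual_layer`, not from any learned weights, so training cannot break it.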

2017 - Attention is All You Need

This is the famous paper that put all of this together in the right order, using a few other small tricks that I glossed over above (ReLU, LayerNorm) and inventing a couple of its own. Positional encodings allow the model to effectively learn relationships between tokens that depend on relative positions, something that previously could only be done with more expensive network modules. Multi-headed attention provides multiple attention modules at each layer, which makes attention more expressive and allows multiple kinds of relationships to be learned at each layer for the same token.
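Both ideas fit in a short sketch. I am assuming the fixed sine/cosine encodings described in the paper (it also discusses learned ones), random weight matrices standing in for trained ones, and no masking; the function names are mine. The first check shows why sinusoidal encodings suit relative positions: the dot product between two position vectors depends only on their offset.

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Fixed sine/cosine position vectors; d is assumed to be even."""
    positions = np.arange(n)[:, None]
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    angles = positions * freqs           # (n, d/2)
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, wq, wk, wv, wo, num_heads):
    n, d = x.shape
    head_dim = d // num_heads

    def split(t):  # (n, d) -> (heads, n, head_dim)
        return t.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, n, n)
    weights = softmax(scores, axis=-1)   # one attention pattern per head
    heads = weights @ v                  # (heads, n, head_dim)
    merged = heads.transpose(1, 0, 2).reshape(n, d)  # concatenate the heads
    return merged @ wo, weights

n, d, num_heads = 16, 8, 2
rng = np.random.default_rng(0)
pe = sinusoidal_positions(n, d)

# Positions 3 and 7 are 4 apart, as are 10 and 14: same dot product.
same_offset = np.isclose(pe[3] @ pe[7], pe[10] @ pe[14])

tokens = rng.normal(size=(n, d)) + pe    # made-up embeddings plus positions
wq, wk, wv, wo = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(4))
out, weights = multi_head_attention(tokens, wq, wk, wv, wo, num_heads)
print(same_offset, out.shape, weights.shape)
```

Each head gets its own n-by-n `weights` matrix, which is what lets one layer track several kinds of relationship for the same token.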