What is the big deal with transformers?
I’m working through Vaswani et al. (2017) with a narrow question: what actually changed? Before transformers, most language models advanced through a sequence one step at a time, RNNs especially, carrying a running state forward and never revisiting earlier tokens directly. The transformer architecture breaks that paradigm: every position can look at every other position in the same layer. (The paper’s model is encoder–decoder; most models today keep only one side: a BERT-style encoder or a GPT-style decoder.)
So, what is the big deal with transformers?
The limitation shows up most on long sentences. RNNs were effectively stuck reading left to right, one · word · at · a · time. At each step they only had a running summary of the previous tokens, so by the end of a long sequence you can imagine how difficult it gets to recall what was happening at the beginning. The transformer inverts this: every position can attend to the full sequence at once, reweighting which tokens matter most for the one being updated. (Self-attention over $n$ tokens costs $O(n^2)$ in sequence length: it is parallel per layer, but memory and compute still bite on long contexts.) On long-context tasks, validation curves can look like over training.
This mechanism is called ‘attention’, and it’s what enables the model to be ‘aware’ of the entire sequence at once. It can be thought of as a search, and the terminology used to describe the process sort of implies that. Each word produces three vectors:
- $Q$ (query): what a position is searching for
- $K$ (key): what each position advertises, to be matched against queries
- $V$ (value): the content a position contributes when attended to
Worth noting: in self-attention every position generates its own $Q$, $K$, and $V$ rows. The query is not a separate token to predict; it is the vector for the position currently doing the looking.
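To make the "every position gets its own rows" point concrete, here is a minimal sketch in plain Python. The embeddings and the projection matrices `W_Q`, `W_K`, `W_V` are made-up numbers for illustration; in a real model the projections are learned.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Three token embeddings, one row per position (values are made up).
X = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

# Hypothetical projection matrices; a trained model would learn these.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0]]
W_V = [[2.0, 0.0], [0.0, 2.0]]

# Each projection yields one row per position, so every position
# has its own query, its own key, and its own value.
Q = matmul(X, W_Q)
K = matmul(X, W_K)
V = matmul(X, W_V)

for pos, (q, k, v) in enumerate(zip(Q, K, V)):
    print(pos, q, k, v)
```

The same input matrix feeds all three projections, which is what makes this self-attention rather than attention over a second sequence.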
The output of this process is a weighted sum of values, weighted by how ‘relevant’ each word is to the query. The formula is:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
- $d_k$: the dimension of each key vector
The softmax function normalizes the match scores so they sum to 1. This is what allows the model to ‘focus’ on the most relevant parts of the sequence. The $1/\sqrt{d_k}$ term scales the dot product so it doesn’t grow too large as the dimension of the key vector increases: large scores push softmax into regions with tiny gradients, which slows learning down.
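Here is a rough sketch of scaled dot-product attention in plain Python, mostly to check my own understanding. The function names and the toy matrices are mine, not the paper's, and a real implementation would use batched matrix operations rather than loops.

```python
import math

def softmax(scores):
    """Normalize scores into weights that sum to 1 (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Match this query against every key, then scale by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        all_weights.append(weights)
        # The output row is the weighted sum of the value rows.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs, all_weights

# Toy inputs: each query matches the key at its own position best.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, weights = attention(Q, K, V)

# Why scale: with d_k = 64, a raw score gap of 32 becomes a gap of 4.
raw = [0.0, 32.0]
scaled = [s / math.sqrt(64) for s in raw]
saturated = softmax(raw)      # nearly all weight on the larger score
softer = softmax(scaled)      # the smaller score keeps a visible share
```

In the unscaled case the smaller score ends up with a weight very close to zero, which is the saturated regime where gradients get tiny.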
How many heads?
Eight, in the original paper. Multi-head attention is a technique that runs this process several times in parallel, each “head” with its own learned projections. The point is that different heads can specialize (one might track grammatical subjects, another might track pronoun references), whereas a single head would just average all of that into mush. Each head works on a smaller slice of the dimensions ($d_k = d_{\text{model}}/h$), so the total compute stays roughly the same as one full-size head.
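The dimension bookkeeping is easy to sketch. The numbers below (`d_model = 512`, `h = 8`) are the base configuration from the paper; the helpers `split_heads` and `merge_heads` are hypothetical names of mine, and the sketch deliberately leaves out the learned per-head projections and the final output projection.

```python
d_model = 512        # model width in the original paper's base model
h = 8                # number of heads in the original paper
d_k = d_model // h   # width each head works with

def split_heads(vec, num_heads):
    """Split one position's vector into equal slices, one per head."""
    size = len(vec) // num_heads
    return [vec[i * size:(i + 1) * size] for i in range(num_heads)]

def merge_heads(heads):
    """Concatenate per-head vectors back into one full-width vector."""
    return [x for head in heads for x in head]

vec = [float(i) for i in range(d_model)]
heads = split_heads(vec, h)
restored = merge_heads(heads)

# One dot product at full width costs d_model multiplications;
# h dot products at width d_k cost h * d_k, which is the same number.
full_width_cost = d_model
multi_head_cost = h * d_k
```

So splitting into heads changes what each dot product can look at, not how much arithmetic the scoring step does.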
Positional encoding
Since the model looks at every word at once, it has no built-in sense of word order: “SGA really flops” is effectively the same as “flops SGA really”. The solution is to add positional encoding, which basically tags each word with “I’m word number 5”, and so on. This provides order without having to process the words sequentially. (Vaswani et al. use fixed sinusoids; many later models learn position embeddings instead, with similar results in practice.)
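For the fixed-sinusoid variant, a minimal sketch looks like this. The formula (sine on even dimensions, cosine on odd ones, with a base of 10000) is the one from the paper; the 4-dimensional toy embeddings and the helper names are made up for illustration.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position: sin on even dimensions, cos on odd ones."""
    pe = []
    for k in range(0, d_model, 2):  # k is the even dimension index (2i in the paper)
        angle = pos / (10000 ** (k / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]  # trims the extra value if d_model is odd

def add_positions(embeddings):
    """Add each position's encoding to its token embedding, element by element."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(vec, positional_encoding(pos, d_model))]
        for pos, vec in enumerate(embeddings)
    ]

# Made-up embeddings for a three-word vocabulary.
toy = {
    "SGA":    [1.0, 0.0, 0.0, 0.0],
    "really": [0.0, 1.0, 0.0, 0.0],
    "flops":  [0.0, 0.0, 1.0, 0.0],
}
a = add_positions([toy[w] for w in ["SGA", "really", "flops"]])
b = add_positions([toy[w] for w in ["flops", "SGA", "really"]])
```

The word "SGA" now gets a different input vector at position 0 (in `a`) than at position 1 (in `b`), so the two orderings are no longer interchangeable.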
Computationally speedy
Compared to RNN architectures, the transformer computes all words at once, enabling more parallelization and less training time. An RNN has to finish one token before it can start the next, so it can’t fully use a GPU’s parallel cores; the transformer maps onto that hardware much more cleanly.
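A toy way to see the difference in data dependencies, assuming scalar inputs and hand-picked weights (none of this is from the paper): the RNN loop cannot compute step 3 before step 2, while each attention output depends only on the inputs and can be computed in any order.

```python
import math

def rnn_states(xs, w=0.5, u=1.0):
    """Toy scalar RNN: each state needs the previous state, so the loop is ordered."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w * h + u * x)
        states.append(h)
    return states

def attend_one(i, xs):
    """Toy self-attention output for position i, computed from the inputs alone."""
    scores = [xs[i] * x for x in xs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * x for e, x in zip(exps, xs))

xs = [0.1, 0.4, -0.2, 0.3]

states = rnn_states(xs)

# Positions can be processed forwards, backwards, or concurrently
# and the results are identical, because no output feeds another.
in_order = [attend_one(i, xs) for i in range(len(xs))]
reverse_order = {i: attend_one(i, xs) for i in reversed(range(len(xs)))}
```

That independence between positions is what lets the work be spread across parallel hardware within a layer.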