What is the big deal with transformers?

Nearly everything currently called AI (the chatbots, the code assistants, the embeddings under every search box) runs on the architecture from one 2017 paper. I’m working through that paper (Vaswani et al., 2017) with a narrow question: what actually changed? Not attention itself: translation models had been bolting attention onto RNNs since Bahdanau et al. (2015), letting the decoder glance back over the source sentence as it generated. What the paper threw out was the recurrence underneath. The title is literal: attention is all you need, no recurrent loop required. Every position can look at every other position in the same layer.

The original design is an encoder–decoder, but most models since keep only one side: the encoder lineage that BERT (Devlin et al., 2019) established for understanding tasks, and the decoder lineage behind GPT (Radford et al., 2018).¹

¹ One asterisk on “every position at once”: the decoder masks future positions, so each token attends only to itself and what precedes it. That’s what makes generation possible: GPT never gets to peek at its own future.
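The decoder's masking of future positions can be pictured as a lower-triangular matrix. This is a minimal NumPy sketch, not code from the paper; `causal_mask` is a hypothetical helper name, and the sequence length of 4 is arbitrary.

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Boolean (n, n) matrix: entry [i, j] is True where position i may attend to position j."""
    # Lower triangle (including the diagonal): each position sees itself and earlier positions.
    return np.tril(np.ones((n, n), dtype=bool))

mask = causal_mask(4)
print(mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

Row 0 is the first token, which can see only itself; the last row sees the whole prefix. In a decoder, scores at the zero entries are typically set to negative infinity before normalization so they receive no weight.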

So, what is the big deal with transformers?

The limitation of recurrence shows up most on long sentences. RNNs were effectively stuck reading left to right, one · word · at · a · time, so anything the first word has to say to the last word gets relayed through every hidden state in between. By the last word, the first is a rumor. The transformer inverts this: every position can attend to the full sequence at once,² reweighting which tokens matter most for the one being updated. Between any two positions, however far apart, the path is a single hop.

² Self-attention over $n$ tokens costs $O(n^2)$ in time and memory, because every token is scored against every other. The work parallelizes within a layer, but memory and compute still bite on long contexts.

Figure 1. How far a signal travels between the first and last word. An RNN relays it through every hidden state in between — the path grows with the sequence — while self-attention connects the two positions directly, no matter how far apart they sit.
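The trade described above can be put into numbers. This sketch counts hops and score-matrix entries for a few sequence lengths; the function names are hypothetical, and it counts structure only, not real runtime.

```python
def rnn_hops(n_tokens: int) -> int:
    """Recurrent steps a signal crosses to get from the first token to the last."""
    return n_tokens - 1

def attention_hops(n_tokens: int) -> int:
    """Self-attention links any two positions directly within one layer."""
    return 1

def attention_score_count(n_tokens: int) -> int:
    """Entries in the n-by-n score matrix, the source of the quadratic cost."""
    return n_tokens * n_tokens

for n in (10, 100, 1000):
    print(n, rnn_hops(n), attention_hops(n), attention_score_count(n))
# 10 9 1 100
# 100 99 1 10000
# 1000 999 1 1000000
```

The path length stays at one hop while the number of pairwise scores grows with the square of the sequence length.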

This mechanism is called attention, and it’s what enables the model to be aware of the entire sequence at once. It can be thought of as a search, and the terminology (query, key, value) comes from that analogy. Each word produces three vectors:

$Q$ (query) — what a position is searching for
$K$ (key) — what each position advertises, to be matched against queries
$V$ (value) — the content a position contributes when attended to

Worth noting: in self-attention every position generates its own $Q$, $K$, and $V$ rows. The query is not a separate token to predict; it is the vector for the position currently doing the looking.
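To make "every position generates its own rows" concrete, here is a small NumPy sketch. The sizes are made up, and the random matrices `W_q`, `W_k`, `W_v` stand in for projections that a real model would learn; none of these names come from the post.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 16, 8  # hypothetical sizes, chosen only for illustration

# One embedding row per position in the sentence.
X = rng.normal(size=(n_tokens, d_model))

# Stand-ins for learned projection matrices (assumption: random here).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# The same input X yields all three matrices, so every position has a Q, K, and V row.
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Row i of the score matrix: how well position i's query matches every position's key.
scores = Q @ K.T
print(Q.shape, K.shape, V.shape, scores.shape)
# (5, 8) (5, 8) (5, 8) (5, 5)
```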

The output of this process is a weighted sum of values, weighted by how relevant each word is to the query. The attention function is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

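The softmax function normalizes each query's match scores so they are positive and sum to 1, turning them into weights over the sequence. This is what allows the model to focus on the most relevant parts of the sequence. The $\sqrt{d_k}$ term, with $d_k$ the dimension of each key vector, scales the dot product so it doesn’t grow too large as that dimension increases: if the components of a query and a key are independent with mean 0 and variance 1, their dot product has variance $d_k$, and dividing by $\sqrt{d_k}$ brings it back to 1. Large scores push softmax into regions with tiny gradients, which slows learning down.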
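The formula translates almost line for line into NumPy. This is a minimal single-head sketch with no batching and no masking; the function names and the random inputs are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax: subtract the max before exponentiating."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one sequence and one head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, n) match scores, scaled
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted sum of value rows

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 positions, d_k = 8 (hypothetical sizes)
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.sum(axis=-1))
```

Each row of `weights` is one position's distribution over the sequence, and the matching row of `output` is the blend of value vectors that distribution selects.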

Interactive figure: toy self-attention on one sentence. The selected word acts as the query, and darker shading marks the words it attends to most. In the captured state, “it” attends most to “Orion”.

How many heads?

Eight, in the original paper. Multi-head attention runs this process several times in parallel, each “head” with its own learned projections. The point is that different heads can specialize (one might track grammatical subjects, another pronoun references), whereas a single head would just average all of that into mush. Each head works on a smaller slice of the dimensions ($d_k = d_{\text{model}}/h$, which is $512/8 = 64$ in the paper), so the total compute stays roughly the same as one full-size head.
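A sketch of the split into heads, using the paper's sizes ($d_{\text{model}} = 512$, $h = 8$). The random weight matrices stand in for learned ones. The output projection `W_o` comes from the paper rather than from the text above, and the function name and the 5-token input are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    n, d_model = X.shape
    d_k = d_model // h  # each head gets a d_model / h slice

    def split_heads(M):
        # (n, d_model) -> (h, n, d_k): one smaller matrix per head
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n)
    weights = softmax(scores, axis=-1)                # one attention pattern per head
    heads = weights @ V                               # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # glue heads back together
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, h = 5, 512, 8
X = rng.normal(size=(n, d_model))
# Random stand-ins for learned weights, scaled down to keep scores moderate.
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(out.shape, weights.shape)
# (5, 512) (8, 5, 5)
```

The `weights` array holds eight separate 5-by-5 attention patterns, one per head, which is where the specialization would show up in a trained model.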

Interactive figure: the same query word through three heads at once. Each head attends to a different set of words (nearby words, the subject, the verbs), so the model tracks several relations in parallel.

Positional encoding

Since the model looks at every word at once, order isn’t baked into the math: to bare attention, dog bites man and man bites dog are the same bag of words, even though the order is the entire point. The fix is to add a positional encoding to each word’s embedding,³ which tags each word with where it sits (“I’m word number 5”) so the model gets order without reading the sentence in sequence.

³ Vaswani et al. use fixed sinusoids; many later models learn position embeddings instead, with similar results in practice.

Figure 2. Each dimension is a sine wave of a different wavelength. Read straight down one position (red) and the values where it crosses each wave form that position's signature — every position gets a different one.
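The sinusoidal scheme from Vaswani et al. fits in a few lines. In the paper, even dimensions use a sine and odd dimensions a cosine, with wavelengths set by the constant 10000. The function name and the sizes below are illustrative assumptions, and the sketch assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions: int, d_model: int) -> np.ndarray:
    """Return an (n_positions, d_model) matrix; row p is the signature of position p."""
    positions = np.arange(n_positions)[:, None]    # (n, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2), the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(n_positions=50, d_model=16)

# Position 0 has sin(0) = 0 in even dimensions and cos(0) = 1 in odd ones.
print(pe[0, :4])
# [0. 1. 0. 1.]
```

The encoding is added to the token embeddings, so with an embedding matrix `X` of the same shape the model's input would be `X + pe`.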

Computationally speedy

Compared to RNN architectures, the transformer computes all words at once, enabling more parallelization and less training time.⁴ An RNN has to finish one token before it can start the next, so it can’t fully use a GPU’s parallel cores; the transformer maps onto that hardware much more cleanly.

⁴ The paper’s own numbers: the base model reached 27.3 BLEU on WMT14 EN–DE for roughly $3.3 \times 10^{18}$ FLOPs, while GNMT spent $2.3 \times 10^{19}$, about seven times as much, to score 24.6. Training the base model took about twelve hours on 8 P100s.

Figure 3. An RNN passes state down the chain — each step waits on the one before. A transformer connects every position to every other in a single layer, so they compute at once.
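The difference shows up in the shape of the code. Below, the recurrent update needs a Python loop because each state depends on the previous one, while the attention step is a handful of matrix operations over all positions. This is a toy sketch: the tanh recurrence is a generic simple RNN, the weights are random, and the attention part uses `X` directly as queries, keys, and values with no learned projections.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4  # hypothetical: 6 tokens, 4 dimensions
X = rng.normal(size=(n, d))

# Recurrent: step t cannot start until step t - 1 has produced h.
W_h = rng.normal(size=(d, d)) * 0.5
W_x = rng.normal(size=(d, d)) * 0.5
h = np.zeros(d)
states = []
for x_t in X:
    h = np.tanh(h @ W_h + x_t @ W_x)  # depends on the previous h
    states.append(h)
rnn_states = np.stack(states)

# Self-attention: every position is handled by the same few matrix ops, no loop over time.
scores = X @ X.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attn_out = weights @ X

print(rnn_states.shape, attn_out.shape)
# (6, 4) (6, 4)
```

Both produce one output row per token, but only the second form hands the hardware all positions at once.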

So the big deal was never attention: Bahdanau et al. had that on arXiv in 2014. It was noticing that once every position can reach every other position in one hop, the recurrent loop underneath wasn’t load-bearing anymore. Throw it out, and what’s left parallelizes.

References

Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. 3rd International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1409.0473

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI technical report.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762
