[Updated on 2025-09-27]
How did we go from simple programs that count words to systems that can write poetry and code? It wasn’t one big achievement. It was a series of small engineering steps. We stopped trying to teach the computer grammar, and we started trying to compress the data.
The Evolution
In the beginning, it was very simple. We treated a document like a bag of words. We just counted them. If the word “entropy” appears 5 times, maybe the text is about physics. But this was crude in an obvious way. To the computer, “dog” and “puppy” were completely different things. It didn’t know they were related. We had no “meaning”, only counts.
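In code, the whole “model” was little more than a counter. A minimal sketch in pure Python (the document string is just a made-up example):

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Lowercase and split on whitespace: every word becomes a count,
    # with no notion of meaning or similarity between words.
    return Counter(text.lower().split())

doc = "the entropy of the system rises entropy never falls"
counts = bag_of_words(doc)
print(counts["entropy"])  # 2
```

“dog” and “puppy” would land in two unrelated buckets here, which is exactly the limitation described above.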
The first big step was Embeddings (Word2Vec). We decided to represent words as vectors of numbers. The idea is simple: you know a word by the company it keeps. If “king” and “queen” appear in similar sentences, their vectors should be close to each other.
This gave us the first interesting math result: you could do arithmetic on meanings.

vec(“king”) − vec(“man”) + vec(“woman”) ≈ vec(“queen”)
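You can check this kind of analogy mechanically: subtract, add, then find the nearest word by cosine similarity. A toy sketch with hand-picked 3-dimensional vectors (not learned embeddings, chosen purely for illustration):

```python
import math

# Toy 3-dimensional "embeddings" (hand-picked, not learned from data).
vecs = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.5, 0.9, 0.0],
    "woman": [0.5, 0.2, 0.7],
}

def cosine(a, b):
    # Cosine similarity: 1.0 means the vectors point the same way.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# king - man + woman: do the arithmetic, then find the closest word.
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
best = max(vecs, key=lambda word: cosine(vecs[word], target))
print(best)  # queen
```

Real systems do the same thing over tens of thousands of words and a few hundred dimensions; the arithmetic is identical.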
But there was a problem. These vectors were static. The word “bank” has the same vector if you are at a “river bank” or a “money bank.” The model didn’t care about the context.
To fix context, we tried RNNs (Recurrent Neural Networks). The idea makes sense: read the sentence like a human, one word at a time. The model keeps a small memory (hidden state) of what it read before.
But RNNs were painful to train. They forget things easily because the gradients vanish over long sentences. Also, they are slow: you cannot process the 10th word until you finish the 9th. GPUs hate this. GPUs want to do everything at the same time, not in a line.
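The recurrence itself is tiny; the loop is the problem. A scalar sketch (the weights `w_x` and `w_h` are made up for illustration) showing why step t must wait for step t−1:

```python
import math

def rnn_step(x, h, w_x, w_h):
    # h_new = tanh(w_x * x + w_h * h): the hidden state h is the model's
    # only memory of everything it has read so far.
    return math.tanh(w_x * x + w_h * h)

# The loop is the bottleneck: each step needs the previous step's
# hidden state before it can start, so nothing runs in parallel.
h = 0.0
for x in [1.0, 0.5, -0.3]:  # a toy "sentence" of three inputs
    h = rnn_step(x, h, w_x=0.8, w_h=0.5)
```

Each pass through `tanh` also squashes the signal, which is one intuition for why gradients shrink as sentences get long.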
This is why the Transformer won. It uses “Attention”. Instead of reading left-to-right, the model looks at every word at the same time. It calculates how much the word “bank” relates to every other word in the sentence to understand if it is a river or a building. It turned a sequential problem into a parallel problem.
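A minimal sketch of scaled dot-product attention, written with plain loops instead of matrix libraries for clarity (the token vectors are toy values, and queries, keys, and values are reused from the same vectors for simplicity):

```python
import math

def attention(queries, keys, values):
    # Scaled dot-product attention: every query scores every key,
    # and each position's output is a weighted mix of all the values.
    # No left-to-right pass, so all positions can run in parallel.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        exps = [math.exp(s) for s in scores]      # softmax numerators
        weights = [e / sum(exps) for e in exps]   # attention weights, sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Three toy 2-dimensional token vectors, used as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(tokens, tokens, tokens)
```

This is how “bank” can weigh “river” heavily in one sentence and “money” in another: the weights are recomputed from the actual context every time.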
Finally, we have the training objective: predict the next token. It sounds trivial, but to predict the next token well, the model has to compress grammar, facts, and even reasoning into its weights. In other words, it has to build a model of the world. That is where we are today.
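At the opposite extreme of the scale, the same objective can be sketched as a bigram table: count which token follows which, and predict the most frequent continuation. Real models replace the table with billions of weights, but the objective has the same shape (the training sentence here is just a toy):

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    # The crudest possible next-token predictor: count transitions.
    table = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        table[a][b] += 1
    return table

def predict_next(table, token):
    # Predict the continuation seen most often after `token` in training.
    return table[token].most_common(1)[0][0]

toks = "the model predicts the next token".split()
table = train_bigram(toks)
print(predict_next(table, "model"))  # predicts
```

The gap between this counter and a modern model is exactly the story above: embeddings, context, and attention are what let “predict the next token” require a model of the world instead of a lookup table.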