8.10 Three Different Transformer Architectures: Understanding, Generation, and Input-Output Conversion
In the previous sections, we looked separately at the Transformer Encoder and the Transformer Decoder.
- The core of the encoder is self-attention, which lets each token in a sequence see both left and right context at the same time, so it is well suited for understanding tasks.
- The core of the decoder is masked self-attention, which only lets each token see itself and the tokens before it, so it is well suited for autoregressive generation.
The original Transformer is a complete encoder-decoder structure: the encoder first understands the input sequence, and the decoder then generates the target sequence step by step based on the encoder output. At this point, a natural question is:
Do all Transformer models have to contain both an encoder and a decoder?
The answer is no. As later models developed, the Transformer was split into three common architectures: Encoder-only, Decoder-only, and Encoder-Decoder. They all derive from the Transformer, but they suit different tasks and are trained in different ways.
In this section, we compare these three structures side by side.
8.10.1 Starting from the Original Transformer
The original Transformer was designed for machine translation. Machine translation has an input sentence and an output sentence. For example:
English: I love deep learning.
Chinese: 我喜欢深度学习。
This kind of task naturally separates into two parts. The first part is understanding the source-language sentence:
\[ x_1, x_2, \dots, x_m \]
The second part is generating the target-language sentence:
\[ y_1, y_2, \dots, y_n \]
So the original Transformer uses an encoder-decoder architecture.
The encoder processes the source sentence:
\[ H = \operatorname{Encoder}(X) \]
The decoder generates the target sentence:
\[ p(y_t \mid y_{<t}, X) = \operatorname{Decoder}(y_{<t}, H) \]
Here, \(H\) is the contextual representation output by the encoder.
When generating each token, the decoder not only looks at the target tokens already generated, but also looks at the source-sentence representation through cross-attention. Therefore, the complete Transformer can be written as:
\[ X \rightarrow \operatorname{Encoder} \rightarrow H \rightarrow \operatorname{Decoder} \rightarrow Y \]
If the task is to take one input and generate another output, this structure is very natural. Later work showed, however, that the Transformer encoder and decoder can also be used separately.
8.10.2 Encoder-Only: Better for Understanding
Encoder-only models keep only the Transformer Encoder and do not use a decoder. The structure can be written simply as:
\[ X \rightarrow \mathrm{Transformer\ Encoder} \rightarrow H \]
where \(H\) is the contextual representation of every token:
\[ H = [h_1, h_2, \dots, h_n] \]
The key property of encoder-only models is:
Every token can see both the left and right context at the same time.
That is, in encoder self-attention, the \(i\)-th token can attend to:
\[ x_1, x_2, \dots, x_n \]
This includes both tokens on its left and tokens on its right. Therefore, encoder-only models are very suitable for understanding tasks, such as text classification, sentiment analysis, named entity recognition, sentence matching, extractive question answering, and embedding representation learning. These tasks usually do not require the model to generate a long text token by token; instead, they require the model to fully understand an existing input.
The most typical encoder-only model is BERT.
BERT stands for Bidirectional Encoder Representations from Transformers. The name already states its core idea: it uses the Transformer Encoder to learn bidirectional contextual representations.
For example, in the sentence:
The animal did not cross the street because it was tired.
when the model processes the word “it”, it can see the preceding context:
The animal did not cross the street because
and it can also see the following context:
was tired
This bidirectional context is very helpful for understanding tasks, because the meaning of a word often depends on information from both sides.
One of BERT’s pretraining tasks is masked language modeling. For example, some tokens in a sentence are replaced with [MASK]:
The animal did not cross the street because [MASK] was tired.
The model has to predict the masked word from the left and right context. This is different from an autoregressive language model, which can only predict the next word from left context:
\[ p(x_t \mid x_{<t}) \]
Masked language modeling is more like:
\[ p(x_i \mid x_{\setminus i}) \]
That is, the model predicts the masked token from all context except the current position. So BERT is not a model for generating text from left to right; it is better at understanding and representing existing text.
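To make the masked language modeling setup concrete, here is a minimal sketch of how one training example could be built. The helper name `mask_positions` and the whitespace tokenization are assumptions made for illustration. BERT's actual procedure is more involved: it selects about 15% of positions at random and does not always substitute [MASK] for the selected tokens.

```python
MASK = "[MASK]"

def mask_positions(tokens, positions):
    """Replace the tokens at `positions` with [MASK].

    Returns (masked_tokens, labels). labels[i] is the original token at a
    masked position and None elsewhere, so a loss would only be computed
    at the masked positions.
    """
    positions = set(positions)
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels

# Toy whitespace tokenization (an assumption; real models use subword tokenizers).
tokens = "The animal did not cross the street because it was tired .".split()
masked, labels = mask_positions(tokens, positions=[8])

print(" ".join(masked))
# The animal did not cross the street because [MASK] was tired .
print(labels[8])
# it
```

The model sees the whole masked sentence, including the words to the right of [MASK], and is trained to recover the label at that position.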
How, then, should we use the output of an encoder-only model?
We know that an encoder-only model outputs a sequence of contextual representations:
\[ h_1, h_2, \dots, h_n \]
Different tasks use different parts of these representations. For text classification, we usually take the representation of a special token, such as [CLS] in BERT:
\[ h_{\mathrm{[CLS]}} \]
and then attach a classification head:
\[ \hat{y} = \mathrm{Classifier}(h_{\mathrm{[CLS]}}) \]
For token-level tasks such as named entity recognition, we classify the representation of each token separately:
\[ \hat{y}_i = \mathrm{Classifier}(h_i) \]
For sentence embeddings, we can pool the token representations, for example by average pooling:
\[ h_{\mathrm{avg}} = \frac{1}{n} \sum_{i=1}^{n} h_i \]
So an encoder-only model does not generate text directly. It produces a set of contextual representations, and the task head we attach afterward depends on the specific task.
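The three ways of using the encoder output described above can be sketched with plain Python lists. All numbers below are made up, and `linear_classifier` is a hypothetical toy head, not the head of any real model. The sketch assumes that position 0 holds the [CLS] token.

```python
def cls_representation(hidden):
    """Sequence-level feature: the vector at position 0 (assumed to be [CLS])."""
    return hidden[0]

def mean_pool(hidden):
    """Average pooling over all token vectors: h_avg = (1/n) * sum_i h_i."""
    n = len(hidden)
    dim = len(hidden[0])
    return [sum(h[j] for h in hidden) / n for j in range(dim)]

def linear_classifier(vec, weights, bias):
    """Toy linear head: return the index of the highest-scoring class."""
    scores = [sum(w * x for w, x in zip(row, vec)) + b
              for row, b in zip(weights, bias)]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy encoder output H: 4 positions ([CLS] + 3 words), hidden size 2.
H = [[1.0, -1.0], [0.0, 2.0], [2.0, 0.0], [1.0, 3.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical 2-class head
b = [0.0, 0.0]

sentence_label = linear_classifier(cls_representation(H), W, b)  # text classification
token_labels = [linear_classifier(h, W, b) for h in H[1:]]       # token-level labels
sentence_embedding = mean_pool(H)                                # sentence embedding

print(sentence_label)      # 0
print(token_labels)        # [1, 0, 1]
print(sentence_embedding)  # [1.0, 1.0]
```

The encoder output `H` is the same in all three cases. Only the part of `H` we read and the head we attach change.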
8.10.3 Decoder-Only: Better for Generation
Decoder-only models keep only the decoder side of the Transformer. Because there is no encoder, there is no encoder output to attend to, so the blocks usually have no cross-attention sublayer either. Each block uses only masked self-attention and a feed-forward network. The structure can be written simply as:
\[ X \rightarrow \mathrm{Transformer\ Decoder\ Blocks} \rightarrow H \rightarrow \mathrm{LM\ Head} \rightarrow p(x_{t+1} \mid x_{\le t}) \]
The key property of decoder-only models is:
Every token can only see itself and the tokens before it.
That is, the \(i\)-th token can only observe:
\[ x_1, x_2, \dots, x_i \]
and cannot see:
\[ x_{i+1}, x_{i+2}, \dots, x_n \]
This constraint is implemented by the causal mask. Therefore, decoder-only models are naturally suited for autoregressive language modeling:
\[ p(x_1, x_2, \dots, x_n) = \prod_{t=1}^{n} p(x_t \mid x_{<t}) \]
That is, the probability of the whole text is decomposed into step-by-step conditional probabilities. The most typical decoder-only models are the GPT series.
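The effect of the causal mask on attention weights can be sketched as follows. This is a simplified illustration: the function name `causal_attention_weights` is our own, and the raw scores are given directly rather than computed from queries and keys. Implementations commonly set the scores of future positions to negative infinity before the softmax; skipping those positions, as done here, gives the same weights.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], restricted to positions j <= i.

    Future positions (j > i) receive weight exactly 0.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]                 # position i and everything before it
        m = max(visible)                             # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

# Four positions with identical raw scores, so only the mask shapes the result.
scores = [[0.0] * 4 for _ in range(4)]
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

Each row still sums to 1, but the weight is spread only over the current position and the positions to its left.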
The core training objective of GPT is next-token prediction. Given a text:
I love deep learning
During training, the model learns:
\[ p(\text{love} \mid \text{I}) \]
\[ p(\text{deep} \mid \text{I love}) \]
\[ p(\text{learning} \mid \text{I love deep}) \]
In other words, the model always predicts the next token from the previous tokens. This matches the decoder-only structure exactly, because masked self-attention ensures that the model cannot see future tokens when predicting a position.
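The training targets listed above can be enumerated with a few lines of code. The helper `next_token_pairs` is a hypothetical name, and word-level tokens are used only for readability. In practice, the causal mask lets the model compute all of these predictions in a single forward pass over the sequence rather than one prefix at a time.

```python
def next_token_pairs(tokens):
    """Return (prefix, target) pairs: each target is predicted from the tokens before it."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

for prefix, target in next_token_pairs(["I", "love", "deep", "learning"]):
    print(" ".join(prefix), "->", target)
# I -> love
# I love -> deep
# I love deep -> learning
```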
At inference time, GPT generates text in the same way:
\[ x_1 \rightarrow x_2 \rightarrow x_3 \rightarrow \cdots \]
Each step predicts the next token from the prefix that has already been generated. Therefore, decoder-only models are especially suitable for open-ended generation tasks, such as continuation, dialogue, summarization, code generation, answer generation, and instruction following. These tasks have a common property: the output is usually not a fixed category, but a piece of text that needs to be generated step by step.
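The step-by-step generation described above can be sketched as a greedy decoding loop. Everything in this example is a toy assumption: `TOY_TABLE` stands in for a trained model and looks only at the last token, whereas a real decoder-only model conditions on the whole prefix. The `<eos>` marker is a hypothetical end-of-sequence token, and the prefix is assumed to be non-empty.

```python
def greedy_generate(next_token_probs, prefix, max_new_tokens, eos="<eos>"):
    """Repeatedly append the most probable next token until eos or the length limit."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)   # dict: candidate token -> probability
        nxt = max(probs, key=probs.get)    # greedy choice
        if nxt == eos:
            break
        tokens.append(nxt)                 # the new token becomes part of the prefix
    return tokens

# Hypothetical next-token table, used only to make the loop runnable.
TOY_TABLE = {
    "I": {"love": 0.9, "<eos>": 0.1},
    "love": {"deep": 0.7, "learning": 0.3},
    "deep": {"learning": 0.8, "<eos>": 0.2},
    "learning": {"<eos>": 1.0},
}

def toy_next_token_probs(tokens):
    return TOY_TABLE[tokens[-1]]

print(greedy_generate(toy_next_token_probs, ["I"], max_new_tokens=10))
# ['I', 'love', 'deep', 'learning']
```

Greedy selection is only one decoding strategy. Sampling-based strategies follow the same loop and differ only in how the next token is chosen from `probs`.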
Since encoder-only models are better for understanding and decoder-only models are better for generation, can decoder-only models also do understanding tasks? The answer is yes. For example, for text classification, we can rewrite the problem as a generation task.
The original classification form might be:
Is the sentiment of this sentence positive or negative?
A decoder-only model can handle it with a prompt:
Sentence: I really like this movie.
Sentiment:
Then we ask the model to generate positive.
In other words, decoder-only models can unify many tasks as “given a prefix, generate an answer.” This is one reason modern large language models are so powerful. They do not necessarily need a separate classification head for every task; instead, they can turn different tasks into text generation through natural-language prompts.
Structurally, however, decoder-only models are still unidirectional. When processing an input, each position can only see the context on its left. But when the complete prompt is used as the prefix, the model can see the whole prompt at the answer position. For example:
Question: What is the capital of France? Answer:
When the model generates Paris, it can already observe the entire preceding question. This is the basic way decoder-only models are used for understanding tasks: instead of making every token inside the input interact bidirectionally, we organize the task as a prefix and generate the answer afterward.
8.10.4 Encoder-Decoder: Better for Input-to-Output Conversion
Encoder-decoder models use both an encoder and a decoder. The structure can be written as:
\[ H = \operatorname{Encoder}(X) \]
\[ p(y_t \mid y_{<t}, X) = \operatorname{Decoder}(y_{<t}, H) \]
Their key feature is that input understanding and output generation are separated. The encoder can model the input sequence bidirectionally:
\[ x_i \text{ can see } x_1, x_2, \dots, x_n \]
The decoder models the output sequence autoregressively:
\[ y_t \text{ can only depend on } y_1, y_2, \dots, y_{t-1} \]
At the same time, the decoder reads the encoder output through cross-attention:
\[ \operatorname{CrossAttention}(Q_{\mathrm{decoder}}, K_{\mathrm{encoder}}, V_{\mathrm{encoder}}) \]
This structure is especially suitable for seq2seq tasks: both the input and output are sequences, but they do not necessarily have the same length. Typical tasks include machine translation, text summarization, paraphrasing, grammar correction, text-to-SQL, text-to-code, and image captioning in multimodal tasks. Typical models include the original Transformer, T5, BART, and others.
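The cross-attention step can be sketched as scaled dot-product attention in which the queries come from the decoder and the keys and values come from the encoder. This is a simplified single-head illustration with made-up vectors. Real models first apply learned linear projections to obtain the queries, keys, and values, which are omitted here.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention with decoder queries and encoder keys/values.

    queries: n vectors, one per target position.
    keys, values: m vectors each, one per source position.
    Returns n output vectors. No causal mask is applied over the source,
    so every target position can read every source position.
    """
    d = len(keys[0])
    dim_v = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(dim_v)])
    return outputs

# Toy sizes: m = 3 source positions, n = 2 target positions, dimension 2.
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
Q = [[0.0, 0.0], [10.0, -10.0]]

out = cross_attention(Q, K, V)
print(len(out), len(out[0]))  # 2 2
```

The output has one vector per target position, regardless of the source length. This is why the input and output sequences do not need to have the same length.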
T5 is a typical encoder-decoder model. Its core idea is to convert all NLP tasks into a text-to-text format.
For example, a translation task can be written as:
Translate English to Chinese: That is good.
The output is: 那很好。
A summarization task can be written as:
Summarize: long article ...
The output is: a short summary...
A classification task can also be written as:
Sentiment: I love this movie.
The output is: positive.
In this way, no matter what form the original task had, it can be unified as:
\[ \text{text input} \rightarrow \text{text output} \]
This matches the encoder-decoder structure very well. The encoder is responsible for understanding the input text, and the decoder is responsible for generating the output text.
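The text-to-text idea can be sketched as a small formatting function. The function name `to_text_to_text` is our own, and the prefixes simply mirror the examples in this section. They are not necessarily the exact prefixes used by the released T5 models.

```python
def to_text_to_text(task, text):
    """Prepend a task prefix so that every task becomes 'text in -> text out'."""
    prefixes = {
        "translate": "Translate English to Chinese: ",
        "summarize": "Summarize: ",
        "sentiment": "Sentiment: ",
    }
    return prefixes[task] + text

# (task, source text, target text) triples taken from the examples above.
examples = [
    ("translate", "That is good.", "那很好。"),
    ("sentiment", "I love this movie.", "positive"),
]

pairs = [(to_text_to_text(task, src), tgt) for task, src, tgt in examples]
for model_input, model_target in pairs:
    print(repr(model_input), "->", repr(model_target))
```

After this step, translation and classification examples have the same shape: an input string for the encoder and a target string for the decoder.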
Compared with decoder-only models, encoder-decoder models have a structural advantage on “given an input, generate an output” tasks: the input part can be modeled bidirectionally by the encoder, while the output part is still generated autoregressively by the decoder. In other words, the input text itself does not need to obey the left-to-right generation constraint; the encoder can fully understand the entire input.
8.10.5 Core Differences Between the Three Architectures
We can understand the three architectures from the perspective of attention masks. Encoder-only uses bidirectional self-attention:
\[ x_i \rightarrow x_1, x_2, \dots, x_n \]
Every position can see the whole input.
Decoder-only uses causal self-attention:
\[ x_i \rightarrow x_1, x_2, \dots, x_i \]
Every position can only see itself and previous positions.
Encoder-decoder uses both kinds of attention. In the encoder, it uses bidirectional self-attention:
\[ x_i \rightarrow x_1, x_2, \dots, x_m \]
In the decoder, it uses causal self-attention:
\[ y_t \rightarrow y_1, y_2, \dots, y_t \]
The decoder also looks at the encoder output through cross-attention:
\[ y_t \rightarrow H_{\mathrm{encoder}} \]
Structurally, we can summarize the three architectures in the following table:
| Architecture | Can it see input bidirectionally? | Typical models | Common tasks |
|---|---|---|---|
| Encoder-only | Yes | BERT | Understanding, classification, extraction, embedding |
| Decoder-only | No, causal | GPT | Continuation, dialogue, generation, instruction following |
| Encoder-Decoder | Encoder bidirectional, decoder causal | Transformer, T5, BART | Translation, summarization, input-to-output conversion |
This table is only a broad summary. In reality, different models have many variants, but the basic idea is the same.
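The visibility patterns compared in this subsection can be written down as boolean matrices, where entry (i, j) is true if position i may attend to position j. The function names are our own, and padding masks are ignored in this sketch.

```python
def bidirectional_mask(n):
    """Encoder self-attention: every position sees all n positions."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Decoder self-attention: position i sees positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]

def cross_mask(n_target, m_source):
    """Cross-attention: every target position sees all m source positions."""
    return [[True] * m_source for _ in range(n_target)]

def show(mask):
    """Render a mask as rows of 1 (visible) and 0 (hidden)."""
    return "\n".join(" ".join("1" if v else "0" for v in row) for row in mask)

print(show(bidirectional_mask(3)))
# 1 1 1
# 1 1 1
# 1 1 1
print(show(causal_mask(3)))
# 1 0 0
# 1 1 0
# 1 1 1
print(show(cross_mask(2, 4)))
# 1 1 1 1
# 1 1 1 1
```

In these terms, an encoder-only model uses only the first pattern, a decoder-only model uses only the second, and an encoder-decoder model uses all three: the first in the encoder, the second and third in the decoder.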
8.10.6 When to Use Which Structure
If the task is mainly to understand an existing input and output a class, label, or vector representation, encoder-only is usually used. For example:
Given a piece of text, decide whether its sentiment is positive or negative.
This kind of task can use a BERT-style model: encode the input into contextual representations, then attach a classification head.
If the task is open-ended generation, decoder-only is usually used. For example:
Given a beginning, continue writing a piece of text.
This kind of task is inherently left-to-right generation, so GPT-style models are very suitable.
If the task is a clear input-to-output conversion, encoder-decoder is usually used. For example:
Given an article, generate a summary.
Given an English sentence, translate it into Chinese.
Given a natural-language question, generate an SQL query.
This kind of task has a complete input and an output that needs to be generated. The encoder first understands the input bidirectionally, and then the decoder generates the output autoregressively based on the input.
However, in modern large models, these boundaries are not absolute. Decoder-only models can also perform classification, translation, and summarization through prompts; encoder-decoder models can also do many generation tasks; encoder-only models can also be specially designed for retrieval or matching. So more precisely, architecture is not the only determinant of task capability. It simply provides a more natural computational form for certain tasks.
8.10.7 Why Decoder-Only Is So Popular Today
From the perspective of traditional NLP tasks, encoder-only and encoder-decoder are both natural. BERT is well suited for understanding tasks, and T5 is well suited for text-to-text tasks. But in recent years, decoder-only architectures have become very popular in large language models.
One important reason is:
Next-token prediction is a very general and very scalable training objective.
As long as we have a large amount of text, we can train the model to predict the next token:
\[ p(x_t \mid x_{<t}) \]
This means we do not need to manually annotate task-specific labels for every sample, and we do not need to organize the data into a specific input-output format.
At the same time, decoder-only models are also simple during inference: input a piece of text, and the model continues writing from it. Many tasks can be rewritten into this form. For example, question answering means inputting a question and letting the model generate the answer; translation means inputting a translation request and source sentence, then letting the model generate the translation; classification can also be written as “please decide which category the following text belongs to,” then letting the model generate the category name; code tasks can input a requirement description or existing code and let the model continue completing it.
Therefore, although these tasks look very different on the surface, to a decoder-only model they can all be reduced to the same thing:
\[ \text{prompt} \rightarrow \text{completion} \]
This uniformity is very suitable for large-scale pretraining and instruction tuning. Of course, this does not mean decoder-only is naturally better than encoder-only or encoder-decoder on every task. Architecture choice still depends on the task form, data scale, training method, and inference requirements.
8.10.8 Section Summary
In this section, we made a side-by-side comparison of the three common Transformer architectures.
Encoder-only uses only the Transformer Encoder, and its core is bidirectional self-attention. It lets every token see the entire input sequence, so it is well suited for text understanding, classification, extraction, and representation learning. BERT is the most typical encoder-only model.
Decoder-only uses only the Transformer Decoder with a causal mask and no cross-attention, and its core is left-to-right autoregressive modeling. It learns the language distribution through next-token prediction and generates text step by step during inference. The GPT series is the most typical family of decoder-only models.
Encoder-decoder uses both an encoder and a decoder. The encoder understands the input bidirectionally, while the decoder generates output through masked self-attention and reads encoder representations through cross-attention. It is very suitable for input-to-output conversion tasks such as machine translation, summarization, and paraphrasing. The original Transformer, T5, and BART all belong to this category.
From a broader perspective, these three architectures are not isolated model families; they are different ways of combining Transformer components. Understanding their differences helps us see more clearly why BERT is good at understanding, GPT is good at generation, and T5/BART are good at text-to-text conversion.
At this point, the main thread of Chapter 8, on attention and the Transformer, is essentially complete. Next, we continue toward implementation: using Hugging Face Transformers to load real models and observe how different architectures differ in code interfaces, inputs and outputs, and attention masks.