8.6 Transformer Encoder: Stacking Self-Attention Layers

Author

jshn9515

Published

2026-05-03

Modified

2026-05-03

In the previous sections, we took apart several of the most central components of the Transformer.

Self-attention lets each token in a sequence directly exchange information with other tokens; multi-head attention lets the model observe the sequence in parallel from multiple representation spaces; positional encoding adds the order information that self-attention itself lacks.

But so far, what we have seen are still only separate components. A real Transformer Encoder does not just perform self-attention once. Instead, it organizes these components into a module that can be repeatedly stacked. This module is usually called a Transformer Encoder Block, or simply an Encoder Block or Encoder Layer.

From the overall view, a Transformer Encoder is just multiple encoder blocks stacked together:

\[ X \rightarrow \text{Embedding} + \text{Positional Encoding} \rightarrow \text{Encoder Block}_1 \rightarrow \cdots \rightarrow \text{Encoder Block}_N \]

In this section, we will look at how the Transformer encoder turns self-attention into a complete network structure.

from collections.abc import Callable

import dnnl.nn as dnn
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor

print('PyTorch version:', torch.__version__)

PyTorch version: 2.12.0+xpu

8.6.1 Why Stack Multiple Layers of Self-Attention?

Let’s first look at an intuitive question: since one layer of self-attention already lets every token see all tokens, why do we still need to stack many layers?

The reason is: seeing all tokens does not mean the model has already completed a sufficiently complex understanding.

What one layer of self-attention mainly does is let each position aggregate information from other positions based on its current representation. It does establish global connections, but this connection is only one round of information mixing. For complex language phenomena, the model usually needs multiple rounds of processing before it can form higher-level representations.

For example, consider this sentence:

The animal didn’t cross the street because it was too tired.

For the model to understand that "it" refers to "animal", it is not enough to look at the similarity between the two words. The model also needs to combine semantics, syntactic structure, commonsense relations, and the descriptions in the context. The first layer may capture local word meanings and phrase relations; higher layers may then combine this information into a more abstract sentence-level understanding.

So, the role of a multi-layer Transformer Encoder is not to simply repeat the same operation, but to make the representations become more abstract layer by layer.

We can roughly understand it this way:

Lower layers tend to focus more on local and word-level information;
Middle layers gradually combine phrases and contextual relations;
Higher layers tend to focus more on task-related semantic representations.

Of course, this is only an intuitive explanation. It does not mean every layer has a fixed function that humans can understand and explain. But structurally, stacking multiple layers does give the model the ability to process information repeatedly. This is the same idea as stacking layers in CNNs or RNNs.

8.6.2 The Overall Structure of an Encoder Block

A standard Transformer Encoder block mainly contains two submodules:

Multi-Head Self-Attention
Position-wise Feed-Forward Network

In addition, each submodule is also equipped with two very important structures:

Residual Connection
Layer Normalization

So, an encoder block can be written in the following form:

\[ H = \operatorname{LayerNorm}(X + \operatorname{MultiheadAttention}(X)) \]

\[ Y = \operatorname{LayerNorm}(H + \operatorname{FFN}(H)) \]

Here, \(X\) is the input representation of the current layer, \(Y\) is the output representation of the current layer, and FFN is the Position-wise Feed-Forward Network.

From the data flow, it looks like this:

Figure 1: Transformer Encoder Block structure diagram

This is the most basic encoder block structure.

One thing to note is that the formulation above is the Post-LN form, which is easier to understand. That is, the submodule is applied first, then the residual connection, and finally LayerNorm. The original Transformer used this structure. Many later Transformers use the Pre-LN structure instead, in which LayerNorm is applied to the input before it is sent into the submodule. When we write the code later, we will use the Pre-LN version, which is more common today and more stable to train.
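The two Post-LN equations can be sketched directly in PyTorch. This is only a minimal illustration, not the implementation used later in this section: it assumes PyTorch's built-in `nn.MultiheadAttention` (with `batch_first=True`) in place of the module from Section 8.4, and the class name `PostLNBlock` and all sizes are hypothetical.

```python
import torch
import torch.nn as nn


class PostLNBlock(nn.Module):
    """Hypothetical Post-LN encoder block: LayerNorm(x + Sublayer(x))."""

    def __init__(self, d_model: int, num_heads: int, d_ff: int):
        super().__init__()
        # Assumption: PyTorch's built-in attention stands in for the Section 8.4 module.
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        h = self.norm1(x + attn_out)     # H = LayerNorm(X + MultiheadAttention(X))
        y = self.norm2(h + self.ffn(h))  # Y = LayerNorm(H + FFN(H))
        return y


torch.manual_seed(0)
post_ln_block = PostLNBlock(d_model=64, num_heads=4, d_ff=256).eval()
post_ln_x = torch.randn(2, 10, 64)  # (batch_size, seq_len, d_model)
with torch.no_grad():
    post_ln_y = post_ln_block(post_ln_x)
```

Because the last operation is a LayerNorm, every output token vector of this block is normalized, which is the defining property of the Post-LN arrangement.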

8.6.2.1 First Part: Multi-Head Attention

The first part of an encoder block is multi-head attention. One thing to note is that the attention here is self-attention, and \(Q\), \(K\), and \(V\) all come from the same input \(X\):

\[ Q = XW_Q,\quad K = XW_K,\quad V = XW_V \]

Then we compute:

\[ \operatorname{MultiheadAttention}(Q,K,V) \]

Here, each head in multi-head attention performs scaled dot-product attention in a different projection space.

The role of this part is to let different tokens exchange information with each other. After self-attention, the representation of each token no longer contains only its own word meaning and position information. Instead, it has fused information from the context of the whole sentence. So, by continuously stacking encoder blocks, we can gradually turn a sequence of originally relatively isolated token embeddings into a sequence of contextualized token representations.
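To make the point that \(Q\), \(K\), and \(V\) all come from the same input concrete, here is a small sketch. It assumes PyTorch's built-in `nn.MultiheadAttention` rather than the module from Section 8.4, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, num_heads = 64, 4
self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True).eval()

x = torch.randn(2, 5, d_model)  # (batch_size, seq_len, d_model)
with torch.no_grad():
    # Self-attention: query, key, and value are all the same tensor x.
    # The projections W_Q, W_K, W_V are applied inside the module.
    context, weights = self_attn(x, x, x, need_weights=True)

# context: (batch_size, seq_len, d_model), one contextualized vector per token.
# weights: (batch_size, seq_len, seq_len), averaged over heads by default.
```

Each row of `weights` is a probability distribution over the positions of the same sequence, so each output vector is a weighted mixture of information from the whole sentence.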

8.6.2.2 Second Part: Residual Connection

After multi-head attention, Transformer does not directly send the attention output to the next step. Instead, it adds the original input back:

\[ X + \operatorname{MultiheadAttention}(Q,K,V) \]

This is the Residual Connection, which is also an important structure we learned earlier in ResNet.

There are two benefits to doing this. First, it preserves the original information. If some information does not need to be modified for now, the residual path makes it easier to pass that information to later layers. Second, it alleviates the difficulty of training deep networks. Without residual connections, everything a layer outputs must pass through the submodule itself, so the submodule has to learn to preserve the original input information. As more and more encoder blocks are stacked, this makes the transmission of both information and gradients more difficult. The residual connection provides a more direct path: the original representation is carried forward unchanged to later layers, gradients can flow back along the same path, and the submodule only needs to learn incremental modifications to the original representation. This makes deep Transformers easier to train.

So, the Transformer encoder is not simply one layer overwriting another layer. Instead, each layer gradually revises the original representation.
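The idea that a submodule only adds a revision on top of its input can be illustrated with a toy example. The `Residual` wrapper below is hypothetical and uses a plain linear layer as the submodule; zeroing its parameters is only a way to simulate a submodule that has nothing to contribute.

```python
import torch
import torch.nn as nn


class Residual(nn.Module):
    """Hypothetical wrapper that computes x + sublayer(x)."""

    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(x)


d_model = 16
sublayer = nn.Linear(d_model, d_model)
# Zero the sublayer so that it contributes no update at all.
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)

block = Residual(sublayer)
x = torch.randn(3, d_model)
with torch.no_grad():
    y = block(x)           # equal to x: the residual path carries the input through
    y_plain = sublayer(x)  # all zeros: without the residual path the input is lost
```

With the residual path, a submodule that outputs zero leaves the representation unchanged; without it, the same submodule would erase the input entirely.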

8.6.2.3 Third Part: Layer Normalization

After the residual connection, Transformer usually applies a Layer Normalization. The role of LayerNorm is to normalize the feature dimension of each token. Suppose the representation of a certain token is:

\[ X \in \mathbb{R}^{d_\mathrm{model}} \]

LayerNorm computes the mean and variance inside this vector, and then standardizes it:

\[ \operatorname{LayerNorm}(X) = \gamma \frac{X - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \]

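Here, \(\mu\) and \(\sigma^2\) are the mean and variance of the \(d_\mathrm{model}\) components of \(X\), \(\epsilon\) is a small constant for numerical stability, and \(\gamma\) and \(\beta\) are learnable parameters.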
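The formula can be checked numerically against `nn.LayerNorm`. The sketch below uses arbitrary sizes and computes the statistics over the last (feature) dimension using the biased variance, which is what `nn.LayerNorm` uses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, eps = 8, 1e-5
x = torch.randn(2, 4, d_model)  # (batch_size, seq_len, d_model)

layer_norm = nn.LayerNorm(d_model, eps=eps)

with torch.no_grad():
    # Statistics are computed over the feature dimension of each token.
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased (population) variance
    manual = layer_norm.weight * (x - mu) / torch.sqrt(var + eps) + layer_norm.bias

    reference = layer_norm(x)

max_difference = (manual - reference).abs().max().item()
```

The two results agree up to floating-point error, and each token is normalized using only its own features.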

Intuitively, LayerNorm can make the representation scale of each layer more stable. Otherwise, after multiple layers of attention, FFN, and residual connections, the mean and variance of activation values may keep changing, and the representation scale may also grow or shrink layer by layer. This makes deep networks harder to train. By normalizing the feature dimension of each token, LayerNorm helps the model maintain a relatively stable activation distribution, thereby alleviating training instability.

As for why Transformer more commonly uses LayerNorm instead of BatchNorm, the main reason is that LayerNorm does not depend on batch statistics. For variable-length sequences, the sentence lengths, amount of padding, and token positions inside a batch can all be different. If BatchNorm is used, the normalization result will be affected by the composition of the current batch. LayerNorm normalizes the feature dimension of each token itself, so it is more suitable for sequence models like Transformer.
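The dependence on batch composition can be seen in a small experiment. In this sketch (toy sizes, plain feature vectors instead of real sequences), the same sample is placed in two different batches; `nn.BatchNorm1d` is left in training mode so that it uses the statistics of the current batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8
sample = torch.randn(1, d_model)
batch_a = torch.cat([sample, torch.randn(3, d_model)])        # sample plus 3 companions
batch_b = torch.cat([sample, 5.0 * torch.randn(3, d_model)])  # same sample, different companions

layer_norm = nn.LayerNorm(d_model)
batch_norm = nn.BatchNorm1d(d_model)  # training mode: normalizes with current-batch statistics

with torch.no_grad():
    ln_a, ln_b = layer_norm(batch_a)[0], layer_norm(batch_b)[0]
    bn_a, bn_b = batch_norm(batch_a)[0], batch_norm(batch_b)[0]

ln_same = torch.allclose(ln_a, ln_b, atol=1e-6)  # LayerNorm: unaffected by the other samples
bn_same = torch.allclose(bn_a, bn_b, atol=1e-6)  # BatchNorm: changes with the batch
```

The LayerNorm output for the shared sample is identical in both batches, while the BatchNorm output changes with the other samples in the batch.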

8.6.2.4 Fourth Part: Position-wise Feed-Forward Network

The fourth part of an encoder block is the feed-forward network, usually abbreviated as FFN. Its form is very simple:

\[ \operatorname{FFN}(x) = \sigma(xW_1 + b_1)W_2 + b_2 \]

Here, \(\sigma\) is a nonlinear activation function. The original Transformer used ReLU, while modern models also often use GELU or SiLU.

If the input dimension is \(d_\mathrm{model}\), the FFN usually first expands the dimension to a larger \(d_\mathrm{ff}\), and then projects it back to \(d_\mathrm{model}\):

\[ d_\mathrm{model} \rightarrow d_\mathrm{ff} \rightarrow d_\mathrm{model} \]

For example, a common setting is:

\[ d_\mathrm{ff} = 4d_\mathrm{model} \]

In other words, the FFN first maps each token representation into a higher-dimensional space for nonlinear transformation, and then compresses it back to the original dimension.
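A minimal sketch of this expand-then-project structure, assuming ReLU as the activation and \(d_\mathrm{model} = 512\) (the sizes are only an example):

```python
import torch
import torch.nn as nn

d_model = 512
d_ff = 4 * d_model  # common choice: 2048 for d_model = 512

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),  # expand: d_model -> d_ff
    nn.ReLU(),                 # nonlinearity used in the original Transformer
    nn.Linear(d_ff, d_model),  # project back: d_ff -> d_model
)

x = torch.randn(2, 32, d_model)  # (batch_size, seq_len, d_model)
with torch.no_grad():
    hidden = ffn[1](ffn[0](x))  # (batch_size, seq_len, d_ff)
    y = ffn(x)                  # (batch_size, seq_len, d_model)

num_params = sum(p.numel() for p in ffn.parameters())
```

With these sizes the FFN holds \(2 d_\mathrm{model} d_\mathrm{ff} + d_\mathrm{ff} + d_\mathrm{model} = 2{,}099{,}712\) parameters.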

Here we need to pay attention to one keyword: position-wise. Position-wise means that this FFN acts independently on each position. It does not directly let different tokens exchange information. That is, for the input:

\[ H = [h_1, h_2, \dots, h_n] \]

The FFN does this:

\[ \operatorname{FFN}(H) = [\operatorname{FFN}(h_1), \operatorname{FFN}(h_2), \dots, \operatorname{FFN}(h_n)] \]

Different positions use the same set of FFN parameters, but they are computed independently.
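The position-wise property can be verified with a short experiment. The sketch below uses toy sizes: it applies the same FFN once to the whole sequence and once to each position separately, then perturbs a single position.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff, seq_len = 16, 64, 5
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

h = torch.randn(seq_len, d_model)  # one sequence: [h_1, ..., h_n]
with torch.no_grad():
    whole = ffn(h)                                       # FFN applied to the whole sequence
    per_position = torch.stack([ffn(h_i) for h_i in h])  # FFN applied to each h_i separately

    # Changing one position leaves the outputs at all other positions unchanged.
    h_perturbed = h.clone()
    h_perturbed[2] += 1.0
    whole_perturbed = ffn(h_perturbed)

others = [0, 1, 3, 4]  # every position except the perturbed one
```

Both ways of applying the FFN give the same result, and only the output at the perturbed position changes. Self-attention behaves differently: there, changing one token generally affects the outputs at all positions.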

At this point, we can understand the division of labor between the two major modules in an encoder block as:

Self-attention is responsible for information exchange between tokens;
FFN is responsible for further nonlinear processing of each token representation.

One module mixes context, and the other processes representations. Only when the two are combined do we get a complete Transformer Encoder block.

8.6.3 Pre-LN and Post-LN

For easier understanding, we wrote the encoder block earlier as:

\[ H = \operatorname{LayerNorm}(X + \operatorname{MultiheadAttention}(X)) \]

\[ Y = \operatorname{LayerNorm}(H + \operatorname{FFN}(H)) \]

This form is called Post-LN, because LayerNorm is placed after the submodule and residual connection.

But in many modern Transformer implementations, Pre-LN is more common:

\[ H = X + \operatorname{MultiheadAttention}(\operatorname{LayerNorm}(X)) \]

\[ Y = H + \operatorname{FFN}(\operatorname{LayerNorm}(H)) \]

That is, we first apply LayerNorm to the input, then send it into attention or FFN, and finally add the residual back.

Conceptually, the core structure of Pre-LN and Post-LN is the same: both have attention, FFN, residual connections, and LayerNorm. The only difference is where LayerNorm is placed. In Post-LN it is applied after the residual addition, so every layer's output passes through a normalization; in Pre-LN it is applied only to the input of the submodule, so the residual path itself is left untouched. Research has found that Pre-LN trains more stably than Post-LN, especially when the model is very deep, because gradients can propagate more smoothly along this unobstructed residual path.
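PyTorch's built-in `nn.TransformerEncoderLayer` exposes this choice through its `norm_first` argument, which gives a quick way to compare the two arrangements. The sketch below uses that built-in layer, not the implementation written in this section, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
common = dict(d_model=64, nhead=4, dim_feedforward=256, dropout=0.0, batch_first=True)

post_ln_layer = nn.TransformerEncoderLayer(norm_first=False, **common)  # Post-LN (default)
pre_ln_layer = nn.TransformerEncoderLayer(norm_first=True, **common)    # Pre-LN

x = torch.randn(2, 10, 64)  # (batch_size, seq_len, d_model)
# dropout=0.0 keeps the outputs deterministic even in training mode.
post_ln_out = post_ln_layer(x)
pre_ln_out = pre_ln_layer(x)

# Post-LN ends with a LayerNorm, so each output token has mean ~0 and variance ~1.
post_ln_token_mean = post_ln_out.mean(dim=-1)
post_ln_token_var = post_ln_out.var(dim=-1, unbiased=False)
```

The Pre-LN output is the raw sum of the residual stream and the submodule outputs, so it is not normalized. This is why Pre-LN encoders usually apply one extra LayerNorm after the last block.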

To make the code closer to modern implementations, we will use the Pre-LN version below.

8.6.4 PyTorch Implementation of Transformer Encoder Block

Below, we implement a simplified Transformer Encoder block. To avoid re-implementing multi-head attention, we directly use the MultiheadAttention implemented in Section 8.4.

class TransformerEncoderLayer(nn.Module):
    def __init__(
        self,
        d_model: int,
        num_heads: int,
        dim_feedforward: int = 2048,
        activation: Callable[[Tensor], Tensor] = F.relu,
        bias: bool = True,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.self_attn = dnn.MultiheadAttention(
            d_model,
            num_heads,
            bias=bias,
            dropout=dropout,
        )

        self.linear1 = nn.Linear(d_model, dim_feedforward, bias=bias)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model, bias=bias)
        self.activation = activation

        self.norm1 = nn.LayerNorm(d_model, bias=bias)
        self.norm2 = nn.LayerNorm(d_model, bias=bias)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(
        self,
        src: Tensor,
        src_mask: Tensor | None = None,
        src_key_padding_mask: Tensor | None = None,
    ) -> Tensor:
        x = src + self._sa_block(
            self.norm1(src),
            attn_mask=src_mask,
            key_padding_mask=src_key_padding_mask,
        )
        x = x + self._ff_block(self.norm2(x))
        return x

    def _sa_block(
        self,
        x: Tensor,
        attn_mask: Tensor | None,
        key_padding_mask: Tensor | None,
    ) -> Tensor:
        x, _ = self.self_attn(
            x, x, x,
            attn_mask=attn_mask,
            key_padding_mask=key_padding_mask,
            need_weights=False,
        )  # fmt: skip
        x = self.dropout1(x)
        return x

    def _ff_block(self, x: Tensor) -> Tensor:
        x = self.linear1(x)
        x = self.activation(x)
        x = self.dropout(x)
        x = self.linear2(x)
        x = self.dropout2(x)
        return x


x = torch.randn(2, 32, 512)  # (batch_size, seq_len, d_model)
encoder_layer = TransformerEncoderLayer(d_model=512, num_heads=8).eval()  # eval() disables dropout

with torch.inference_mode():
    output = encoder_layer(x)

print('Encoder block output shape:', output.shape)

Encoder block output shape: torch.Size([2, 32, 512])

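The most important parts of this code are the two residual paths in forward:

x = src + self._sa_block(self.norm1(src), attn_mask=src_mask, key_padding_mask=src_key_padding_mask)

and:

x = x + self._ff_block(self.norm2(x))

The first residual path wraps multi-head self-attention, and the second wraps the FFN. In both cases, LayerNorm is applied only to the input of the submodule, and dropout is applied to the submodule output (inside _sa_block and _ff_block) before it is added back to the residual stream.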

8.6.5 Stacking Multiple Encoder Blocks

After we have a single encoder block, writing a complete Transformer Encoder becomes very easy. We only need to stack it multiple times.

class TransformerEncoder(nn.Module):
    def __init__(
        self,
        d_model: int,
        num_heads: int,
        num_layers: int,
        dim_feedforward: int = 2048,
        activation: Callable[[Tensor], Tensor] = F.relu,
        bias: bool = True,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.layers = nn.ModuleList(
            [
                TransformerEncoderLayer(
                    d_model=d_model,
                    num_heads=num_heads,
                    dim_feedforward=dim_feedforward,
                    activation=activation,
                    bias=bias,
                    dropout=dropout,
                )
                for _ in range(num_layers)
            ]
        )
        # Final LayerNorm: Pre-LN blocks leave the residual stream unnormalized.
        self.norm = nn.LayerNorm(d_model, bias=bias)

    def forward(
        self,
        src: Tensor,
        mask: Tensor | None = None,
        src_key_padding_mask: Tensor | None = None,
    ) -> Tensor:
        output = src
        for layer in self.layers:
            output = layer(
                output,
                src_mask=mask,
                src_key_padding_mask=src_key_padding_mask,
            )

        output = self.norm(output)
        return output


x = torch.randn(2, 32, 512)  # (batch_size, seq_len, d_model)
encoder = TransformerEncoder(d_model=512, num_heads=8, num_layers=6).eval()  # eval() disables dropout

with torch.inference_mode():
    output = encoder(x)

print('Encoder output shape:', output.shape)

Encoder output shape: torch.Size([2, 32, 512])

One thing to note is that the input x of this encoder is already the representation after token embedding and position embedding. In other words, before sending it into the encoder, we need to first convert token ids into embedding vectors and add positional encoding.

The original sequence input is token ids, and its shape is usually:

(batch_size, seq_len)

After token embedding and position embedding, it becomes:

(batch_size, seq_len, d_model)

Then it passes through multiple encoder blocks one by one, and the final output is still:

(batch_size, seq_len, d_model)

This means the encoder outputs one contextualized representation for each token in the input sequence. For example, if the input sentence has 10 tokens, then the encoder output will also have 10 vectors. Each vector corresponds to one position, but it has already fused contextual information from the whole sentence.
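The shape flow described above can be traced end to end in a short sketch. Several things here are assumptions made only for illustration: the vocabulary and model sizes are hypothetical, position information is added with a learned `nn.Embedding` rather than any particular positional encoding, and PyTorch's built-in `nn.TransformerEncoder` stands in for the classes defined above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, max_len, d_model = 1000, 128, 64  # hypothetical sizes
batch_size, seq_len = 2, 10

token_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(max_len, d_model)  # assumption: learned position embedding

# Assumption: PyTorch's built-in Pre-LN encoder stands in for the classes above.
layer = nn.TransformerEncoderLayer(
    d_model, nhead=4, dim_feedforward=256, batch_first=True, norm_first=True
)
encoder = nn.TransformerEncoder(
    layer, num_layers=2, norm=nn.LayerNorm(d_model), enable_nested_tensor=False
).eval()

input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))  # (batch_size, seq_len)
positions = torch.arange(seq_len).unsqueeze(0)                   # (1, seq_len), broadcast over batch

with torch.no_grad():
    x = token_emb(input_ids) + pos_emb(positions)  # (batch_size, seq_len, d_model)
    output = encoder(x)                            # (batch_size, seq_len, d_model)
```

The sequence length is preserved throughout: the encoder only changes what each position's vector contains, not how many vectors there are.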

8.6.6 Padding Mask: Ignoring Padded Positions

In actual training, the sentences inside a batch usually have different lengths. To combine them into one tensor, we pad shorter sentences to the same length.

For example:

I love deep learning [PAD] [PAD]
Transformers are very powerful models

Here, [PAD] is only used to align the lengths. It is not real input content. When the model performs self-attention, it should not attend to these padding tokens. But how do we tell the model which positions are padding?

This requires key_padding_mask.

In PyTorch’s nn.MultiheadAttention, the shape of key_padding_mask is usually:

(batch_size, seq_len)

Here, True means this position should be masked out, meaning other tokens are not allowed to attend to it.

For example:

key_padding_mask = input_ids == pad_token_id

Then pass it into the encoder together with the embedded input x (the encoder expects embeddings, not raw token ids):

output = encoder(x, src_key_padding_mask=key_padding_mask)

In this way, self-attention will ignore the padding positions during computation.

This is very important in NLP tasks. Otherwise, the model may treat [PAD] as a meaningful token, which would affect representation learning.
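The effect of the mask can be inspected directly on the attention weights. The sketch below assumes PyTorch's built-in `nn.MultiheadAttention`; the token ids, the choice of `0` as `pad_token_id`, and the sizes are all hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pad_token_id = 0  # hypothetical id reserved for [PAD]
input_ids = torch.tensor([
    [5, 6, 7, 8, 0, 0],  # 4 real tokens followed by 2 [PAD]
    [9, 3, 4, 2, 1, 7],  # no padding
])
key_padding_mask = input_ids == pad_token_id  # (batch_size, seq_len), True = ignore

d_model = 32
embedding = nn.Embedding(10, d_model, padding_idx=pad_token_id)
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True).eval()

with torch.no_grad():
    x = embedding(input_ids)  # (batch_size, seq_len, d_model)
    _, weights = attn(x, x, x, key_padding_mask=key_padding_mask, need_weights=True)

# weights: (batch_size, query_len, key_len), averaged over heads.
pad_attention = weights[0, :, 4:]  # attention paid to the two [PAD] keys
```

Every query in the first sentence assigns exactly zero weight to the two padded keys, and the remaining weights in each row still sum to one.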

8.6.7 What Can the Encoder Output Be Used For?

The output of a Transformer Encoder is a set of contextualized token representations:

\[ H = [h_1, h_2, \dots, h_n] \]

Different tasks use these outputs in different ways.

For sequence labeling tasks, such as named entity recognition, we usually use the output at each position:

\[ h_1, h_2, \dots, h_n \]

because each token needs a prediction result.
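As a minimal sketch of such a per-token head, a single linear layer can map every position to tag scores. The number of tags and the other sizes are hypothetical, and a random tensor stands in for the real encoder output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, d_model, num_tags = 2, 10, 64, 5  # hypothetical sizes

encoder_output = torch.randn(batch_size, seq_len, d_model)  # stand-in for h_1, ..., h_n
tagging_head = nn.Linear(d_model, num_tags)                 # shared across all positions

with torch.no_grad():
    tag_logits = tagging_head(encoder_output)   # (batch_size, seq_len, num_tags)
    predicted_tags = tag_logits.argmax(dim=-1)  # (batch_size, seq_len)
```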

For sentence classification tasks, such as sentiment classification, we usually need to represent the whole sentence as one vector. There are two common approaches.

The first is to add a special [CLS] token, and then use the output at the [CLS] position as the whole-sentence representation:

\[ h_{\mathrm{cls}} \]

The second is to apply pooling to the outputs of all tokens, such as average pooling:

\[ h_{\mathrm{avg}} = \frac{1}{n}\sum_{i=1}^{n}h_i \]
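Both pooling strategies take only a few lines. The sketch below uses a toy tensor as the encoder output and assumes that the [CLS] token sits at position 0. It also excludes padded positions from the average, which the formula above does not mention but is usually desirable when batches are padded.

```python
import torch

# Toy encoder output: 1 sequence, 4 positions, d_model = 2.
encoder_output = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]])
key_padding_mask = torch.tensor([[False, False, False, True]])  # last position is [PAD]

# [CLS]-style pooling, assuming the [CLS] token sits at position 0.
h_cls = encoder_output[:, 0]  # (batch_size, d_model)

# Average pooling over real tokens only.
keep = (~key_padding_mask).unsqueeze(-1).float()           # (batch_size, seq_len, 1)
token_counts = keep.sum(dim=1).clamp(min=1.0)              # (batch_size, 1)
h_avg = (encoder_output * keep).sum(dim=1) / token_counts  # (batch_size, d_model)
```

Here `h_avg` is `[[3.0, 4.0]]`, the mean of the three real token vectors; the large values at the padded position do not leak into the sentence representation.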

This is also why Transformer Encoder can be used for many different tasks. It is responsible for encoding the input sequence into contextualized representations. As for whether we attach a classification head, a tagging head, or a retrieval head afterward, that depends on the specific task.

8.6.8 Section Summary

In this section, we combined the self-attention, multi-head attention, and positional encoding that we learned earlier to form the complete Transformer Encoder.

An encoder block mainly consists of two parts: multi-head self-attention and a position-wise feed-forward network. The former is responsible for letting different tokens exchange contextual information, while the latter is responsible for nonlinear processing of each token representation. Each submodule is also equipped with a residual connection and LayerNorm, making deep networks easier to train.

After multiple encoder blocks are stacked, the model can perform multiple rounds of information interaction and representation processing on the input sequence. In the final output, each token vector is no longer just its own embedding, but a representation that has fused the whole context.

At this point, we have finished the core structure of the Transformer Encoder. In the next section, we continue forward and look at the Transformer decoder. It is very similar to the encoder, but it has one key restriction: during generation, the current position cannot peek at future tokens. Otherwise, the model would not be predicting the next word; it would be copying the answer. This leads us to masked self-attention.

Reuse

CC BY-NC 4.0