Part 4: Attention Mechanism and Transformer

Author

jshn9515

Published

2026-05-05

Modified

2026-07-06

Title	Author	Date
8.1 Bahdanau Attention: From Information Compression to Dynamic Retrieval	jshn9515	2026-04-09
8.2 Cross-Attention: One Sequence Querying Another Sequence	jshn9515	2026-04-09
8.3 Self-Attention: Internal Information Interaction Within a Sequence	jshn9515	2026-04-09
8.4 Multi-Head Attention: From Single Perspective to Multiple Perspectives	jshn9515	2026-04-09
8.5 Positional Encoding: Adding Positional Information to Attention	jshn9515	2026-04-09
8.6 Transformer Encoder: Stacking Self-Attention Layers	jshn9515	2026-05-03
8.7 Transformer Decoder: Masked Self-Attention and Cross-Attention	jshn9515	2026-05-05
8.8 Encoder-Decoder Transformer: Connecting Encoder and Decoder	jshn9515	2026-05-05
8.9 KV Cache: Why We Don’t Recompute the Past During Inference	jshn9515	2026-05-05
8.10 Three Different Transformer Architectures: Understanding, Generation, and Input-Output Conversion	jshn9515	2026-05-08
8.11 Hugging Face Transformers API: From Structure to Calls	jshn9515	2026-05-09
10.1 Why Attention Is IO-Bound	jshn9515	2026-03-19
10.2 FlashAttention v1: Eliminating the IO Bottleneck in Attention Mechanisms	jshn9515	2026-03-19
