deep-learning-notes
中文
English
Part 4: Attention Mechanism and Transformer
Part 1: Deep Learning Fundamentals
Chapter 1: Introduction to Deep Learning
1.1 Neural Networks: A Learnable Function
1.3 Forward Propagation, Backpropagation, and Computation Graph
Chapter 2: Getting Started with PyTorch
2.1 Automatic Differentiation in PyTorch
2.2 Gradient Recording and Control in PyTorch
Part 4: Attention Mechanism and Transformer
Chapter 8: Attention and Transformer: From Dynamic Retrieval to Sequence Modeling
8.1 Bahdanau Attention: From Information Compression to Dynamic Retrieval
8.2 Cross-Attention: One Sequence Querying Another Sequence
8.3 Self-Attention: Internal Information Interaction within a Sequence
8.4 Multi-Head Attention: From Single Perspective to Multiple Perspectives
8.5 Positional Encoding: Adding Positional Information to Attention
8.6 Transformer Encoder: Stacking Self-Attention Layers
8.7 Transformer Decoder: Masked Self-Attention and Cross-Attention
8.8 Encoder-Decoder Transformer: Connecting Encoder and Decoder
8.9 KV Cache: Why We Don’t Recompute the Past During Inference
8.10 Three Different Transformer Architectures: Understanding, Generation, and Input-Output Conversion
8.11 Hugging Face Transformers API: From Structure to Calls
Chapter 10: Efficient Attention Implementations: From Memory-Efficient Attention to FlashAttention
10.1 Why Attention is IO-Bound
10.2 Flash Attention v1: Eliminating the IO Bottleneck in Attention Mechanisms
Part 6: Generative Models
Chapter 12: GAN: Learning to Generate through Adversarial Training
12.1 Generative Adversarial Network (GAN)
Chapter 13: VAE: From Compression and Reconstruction to Probabilistic Generation
13.1 AutoEncoder: Starting with Compression and Reconstruction
13.2 VAE: Probabilistic Modeling and the Reparameterization Trick
13.3 ELBO: Where Does the VAE Objective Function Come From?
13.4 VAE Training Phenomena and Latent Space Intuition
13.5 Advantages, Limitations, and Future Developments
Chapter 14: Diffusion Models: From Denoising to Generation
14.1 DDPM: From Denoising to Generation
14.2 The Forward Process of DDPM: From Image to Noise
14.3 DDPM’s Reverse Denoising Process and Training Objective
14.4 DDPM Network Structure and Sampling Process
14.5 DDPM from a Variational Derivation: Where Does the ELBO Come From?
Part 7: Multimodal Learning
Chapter 15: Vision-Language Models: From Image-Text Alignment to Multimodal Dialogue
15.1 CLIP: Connecting Images and Language with Contrastive Learning
Part 4: Attention Mechanism and Transformer
Author
jshn9515
Published
2026-05-05
Modified
2026-05-21
Order By
Default
Date - Oldest
Date - Newest
Title
Author
Title
Author
Date
10.1 Why Attention is IO-Bound
jshn9515
2026-03-19
10.2 Flash Attention v1: Eliminating the IO Bottleneck in Attention Mechanisms
jshn9515
2026-03-19
8.1 Bahdanau Attention: From Information Compression to Dynamic Retrieval
jshn9515
2026-04-09
8.10 Three Different Transformer Architectures: Understanding, Generation, and Input-Output Conversion
jshn9515
2026-05-08
8.11 Hugging Face Transformers API: From Structure to Calls
jshn9515
2026-05-09
8.2 Cross-Attention: One Sequence Querying Another Sequence
jshn9515
2026-04-09
8.3 Self-Attention: Internal Information Interaction within a Sequence
jshn9515
2026-04-09
8.4 Multi-Head Attention: From Single Perspective to Multiple Perspectives
jshn9515
2026-04-09
8.5 Positional Encoding: Adding Positional Information to Attention
jshn9515
2026-04-09
8.6 Transformer Encoder: Stacking Self-Attention Layers
jshn9515
2026-05-03
8.7 Transformer Decoder: Masked Self-Attention and Cross-Attention
jshn9515
2026-05-05
8.8 Encoder-Decoder Transformer: Connecting Encoder and Decoder
jshn9515
2026-05-05
8.9 KV Cache: Why We Don’t Recompute the Past During Inference
jshn9515
2026-05-05
No matching items
Reuse
CC BY-NC 4.0