These days AI is booming and new models appear all the time. If you are an engineer, or simply interested in working with AI or machine learning models, you should understand the Transformer architecture.
Today, most LLMs (large language models) use Transformers in their architecture.
Below is a flowchart of the Transformer architecture along with a technical explanation, so you can understand and remember it easily…
Input Sentence: "I love Transformers"
│
▼
[1] Input Embedding + Positional Encoding
(Converts words into numbers and adds positional info)
- "I" → vector [0.3, 0.8, ...]
- "love" → vector [0.1, 0.9, ...]
- "Transformers" → vector [0.6, 0.5, ...]
│
▼
[2] ENCODER (Repeated N times, typically 6):
┌─────────────────────────────────────────┐
│ Multi-Head Self-Attention               │
│ (Each word learns context from others)  │
│  - "love" relates strongly to "I"       │
│  - "Transformers" related contextually  │
│                    │                    │
│                    ▼                    │
│ Add & Norm (stabilizes training)        │
│                    │                    │
│                    ▼                    │
│ Feed Forward (further processing)       │
│ (Refines the meaning of each word)      │
│  - Enhances meanings, e.g., emotion     │
│                    │                    │
│                    ▼                    │
│ Add & Norm (stabilizes again)           │
└─────────────────────────────────────────┘
│
▼
[2] Encoder Output
(context-aware representation of input sequence)
│
▼
[3] Decoder (Starts generating the output, e.g., translation)
┌────────────────────────────────────────────────────────┐
│ Masked Multi-Head Attention                            │
│ (Predict next word based on previous words generated)  │
│  - Predicting next French word: "J'" → "aime"          │
│ (Only looks at previous words it's already generated)  │
│                           │                            │
│                           ▼                            │
│ Add & Norm (stabilizes training)                       │
│                           │                            │
│                           ▼                            │
│ Multi-Head Encoder-Decoder Attention                   │
│ (Looks back at Encoder Output to                       │
│  choose relevant words)                                │
│  - "love" → "aime"                                     │
│                           │                            │
│                           ▼                            │
│ Add & Norm (stabilizes again)                          │
│                           │                            │
│                           ▼                            │
│ Feed Forward                                           │
│ (Further refines word predictions)                     │
│                           │                            │
│                           ▼                            │
│ Add & Norm (final stabilization)                       │
└────────────────────────────────────────────────────────┘
│
▼
[4] Linear & Softmax Layer
(converts output vectors to word probabilities)
- Probability for "J'aime" (French for "I love")
- Probability for "les"
- Probability for "Transformers"
│
▼
[5] Output Probabilities
- Highest probability words selected as final output.
Example: Input English sentence "I love Transformers"
→ Output French sentence: "J'aime les Transformers."
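To make the flow above concrete, here is a minimal sketch using PyTorch's built-in nn.Transformer. The vocabulary sizes, dimensions, and token IDs below are made-up placeholders for illustration; a real translation model would also need a tokenizer, training, and a step-by-step decoding loop.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration
SRC_VOCAB, TGT_VOCAB, D_MODEL = 1000, 1000, 512

src_embed = nn.Embedding(SRC_VOCAB, D_MODEL)   # English token embeddings
tgt_embed = nn.Embedding(TGT_VOCAB, D_MODEL)   # French token embeddings
transformer = nn.Transformer(d_model=D_MODEL, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6)
generator = nn.Linear(D_MODEL, TGT_VOCAB)      # projects to vocabulary scores

# Fake token IDs for "I love Transformers" and "<START> J' aime les"
src = torch.tensor([[1], [2], [3]])            # shape: (src_len, batch)
tgt = torch.tensor([[0], [4], [5], [6]])       # shape: (tgt_len, batch)

# Causal mask so the decoder cannot peek at future target tokens
tgt_mask = transformer.generate_square_subsequent_mask(tgt.size(0))

# NOTE: a real model also adds positional encodings to the embeddings (step [1])
out = transformer(src_embed(src), tgt_embed(tgt), tgt_mask=tgt_mask)
probs = torch.softmax(generator(out), dim=-1)  # word probabilities per position
print(probs.shape)                             # (tgt_len, batch, TGT_VOCAB)
```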
-----------------------------------------------------------------------------
**Simple Example Explained:**
- **Input**: "I love Transformers"
- **Output**: "J'aime les Transformers"
- **What happens inside**:
- Encoder learns context ("love" related strongly to "I").
- Decoder, step by step, uses this learned context to accurately predict each word in French.

This flowchart and example give you an intuitive, step-by-step visualization of how the Transformer architecture processes data to generate meaningful outputs.
Here’s an in-depth technical explanation of the Transformer architecture, with each stage tied back to the simplified flowchart above:
Let’s use the translation example:
English: "I love Transformers"
French: "J'aime les Transformers"
- The Transformer receives the input sentence "I love Transformers".
- Words are converted into dense numeric vectors (e.g., 512-dimensional), capturing semantic meaning.
- Example:
  "I" → [0.2, 0.4, …, 0.6]
  "love" → [0.9, 0.3, …, 0.1]
  "Transformers" → [0.8, 0.7, …, 0.3]
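A minimal sketch of this embedding lookup in PyTorch, assuming a toy three-word vocabulary and a small embedding size (real models learn these vectors during training and typically use 512 or more dimensions):

```python
import torch
import torch.nn as nn

# Toy vocabulary: word -> integer ID (hypothetical, just for illustration)
vocab = {"I": 0, "love": 1, "Transformers": 2}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

token_ids = torch.tensor([vocab[w] for w in ["I", "love", "Transformers"]])
vectors = embedding(token_ids)     # shape: (3, 8) -- one dense vector per word
print(vectors.shape)
```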
- Since Transformers do not use recurrence (like RNNs), positional encoding provides each word’s position explicitly.
- Implemented via sine and cosine functions:
- Adds position information uniquely for each position.
- Allows the model to differentiate identical words based on their positions.
The exact formulas from the original paper ("Attention Is All You Need"):
  PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
  PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
So position 1 ("I"), position 2 ("love"), and position 3 ("Transformers") each receive a unique pattern of sine and cosine values across the embedding dimensions.
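A short sketch of these sinusoidal encodings in PyTorch (the formula follows the original paper; the sizes here are illustrative):

```python
import math
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Returns a (max_len, d_model) matrix of sinusoidal position encodings."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions: cosine
    return pe

# Add position info to the word embeddings from the previous sketch
# (shapes must match: here 3 positions x 8 dimensions)
# vectors = vectors + positional_encoding(3, 8)
```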
Each encoder layer has:
(a) Multi-Head Self-Attention
- Computes relationships between all words simultaneously.
- For each word:
- Query (Q): Represents the current word we’re focusing on.
- Key (K): Represents all words we compare against.
- Value (V): Represents all words whose contextual info we aggregate.
Mathematically:
Attention(Q,K,V) = softmax(QKᵀ/√dₖ) V
- QKᵀ: the dot-product score matrix measures how relevant each word is to every other word.
- Scaling by √dₖ keeps the scores from growing too large, which stabilizes training.
- Softmax normalizes the scores into probability-like weights.
Example: “love” attends strongly to “Transformers” more than “I”.
Why Multi-Head?
- Instead of one set of Q, K, V, it uses multiple parallel sets (heads), enabling attention from multiple perspectives simultaneously.
- Each head captures unique relationships between words.
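Here is a compact sketch of scaled dot-product attention implementing the formula above; multi-head attention simply runs several of these in parallel on learned projections of Q, K, and V and concatenates the results (PyTorch's nn.MultiheadAttention packages this up):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q Kᵀ / √dₖ) V"""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)     # relevance of every word to every other
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))  # used later by the decoder
    weights = F.softmax(scores, dim=-1)                   # probability-like weights
    return weights @ V

# Toy check: 3 words ("I love Transformers"), 8-dimensional vectors
x = torch.randn(3, 8)
out = scaled_dot_product_attention(x, x, x)   # self-attention: Q = K = V = x
print(out.shape)                              # torch.Size([3, 8])
```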
(b) Add & Norm (Residual Connection + LayerNorm)
- Residual connections add the original input of the layer to its output, reducing training difficulties (vanishing gradients).
- Layer normalization stabilizes the output distribution.
LayerNorm(x + Sublayer(x))
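In code, this pattern is only a couple of lines, for example:

```python
import torch.nn as nn

d_model = 8
norm = nn.LayerNorm(d_model)

def add_and_norm(x, sublayer_output):
    """LayerNorm(x + Sublayer(x)) -- residual connection followed by layer norm."""
    return norm(x + sublayer_output)

# e.g. add_and_norm(x, scaled_dot_product_attention(x, x, x))
```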
(c) Feed-Forward Network
- A two-layer fully connected neural network applied independently at each position.
- Typically involves:
- Linear Transformation → ReLU activation → another Linear Transformation.
- Helps model nonlinear interactions and further refines the representation.
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
- Followed again by another Add & Norm step.
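A sketch of the position-wise feed-forward network; the inner size of 2048 for a 512-dimensional model follows the original paper, but both numbers are just typical choices:

```python
import torch.nn as nn

d_model, d_ff = 512, 2048   # typical sizes from the original paper

# FFN(x) = max(0, x·W1 + b1)·W2 + b2, applied to each position independently
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)
```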
- Final encoder output is a contextualized representation of the input.
- Captures meaning & context relationships across entire input.
- The decoder begins generating the translation, starting from a <START> token embedding.
Each decoder layer includes three sublayers:
(a) Masked Multi-Head Self-Attention
- Similar to encoder self-attention, but masks future tokens.
- Ensures predictions for position t depend only on positions < t.
- Masking is implemented by setting scores for future positions to negative infinity, so softmax effectively ignores future tokens.
Example: while predicting "aime", it can see "J'" but not "les".
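A sketch of how such a causal (look-ahead) mask can be built; it plugs into the `mask` argument of the attention sketch shown earlier:

```python
import torch

seq_len = 4   # e.g. <START> J' aime les
# True above the diagonal = "future position, do not attend"
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask)
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])

# Usage with the earlier attention sketch (dec_x: (4, d) decoder vectors):
# scaled_dot_product_attention(dec_x, dec_x, dec_x, mask=causal_mask)
```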
(b) Encoder-Decoder Attention (Cross Attention)
- Decoder queries the encoder’s output (encoded input sequence) for relevant contextual information.
- Queries (Q): from decoder’s previous output.
- Keys (K) & Values (V): from encoder’s output.
Attention(Q_dec, K_enc, V_enc) = softmax(Q_dec K_encᵀ / √dₖ) V_enc
Example: "aime" strongly attends to the encoder's "love" vector.
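Cross-attention can reuse the same attention helper sketched earlier; the only change is where Q, K, and V come from (the shapes below are illustrative):

```python
import torch

d_model = 8
enc_out = torch.randn(3, d_model)   # encoder output: "I love Transformers"
dec_x   = torch.randn(4, d_model)   # decoder states: <START> J' aime les

# Queries come from the decoder, Keys/Values from the encoder output
cross = scaled_dot_product_attention(Q=dec_x, K=enc_out, V=enc_out)
print(cross.shape)   # torch.Size([4, 8]) -- one context vector per target position
```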
(c) Feed-Forward Network
- Same as encoder’s FFN, providing deeper refinement.
Each of these sublayers also includes Add & Norm (residual + normalization).
- The final decoder layer's output vectors are converted into word probabilities.
- A linear layer projects each output vector to the vocabulary size (one score per word).
- Softmax turns those scores into a probability distribution.
probabilities = softmax(Linear(decoder_output))
Example (simplified; each line shows the chosen word's probability at its decoding step):
"J'" → [0.9 probability]
"aime" → [0.95 probability]
"les" → [0.92 probability]
"Transformers" → [0.98 probability]
- The decoder generates words sequentially, choosing the highest-probability word at each step.
- Generation stops upon producing an <END> token or reaching the maximum length.
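A sketch of this greedy decoding loop, assuming a hypothetical `model` callable that maps source token IDs plus the tokens generated so far to next-word probabilities:

```python
import torch

START_ID, END_ID, MAX_LEN = 0, 1, 20   # hypothetical special-token IDs

def greedy_decode(model, src_ids):
    """Generate tokens one at a time, always taking the most probable next word."""
    generated = [START_ID]
    for _ in range(MAX_LEN):
        tgt = torch.tensor(generated).unsqueeze(1)   # (tgt_len, batch=1)
        probs = model(src_ids, tgt)                  # next-word probabilities
        next_id = int(probs[-1, 0].argmax())         # pick the best word
        if next_id == END_ID:                        # stop at <END>
            break
        generated.append(next_id)
    return generated[1:]                             # drop <START>
```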
Final output example:
Input: "I love Transformers"
Output: "J'aime les Transformers."
- Encoder: Embedding → Positional encoding → Self-Attention → Feed Forward → Output context vectors.
- Decoder: Embedding → Positional encoding → Masked Self-Attention → Encoder-Decoder Attention → Feed Forward → Softmax → Output sequence.
- Parallel processing: Self-attention layers process all tokens simultaneously.
- Context awareness: Each token directly relates to all other tokens.
- No vanishing gradient: Residual connections ensure deep model stability.
- Dynamic understanding: Multi-head attention captures diverse context and meanings.
This detailed explanation, tied to the simplified flowchart above, should help you understand each technical aspect of the Transformer architecture step by step.