šŸ” Demystifying Scaling in Self-Attention: Why Divide by āˆšdā‚–? | by Navdeep Sharma | Feb, 2025


Greetings,

In the previous blog, Decoding Self-Attention: A Deep Dive into Contextual Word Embeddings, I discussed the self-attention mechanism in depth and tried to derive it from first principles. If you haven't read it yet, I strongly recommend going through it once, as it will make this blog much easier to follow. If you already know the self-attention mechanism, you can continue reading this blog.

So our final self-attention mechanism, the one that generates task-specific contextual embeddings, can be written mathematically as follows:
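In symbols, with Q, K and V denoting the stacked query, key and value vectors, the unscaled form is:

Attention(Q, K, V) = softmax(QKᵀ) V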

But wait: if you read the paper Attention Is All You Need (in which the self-attention mechanism was proposed), the dot product between the Q and K vectors is scaled down by √(dₖ), and the final formula looks like this:
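Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

This is the scaled dot-product attention exactly as written in the paper.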

Let us understand what √(dₖ) is. dₖ is simply the dimensionality of the key vector. It is important to note here that the dimensionalities of the query, key and value vectors can be different, because they are generated by multiplying (taking the dot product of) the embedding vector of a word with the query matrix (W_q), the key matrix (W_k) and…
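To make the scaling concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function name, variable names and dimensions are illustrative choices of mine, not taken from the original post; the only thing the sketch commits to is the formula above.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(QK^T / sqrt(d_k)) V for a single sequence."""
    d_k = K.shape[-1]                        # dimensionality of the key vectors
    scores = Q @ K.T / np.sqrt(d_k)          # scaled dot products, shape (n_q, n_k)
    # numerically stable softmax over the key dimension
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                       # weighted sum of the value vectors

# Illustrative shapes: 4 tokens, d_k = 64, d_v = 32
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))
K = rng.normal(size=(4, 64))
V = rng.normal(size=(4, 32))
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 32)
```

Dropping the division by np.sqrt(d_k) in the sketch gives the unscaled version shown earlier; the rest of this post is about why that division matters.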
