šŸ” Demystifying Scaling in Self-Attention: Why Divide by āˆšdā‚–? | by Navdeep Sharma | Feb, 2025


Greetings,

In the previous blog, Decoding Self-Attention: A Deep Dive into Contextual Word Embeddings, I discussed the self-attention mechanism in depth and tried to derive it from first principles. If you haven't read it yet, I strongly recommend going through it once, as it will make this blog much easier to follow. If you already know the self-attention mechanism, you can continue reading this blog.

So our final self-attention mechanism, the one that generates task-specific contextual embeddings, can be written mathematically as follows:
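In symbols, with Q, K and V denoting the stacked query, key and value vectors, the unscaled form is:

Attention(Q, K, V) = softmax(QKᵀ) V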

But wait: if you read the paper Attention Is All You Need (in which the self-attention mechanism was proposed), the dot product between the Q and K vectors is scaled down by √(dₖ), and the final formula looks like this:
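Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

This is the scaled dot-product attention exactly as written in the paper.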

Let us understand what √(dₖ) is. dₖ is simply the dimensionality of the key vector. It is important to note here that the dimensionalities of the query, key and value vectors can be different, because they are generated by multiplying (taking the dot product of) the embedding vector of a word with the query matrix (W_q), the key matrix (W_k) and…
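To make the scaling concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function name, variable names and dimensions are illustrative choices of mine, not taken from the original post; the only thing the sketch commits to is the formula above.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(QK^T / sqrt(d_k)) V for a single sequence."""
    d_k = K.shape[-1]                        # dimensionality of the key vectors
    scores = Q @ K.T / np.sqrt(d_k)          # scaled dot products, shape (n_q, n_k)
    # numerically stable softmax over the key dimension
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                       # weighted sum of the value vectors

# Illustrative shapes: 4 tokens, d_k = 64, d_v = 32
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))
K = rng.normal(size=(4, 64))
V = rng.normal(size=(4, 32))
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 32)
```

Dropping the division by np.sqrt(d_k) in the sketch gives the unscaled version shown earlier; the rest of this post is about why that division matters.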
