Understanding scaled-dot product attention and multi-head attention
Hi, This post is a summary of my implementation of the scaled-dot product attention and multi-head attention. Please have a look at the colab implementation for a step through guide. Even though this post is five years too late, the best way of reviving knowledge is to write about it. Transformers are in transforming the world via ChatGPT, Bart, or LLama. The core of the transformer architecture is the self-attention layer. There are many attention mechanisms (listed in this great post by Lilian Weng), but the scaled-dot product attention layer is used in general (Vaswani et al. 2017). For a visual explanation of the transformer, look at the great post from Jay Alammar. Please check Andrej Karpathy’s video for the full implementation of a transformer from scratch. ...