Hi,

This post is a summary of my implementation of scaled dot-product attention and multi-head attention. Please have a look at the Colab implementation for a step-by-step guide.

Even though this post is five years too late, the best way to revive knowledge is to write about it. Transformers are transforming the world via ChatGPT, Bart, or LLama. The core of the transformer architecture is the self-attention layer. There are many attention mechanisms (listed in this great post by Lilian Weng), but scaled dot-product attention is the one generally used (Vaswani et al. 2017). For a visual explanation of the transformer, look at the great post by Jay Alammar. Please check Andrej Karpathy’s video for a full implementation of a transformer from scratch.

The formula is: $$ \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

with:

  • Q: query
  • K: key
  • V: value
  • d_k: dimensionality of the key vectors

In its simplicity, this equation has started to transform a whole industry. Basically, it consists of two matrix multiplications and one softmax normalization.
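To make this concrete, here is a minimal sketch of the equation applied to a random tensor (this snippet is not part of the original Colab; the batch size, sequence length, and embedding dimension are arbitrary choices just to make the shapes explicit):

import torch
import torch.nn.functional as F

# Arbitrary illustrative sizes: batch B, sequence length T, embedding dim C
B, T, C = 2, 4, 8
x = torch.randn(B, T, C)

# In self-attention, query, key, and value all come from the same input
Q, K, V = x, x, x

# First matrix multiplication: similarity of every query with every key -> (B, T, T)
scores = Q @ K.transpose(-2, -1) / (C ** 0.5)

# Softmax turns each row of similarities into weights that sum to 1
weights = F.softmax(scores, dim=-1)

# Second matrix multiplication: weighted sum of the values -> (B, T, C)
out = weights @ V
print(out.shape)  # torch.Size([2, 4, 8])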

Scaled dot-product attention

As you can see, the query, key, and value are basically the same tensor in this simplistic setup. The first matrix multiplication calculates the similarity between the query and the key. The mask is important for decoder blocks, where future positions must be hidden from the softmax. The final output is calculated via another matrix multiplication with the value:

import torch
import torch.nn.functional as F


def attention(x, mask=None):

    # In self-attention, query, key, and value are all derived from the same input
    query = x.clone()
    key = x.clone()
    value = x.clone()

    # d_k: dimensionality of the key vectors, used to scale the scores
    d_k = torch.Tensor([key.size(-1)])

    # Similarity between every query and every key, scaled by sqrt(d_k)
    score = query @ key.transpose(-2, -1) / torch.sqrt(d_k)

    # The additive mask hides positions, e.g. future tokens in a decoder block
    if mask is not None:
        score = score + mask

    # Softmax turns the scores into weights; the second matmul mixes the values
    score = F.softmax(score, dim=-1)
    out = score @ value

    return out
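
For a decoder block, the mask is typically a causal mask with -inf above the diagonal, so the softmax assigns zero weight to future positions. Here is a small usage sketch of the function above (the sizes are arbitrary and only for illustration):

B, T, C = 2, 4, 8
x = torch.randn(B, T, C)

# Causal mask: -inf above the diagonal, 0 elsewhere,
# so each position can only attend to itself and earlier positions.
mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)

out = attention(x, mask=mask)
print(out.shape)  # torch.Size([2, 4, 8])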

Please have a look at other implementations of the attention mechanism:

and at optimizations based on the attention mechanism:

Multi-head attention

The “Attention is all you need” paper proposed multi-head attention. This simple mechanism allows the model to learn different representations of the same input. Additionally, multi-head attention enables the attention weights to be calculated in parallel. Afterwards, the outputs of the individual heads are concatenated. Compared to the code block above, the attention head is now implemented with linear layers. As a small reminder, a linear layer without bias is just a matrix multiplication.

class SimplisticHead(torch.nn.Module):

    def __init__(self, c, head_size):

        super().__init__()
        # Project the c-dimensional input down to head_size (no bias: pure matmul)
        self.query = torch.nn.Linear(c, head_size, bias=False)
        self.key = torch.nn.Linear(c, head_size, bias=False)
        self.value = torch.nn.Linear(c, head_size, bias=False)
        self.d_k = torch.Tensor([head_size])

    def forward(self, x):

        # Query, key, and value are now learned projections of the same input
        query = self.query(x)
        key = self.key(x)
        value = self.value(x)

        # Scaled dot-product attention, as in the function above
        score = query @ key.transpose(-2, -1) / torch.sqrt(self.d_k)

        score = F.softmax(score, dim=-1)
        out = score @ value

        return out

With this multi-head implementation, we don’t split the incoming vector; instead, we limit each head’s linear layers to the head size. Afterward, the head outputs are concatenated along the last dimension.

class SimplisticMultiHead(torch.nn.Module):

    def __init__(self, c, num_heads, head_size):
        super().__init__()
        # num_heads independent attention heads, each attending to the full input
        self.heads = torch.nn.ModuleList(
            [SimplisticHead(c, head_size) for _ in range(num_heads)]
        )

    def forward(self, x):

        # Run the heads (conceptually in parallel) and concatenate their outputs
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return out
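
As a quick usage sketch (not from the original post; the sizes are arbitrary and chosen so that num_heads * head_size equals the embedding dimension, which keeps the concatenated output the same width as the input):

import torch

B, T, C = 2, 4, 8            # arbitrary batch, sequence length, embedding size
num_heads, head_size = 4, 2  # 4 heads of size 2 concatenate back to 8

mha = SimplisticMultiHead(C, num_heads, head_size)
x = torch.randn(B, T, C)

out = mha(x)
print(out.shape)             # torch.Size([2, 4, 8])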

That is the overview of multi-head attention. Please have a look at the Colab implementation, where every step is explained in detail.

Thank you for your attention.