Implementing a Transformer Network from scratch

Hi, this post is about my implementation of an encoder-only transformer network from scratch, as a follow-up to understanding the attention layer, together with the colab implementation. I use a simplified dataset, so I don’t expect great results; my approach is to build something from scratch in order to understand it in depth. I faced many challenges during the implementation, so I aligned my code with BertForSequenceClassification from Hugging Face. My biggest challenge was getting the network to train at all. That took several months of low focus and a thorough deconstruction and reconstruction of the architecture. Minor issues were missing skip connections and some data preparation problems. ...
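To illustrate the skip connections mentioned above, here is a minimal PyTorch sketch of a post-layer-norm encoder block. This is not the code from the post or the colab notebook; the class name and the default sizes (roughly BERT-base) are my own assumptions.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Minimal transformer encoder block; the two residual (skip) connections
    are the ones that are easy to forget and that stop the network from training.
    Defaults roughly follow BERT-base and are assumptions, not the post's values."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Skip connection 1: add the attention output back onto its input
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.drop(attn_out))
        # Skip connection 2: add the feed-forward output back onto its input
        x = self.norm2(x + self.drop(self.ff(x)))
        return x
```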

August 28, 2023 · 6 min

Understanding scaled dot-product attention and multi-head attention

Hi, this post is a summary of my implementation of scaled dot-product attention and multi-head attention. Please have a look at the colab implementation for a step-by-step guide. Even though this post is five years too late, the best way to revive knowledge is to write about it. Transformers are transforming the world via ChatGPT, Bard, or LLaMA. The core of the transformer architecture is the self-attention layer. There are many attention mechanisms (listed in this great post by Lilian Weng), but scaled dot-product attention is the one used in practice (Vaswani et al. 2017). For a visual explanation of the transformer, look at the great post by Jay Alammar. Please check Andrej Karpathy’s video for a full implementation of a transformer from scratch. ...
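As a quick reference for the mechanism the post covers, here is a minimal PyTorch sketch of scaled dot-product attention, softmax(QKᵀ/√d_k)V from Vaswani et al. 2017. It is not the code from the colab notebook; the function name and the assumed tensor shapes are my own.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v assumed shape: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    # Similarity scores, scaled by sqrt(d_k) to keep the softmax well-behaved
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)  # attention weights, each row sums to 1
    return weights @ v, weights
```

Multi-head attention simply runs this in parallel over several projected copies of Q, K, and V and concatenates the results.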

March 18, 2023 · 3 min