Training a language model from scratch

Hi, this post is a short overview of a work project where I trained a language model for invoices. This so-called base model is then fine-tuned for text classification on customer data. Due to data privacy, a non-disclosure agreement, ISO 27001, and SOC 2, I’m not allowed to publish any results. Believe me, it works like 🚀✨🪐. A language model is trained on large amounts of textual data to understand the patterns and structure of language. The primary goal of a language model is to predict the probability of the next word or sequence of words in a sentence given the previous words. ...
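To make the objective concrete, here is a minimal sketch of next-token prediction with Hugging Face transformers. It is not the invoice model from the post (data and checkpoints are confidential); the checkpoint name and example text are placeholders.

```python
# Minimal sketch of next-token prediction; "gpt2" and the example text
# are placeholders, not the confidential invoice model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Invoice number 2023-001, total amount due:"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (batch, seq_len, vocab_size)

# Probability distribution over the vocabulary for the next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()))
```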

April 15, 2023 · 14 min

Cookie-cutter Problems

Hi, recently I started to put some scripts together and run them against a Kaggle dataset. I decided to train my skills on an unseen dataset. Training keeps me sharp, and I need it to complement my skill set. For the last 2.5 years, I wrestled with NLP problems in a small team, where I worked mostly on engineering tasks. My understanding in this area is not where I want it to be. And on top of that, I follow the natural human process of forgetting things. For example, I definitely can’t write all relevant stoichiometric formulas of chemolithotrophic denitrification from memory. This was very important during my Ph.D. When did I start to forget relevant information? It’s funny to think back. 🤷 ...

April 10, 2023 · 2 min

Stratified multi-label split

Hi, this post is a short overview of a stratified multi-label train-test split. Please look at the Colab implementation for a step-by-step guide. Sometimes you run into work problems that justify a small post. I have already seen colleagues struggling to balance the train-test split for multi-label classification. In classification problems, we often have a dataset with imbalanced classes. In general, it is desirable to keep the proportions of each label in the train and test sets as observed in the original dataset. This stratified train-test split works well for single-label classification problems. For multi-label classification, it is unclear how stratified sampling should be performed. Therefore, Sechidis et al. 2011 and Szymanski and Kajdanowicz 2017 developed an algorithm to provide balanced datasets for multi-label classification. The documentation of their algorithm can be found in the scikit-multilearn package and on GitHub. ...
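As a quick illustration, here is a minimal sketch of an iterative stratified split with scikit-multilearn. The synthetic feature and label matrices are stand-ins; the full walkthrough is in the Colab notebook.

```python
# Minimal sketch of an iterative stratified multi-label split;
# the random data below is only a stand-in for a real dataset.
import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))           # 100 samples, 5 features
y = rng.integers(0, 2, size=(100, 4))   # 4 binary labels per sample

X_train, y_train, X_test, y_test = iterative_train_test_split(
    X, y, test_size=0.2
)

# Label proportions should be roughly preserved in both splits.
print(y_train.mean(axis=0), y_test.mean(axis=0))
```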

April 8, 2023 · 3 min

Understanding scaled dot-product attention and multi-head attention

Hi, this post is a summary of my implementation of scaled dot-product attention and multi-head attention. Please have a look at the Colab implementation for a step-by-step guide. Even though this post is five years too late, the best way of reviving knowledge is to write about it. Transformers are transforming the world via ChatGPT, Bard, or LLaMA. The core of the transformer architecture is the self-attention layer. There are many attention mechanisms (listed in this great post by Lilian Weng), but scaled dot-product attention is the one generally used (Vaswani et al. 2017). For a visual explanation of the transformer, look at the great post by Jay Alammar. Please check Andrej Karpathy’s video for a full implementation of a transformer from scratch. ...
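For reference, here is a minimal PyTorch sketch of scaled dot-product attention, softmax(QKᵀ/√d_k)V as in Vaswani et al. 2017. It is not the exact code from the Colab notebook; shapes and names are illustrative.

```python
# Minimal sketch of scaled dot-product attention (Vaswani et al. 2017);
# shapes and variable names are illustrative only.
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                  # attention weights
    return weights @ v                                   # (batch, seq, d_k)

q = k = v = torch.randn(2, 4, 8)   # self-attention: queries = keys = values
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4, 8])
```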

March 18, 2023 · 3 min

The importance of building things by yourself

Hi, in this initial post, I want to outline the development of my FastAPI skeleton. At the beginning of my career as a Data Scientist, I ran into the typical problem of deploying models to production. In a team of two scientists, I had the chance to write a micro-service with Flask from scratch out of necessity. My first service closely followed the example of Miguel Grinberg’s great tutorial. The reason was simple: I couldn’t write proper code at that time. Despite my lack of experience, and with the great help of my co-workers, I was able to write a production-ready micro-service in a few weeks with the following features: ...
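To give an idea of the micro-service shape, here is a minimal FastAPI sketch of a prediction endpoint. It is not the actual skeleton described in the post; the route, schemas, and dummy response are illustrative.

```python
# Minimal FastAPI prediction endpoint; route and schema names are
# illustrative, not the actual skeleton from the post.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="model-service")

class PredictRequest(BaseModel):
    text: str

class PredictResponse(BaseModel):
    label: str
    score: float

@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest) -> PredictResponse:
    # Placeholder for the real model call.
    return PredictResponse(label="dummy", score=0.5)

# Run locally with: uvicorn main:app --reload
```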

March 5, 2023 · 2 min