Welcome

  • Hi, this is Steffen. I enjoy building machine learning systems, including model development, followed by playful experimentation to get a better understanding 💥💥
  • In my previous life, I ran numerical simulations to understand the chemical interactions between water and our environment. 🌍🌊🌱

Understanding approximate nearest neighbor algorithms

Hi, This post is about the approximate nearest neighbor (ANN) algorithm. The code for this post is here, where I provide an example of using a framework and a Python implementation. Most of the Python implementations were written with the help of an LLM. I'm amazed at how helpful they are for learning new things. I see them as a drunken professor: with the right approach, a very helpful tool. As a next step in understanding RAGs, I want to take a closer look at approximate nearest neighbor algorithms. Basically, the purpose is to find the closest vector to a query vector in a database. Since I'm also interested in the implementation, I mostly follow this amazing blog post. Vector search is the basic component of vector databases and their main purpose. ANN algorithms look for a close match instead of an exact match. This loss of accuracy buys efficiency, which enables searching much bigger datasets, high-dimensional data, and real-time applications. ...
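
To make the trade-off concrete, here is a minimal sketch (my own toy code, not the post's implementation): an exact search that scans every vector, next to a crude approximate variant that hashes vectors into buckets with random hyperplanes and only scans the query's bucket.

    import numpy as np

    rng = np.random.default_rng(42)
    db = rng.normal(size=(10_000, 64))   # toy "vector database"
    query = rng.normal(size=64)

    # Exact nearest neighbor: compare the query against every vector.
    def exact_nn(db, query):
        dists = np.linalg.norm(db - query, axis=1)
        return int(np.argmin(dists))

    # Approximate variant: hash vectors into buckets via random hyperplanes
    # (a minimal locality-sensitive-hashing sketch) and search only the
    # query's bucket, trading accuracy for speed.
    def lsh_codes(vectors, planes):
        return (vectors @ planes.T > 0) @ (1 << np.arange(planes.shape[0]))

    planes = rng.normal(size=(8, 64))           # 8 hyperplanes -> 256 buckets
    codes = lsh_codes(db, planes)
    q_code = lsh_codes(query[None, :], planes)[0]
    candidates = np.where(codes == q_code)[0]   # ~1/256 of the database

    approx = candidates[exact_nn(db[candidates], query)] if len(candidates) else exact_nn(db, query)
    print(exact_nn(db, query), approx)

The approximate answer can differ from the exact one; the point is that only a small fraction of the database is scanned.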

April 19, 2025 · 6 min

Short example of Information Retrieval

Hi, Some time ago, I did a small project on information retrieval. I think it's a good idea to share it with all its shortcomings. Here is the code. Sadly, the LLM part doesn't work with the quantized model, so I commented it out. The project is a small information-retrieval task over a FAQ, where I want to map the correct answer to a question. In my example, it's a 1:1 mapping between question and answer, but it also works with multiple answers. ...
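
The project's code uses an embedding model; as a stand-in, here is a minimal FAQ-retrieval sketch with TF-IDF and cosine similarity (my own toy example, with made-up questions and answers).

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy FAQ with a 1:1 question-to-answer mapping.
    faq = {
        "How do I reset my password?": "Use the 'Forgot password' link on the login page.",
        "Where can I download the app?": "The app is available in the iOS and Android stores.",
    }
    questions = list(faq)

    vectorizer = TfidfVectorizer()
    q_matrix = vectorizer.fit_transform(questions)

    def answer(query):
        # Retrieve the stored question closest to the user query.
        sims = cosine_similarity(vectorizer.transform([query]), q_matrix)
        return faq[questions[sims.argmax()]]

    print(answer("password reset"))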

March 10, 2025 · 2 min

Get embeddings for multiple data sources

Hi, Following my first short post about RAGs, I would like to provide a brief overview about embeddings, which are used to find similiar objects in a vector database. To better understand how various transformer models handle different input data types, I created this notebook. I explore therefor, text, image, audio and video data. I鈥檝e chosen to skip the more traditional text embeddings (TF-IDF, Word2Vec or GloVe), because there are already very good tutorials available. Additionally, I plan to discuss the training of embedding models in a separate blog post. For this post, I use mostly pretrained classification models, where I use the last layer before the prediction head as embedding. ...
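
As an illustration of that last idea, here is a minimal sketch (my own, not from the notebook) that turns a pretrained torchvision classifier into an image-embedding model by replacing its prediction head with an identity layer.

    import torch
    from torchvision import models

    # Load a pretrained classifier and drop its prediction head, so the
    # forward pass returns the penultimate-layer features as an embedding.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = torch.nn.Identity()
    model.eval()

    with torch.no_grad():
        image_batch = torch.randn(1, 3, 224, 224)  # stand-in for a real image
        embedding = model(image_batch)             # shape: (1, 512)
    print(embedding.shape)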

January 2, 2025 · 1 min

Overview of RAG (Retrieval-Augmented Generation) systems

Hi, It's been a while since my last post, mostly because of my own laziness. Over the past year, I've been working on several projects, one of which is a small RAG (Retrieval-Augmented Generation) system. I implemented it to combine external knowledge (in this case, internal safety documents) with a large language model (LLM). This approach allows the use of data that the LLM wasn't trained on and also helps reduce hallucinations. ...
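
As a toy illustration of the idea (not the project's code, with made-up documents and placeholder embeddings), the retrieval step picks the most relevant document and places it into the prompt as context:

    import numpy as np

    doc_texts = ["Wear gloves when handling solvent X.", "Store acid Y below 25 degrees C."]
    doc_vecs = np.random.rand(2, 384)   # stand-ins for real document embeddings
    query_vec = np.random.rand(384)     # stand-in for the embedded question

    # Cosine similarity between the query and each document.
    scores = doc_vecs @ query_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
    context = doc_texts[int(scores.argmax())]

    # The retrieved context is prepended to the question for the LLM.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: How do I handle solvent X?"
    print(prompt)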

December 27, 2024 · 4 min

Deep Learning model explainability

Hi, In my first post, I looked into the explainability of classical machine learning models. As a next step, I'm interested in the explainability of neural networks. Model explainability is easy for simple models (linear regression, decision trees), and some tools exist for more complex algorithms (ensemble trees). For a deeper theoretical understanding, I highly recommend the book Interpretable Machine Learning by Christoph Molnar. All the different approaches to model explainability are shown with a PyTorch model in this kaggle notebook. ...
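
As one example of such a tool (my own sketch, not necessarily what the notebook uses), Captum's integrated gradients attribute a PyTorch model's prediction to its input features:

    import torch
    from captum.attr import IntegratedGradients

    # A toy classifier standing in for a real network.
    model = torch.nn.Sequential(
        torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2)
    )
    model.eval()

    ig = IntegratedGradients(model)
    x = torch.randn(1, 4)
    # Contribution of each input feature to the score of class 1.
    attributions = ig.attribute(x, target=1)
    print(attributions)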

December 8, 2023 · 7 min

Model explainability

Hi, Some months have passed since my last post. Model explainability is easy for simple models (linear regression, decision trees), and some tools exist for more complex algorithms (ensemble trees). With this post, I want to dig into the tools for interpreting more complex models. For a deeper theoretical understanding, I highly recommend the book Interpretable Machine Learning by Christoph Molnar. All the different approaches to model explainability are shown with a RandomForest model in this kaggle notebook. ...
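
As one example of such a tool (my own sketch, not necessarily the notebook's code), SHAP's TreeExplainer assigns each feature a contribution to a RandomForest prediction:

    import shap
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    # Fit a RandomForest on a toy dataset.
    X, y = load_breast_cancer(return_X_y=True)
    model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

    # SHAP values: per-feature contributions to each prediction.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X[:5])
    print(shap_values[0].shape)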

November 21, 2023 · 5 min

Implementing a Transformer Network from scratch

Hi, This post is about my implementation of an encoder transformer network from scratch, as a follow-up to understanding the attention layer, together with the colab implementation. I use a simplified dataset, so I don't expect great results. My approach is to build something from scratch to understand it in depth. I faced many challenges during the implementation, so I aligned my code with BertForSequenceClassification from huggingface. My biggest challenge was getting the network to train at all. It took me several months of low-focus work and a proper deconstruction and reconstruction of the architecture. Minor issues were missing skip connections and some data-preparation problems. ...
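
To show where those skip connections live, here is a minimal encoder block in PyTorch (my own illustration, not the post's code):

    import torch
    import torch.nn as nn

    class EncoderBlock(nn.Module):
        def __init__(self, dim=64, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.ff = nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
            )
            self.norm2 = nn.LayerNorm(dim)

        def forward(self, x):
            attn_out, _ = self.attn(x, x, x)
            x = self.norm1(x + attn_out)    # skip connection around attention
            x = self.norm2(x + self.ff(x))  # skip connection around feed-forward
            return x

    block = EncoderBlock()
    print(block(torch.randn(2, 10, 64)).shape)  # (2, 10, 64)

Dropping either `x +` term still produces a network that runs, but it trains far worse, which makes this exact bug easy to miss.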

August 28, 2023 · 6 min

Learning about time-series analysis

Hi, Recently, I had to work on a simple time-series analysis. I performed poorly, since I had never worked with time series before. I believe in a deterministic world, and in general, I prefer to find the causality behind a specific data behavior before resorting to simple empirical modeling. However, I understand the need for time-series analysis when not enough data is available, the underlying processes are not understood, the complexity is not bearable, or there is no time or need for a proper understanding of the process. The goal is to make a prediction based on previously observed data. In the traditional sense (ARIMA), you look at trend, seasonality, and cycles; in the more modern way, you throw the data into a model architecture (deep learning). In this context, I should mention the famous paper Statistical Modeling: The Two Cultures; I prefer to use algorithmic models and treat the data mechanism as unknown. I would add that the underlying data mechanism is deterministic, and we should use the collected data to improve our models. Anyway, let's use the many resources in the time-series field to get better at it. ...
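
For the traditional route, here is a minimal ARIMA sketch with statsmodels (my own toy data, not the analysis from the post):

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    # Synthetic series with a linear trend, yearly-style seasonality, and noise.
    rng = np.random.default_rng(0)
    trend = np.linspace(0, 10, 200)
    season = 2 * np.sin(np.arange(200) * 2 * np.pi / 12)
    series = trend + season + rng.normal(scale=0.5, size=200)

    # order = (AR terms, differencing, MA terms)
    model = ARIMA(series, order=(2, 1, 2))
    fit = model.fit()
    print(fit.forecast(steps=5))  # predict the next five observations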

August 15, 2023 · 3 min

Endpoint validation

Hi, In my previous job, I spent hours debugging internal data transformations, only to figure out that the data received from an external API was faulty. This issue would not have appeared with schema validation. My fault was that I trusted the incoming data and didn't check it for consistency. To learn from my mistakes and save time, I set up a small example of JSON validation via Pydantic. FastAPI relies heavily on Pydantic, and I use it for validating the incoming request and the outgoing response. However, FastAPI is not used in every project. ...
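
Here is a minimal sketch of such a validation (my own example with a made-up schema): the faulty field is caught at the boundary instead of deep inside a transformation.

    from pydantic import BaseModel, ValidationError

    # Schema for the data we expect from the external API (hypothetical).
    class Measurement(BaseModel):
        sensor_id: str
        value: float
        unit: str

    # A faulty payload, as an external API might actually send it.
    payload = {"sensor_id": "A1", "value": "not-a-number", "unit": "mg/l"}

    try:
        Measurement(**payload)
    except ValidationError as e:
        print(e)  # pinpoints the faulty field immediately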

August 7, 2023 · 3 min

Fast data transfer to or from S3

Hi, This post is an homage to a Stack Overflow post about copying data from S3. This shared work saved me a lot of time, and I believe that individuals who share their work do not receive sufficient recognition. The problem is that I have multiple GB of data split across thousands of files. These files are selected for download by the semi-automated pipeline for model training, so the number of files to download varies from pipeline run to pipeline run. This also makes any up-front data preparation obsolete. The solution from the official boto3 documentation for copying data from S3 takes too long; even with asynchronous execution, downloading these files takes a few hours. Imagine a scenario where you want to fine-tune a deep learning model on a machine with multiple GPUs, but you have to wait several hours for the data to be copied 😱. Preprocessing steps are not feasible since the data is filtered upon request. Additionally, downloading the data via the AWS CLI is not an option, as there is much more data in the S3 buckets than is requested for model training. The simplest approach is to increase the throughput. And here is the beauty, copied directly from Pierre D: ...
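
The excerpt cuts off before Pierre D's snippet, so I won't reproduce it here; the following is only a minimal sketch of the same idea, parallelizing across files with a thread pool and within each file via boto3's TransferConfig (bucket name and keys are placeholders).

    import boto3
    from boto3.s3.transfer import TransferConfig
    from concurrent.futures import ThreadPoolExecutor

    s3 = boto3.client("s3")
    config = TransferConfig(max_concurrency=20)  # parallel chunks per file

    def download(key):
        # Hypothetical bucket and local layout.
        s3.download_file("my-bucket", key, key.split("/")[-1], Config=config)

    keys = ["data/part-0001.parquet", "data/part-0002.parquet"]  # placeholder list
    with ThreadPoolExecutor(max_workers=32) as pool:
        list(pool.map(download, keys))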

April 27, 2023 · 3 min