Data Science

Deep Learning model explainability

Hi, In my first post, I looked into the explainability of classical machine learning models. As a next step, I’m interested in the explainability of neural networks. Model explainability is easy for simple models (linear regression, decision trees), and some tools exist for more complex algorithms (ensemble trees). For a deeper theoretical understanding, I highly recommend the book Interpretable Machine Learning by Christoph Molnar. All the different approaches to model explainability are shown with a PyTorch model in this Kaggle notebook. ...

Model explainability

Hi, Some months have passed since my last post. Model explainability is easy for simple models (linear regression, decision trees), and some tools exist for more complex algorithms (ensemble trees). With this post, I want to dig into the tools for interpreting more complex models. For a deeper theoretical understanding, I highly recommend the book Interpretable Machine Learning by Christoph Molnar. All the different approaches to model explainability are shown with a RandomForest model in this Kaggle notebook. ...

Cookie-cutter Problems

Hi, Recently, I started to put some scripts together and run them against a Kaggle dataset. I decided to train my skills on an unseen dataset. Training keeps me sharp, and I need it to complement my skill set. For the last 2.5 years, I struggled with NLP problems in a small team, where I worked mostly on engineering tasks. My understanding of this area is not where I want it to be. On top of that, I follow this natural human process called forgetting things. For example, I definitely can’t write all the relevant stoichiometric formulas of chemolithotrophic denitrification from memory. This was very important during my Ph.D. When did I start to forget relevant information? It’s funny to think back. 🤷 ...

Stratified multi-label split

Hi, This post is a short overview of a stratified multi-label train-test split. Please look at the Colab implementation for a step-by-step guide. Sometimes you run into problems at work that justify a small post. I have already seen colleagues struggle to balance the train-test split for multi-label classification. In classification problems, we often have a dataset with imbalanced classes. In general, we want the train and test sets to keep the proportion of each label observed in the original dataset. This stratified train-test split works well for single-label classification problems. For multi-label classification, it is unclear how stratified sampling should be performed. Therefore, Sechidis et al. 2011 and Szymanski and Kajdanowicz 2017 developed an algorithm that provides balanced datasets for multi-label classification. The documentation of their algorithm can be found in the scikit-multilearn package and on GitHub. ...
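
To make the idea more concrete, here is a minimal pure-Python sketch of a greedy split in the spirit of iterative stratification. It is a simplified illustration, not the scikit-multilearn implementation and not the code from the Colab notebook. The function name `iterative_split`, its parameters, and the representation of labels as one set per sample are assumptions made for this example.

```python
import random
from collections import defaultdict


def iterative_split(labels, test_size=0.25, seed=0):
    """Greedy multi-label split (hypothetical helper, simplified sketch).

    labels: list with one set of label names per sample.
    Returns (train_indices, test_indices).
    """
    rng = random.Random(seed)
    ratios = [1.0 - test_size, test_size]  # fold 0 = train, fold 1 = test
    all_labels = set().union(*labels)

    # How many samples each fold still "wants", in total and per label.
    want_total = [len(labels) * r for r in ratios]
    want_label = [
        {lab: sum(lab in s for s in labels) * r for lab in all_labels}
        for r in ratios
    ]

    remaining = set(range(len(labels)))
    folds = [[], []]

    while remaining:
        # Count how often each label occurs among the unassigned samples.
        counts = defaultdict(int)
        for i in remaining:
            for lab in labels[i]:
                counts[lab] += 1

        if not counts:
            # Only unlabeled samples are left: fill by total demand.
            for i in sorted(remaining):
                k = max((0, 1), key=lambda f: want_total[f])
                folds[k].append(i)
                want_total[k] -= 1
            break

        # Handle the rarest label first, it is the hardest to balance.
        rare = min(counts, key=lambda lab: (counts[lab], lab))
        for i in sorted(j for j in remaining if rare in labels[j]):
            # Pick the fold that needs this label most; break ties by
            # total demand, then randomly.
            k = max(
                (0, 1),
                key=lambda f: (want_label[f][rare], want_total[f], rng.random()),
            )
            folds[k].append(i)
            remaining.discard(i)
            want_total[k] -= 1
            for lab in labels[i]:
                want_label[k][lab] -= 1

    return sorted(folds[0]), sorted(folds[1])
```

Called as `train_idx, test_idx = iterative_split(labels, test_size=0.25)`, the sketch assigns samples label by label, starting with the rarest one, so rare labels are distributed before the common ones use up the test set. It balances label proportions first and the overall split size second, so the test set can end up slightly larger or smaller than `test_size`. For real work, the function `iterative_train_test_split` from `skmultilearn.model_selection` is the better choice.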