Hi,
This post is a short overview of a stratified multi-label train-test split. Please look at the Colab implementation for a step-by-step guide.
Sometimes you run into problems at work that justify a small post. I have seen colleagues struggle to balance the train-test split for multi-label classification. In classification problems, the classes in a dataset are often imbalanced. In general, we want the train and test sets to keep the label proportions observed in the original dataset. A stratified train-test split achieves this for single-label classification problems. For multi-label classification, it is less clear how stratified sampling should be performed, because each example can carry several labels at once. Therefore, Sechidis et al. 2011 and Szymanski and Kajdanowicz 2017 developed an algorithm that provides balanced datasets for multi-label classification. The documentation of their algorithm can be found in the scikit-multilearn package and on GitHub.
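To make the single-label case concrete, here is a minimal pure-Python sketch of a stratified split: group the examples by class and split each group with the same ratio. The function name `stratified_split` and the toy labels are made up for illustration and are not taken from any library.

```python
import random
from collections import Counter, defaultdict


def stratified_split(y, test_fraction=0.5, seed=0):
    """Split indices so that every class keeps roughly the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


# Toy, imbalanced single-label data: 90 examples of "a", 10 of "b".
y = ["a"] * 90 + ["b"] * 10
train, test = stratified_split(y, test_fraction=0.2)
print(Counter(y[i] for i in train))
print(Counter(y[i] for i in test))
```

This only works because every example belongs to exactly one class. With several labels per example there is no single group to put an example into, which is the gap the multi-label algorithm fills.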
Stratified multi-label split
Please have a look at the Colab notebook. For the rest of the post, I follow the package tutorial, which is based on the multi-label dataset of Boutell et al. 2004. The dataset comes with an initial split into train and test sets.
The original train and test sets contain 1211 and 1196 data points. The label distribution looks like this:
n label | (0, 0) | (0, 3) | (0, 4) | (0, 5) | (1, 1) | (2, 2) | (2, 3) | (2, 4) | (3, 3) | (3, 4) | (3, 5) | (4, 4) | (4, 5) | (5, 5) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
train | 227 | 0 | 21 | 12 | 165 | 197 | 8 | 6 | 196 | 27 | 1 | 277 | 1 | 224 |
test | 200 | 1 | 17 | 7 | 199 | 200 | 16 | 8 | 237 | 49 | 5 | 256 | 0 | 207 |
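The columns of this table are label pairs. My reading, assuming the table was built as in the package tutorial, is that (i, i) counts the examples carrying label i and (i, j) counts the examples carrying both label i and label j. A small sketch of how such counts can be computed from a binary label matrix follows; `pair_counts` and the toy matrix are hypothetical names and data, not part of scikit-multilearn.

```python
from collections import Counter
from itertools import combinations_with_replacement


def pair_counts(label_rows):
    """Count label pairs (order 2, with repetition) over all examples."""
    counts = Counter()
    for row in label_rows:
        # Indices of the labels that are set for this example.
        active = [j for j, value in enumerate(row) if value]
        counts.update(combinations_with_replacement(active, 2))
    return counts


# Toy binary label matrix: 4 examples, 3 labels.
toy_labels = [
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
counts = pair_counts(toy_labels)
for pair in sorted(counts):
    print(pair, counts[pair])
```

Running the same count on the train and test label matrices gives one row of the table each.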
Some of the labels are clearly imbalanced between train and test. Label 1 (1, 1) and label 3 (3, 3) are especially skewed, with 165 vs. 199 and 196 vs. 237 examples. For comparison, let’s look at how the iterative train-test split improves this.
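In scikit-multilearn the split is done by `iterative_train_test_split` from `skmultilearn.model_selection`. To give a feeling for the idea behind it, here is a simplified pure-Python sketch in the spirit of Sechidis et al. 2011: repeatedly take the label with the fewest remaining examples and hand each of its examples to the subset that still needs that label most. This is not the package's implementation; `iterative_split`, its parameters, and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def iterative_split(labels, ratios=(0.5, 0.5), seed=0):
    """Simplified iterative stratification (illustrative sketch only).

    labels: list of sets, the label ids of each example.
    Returns one list of example indices per subset.
    """
    rng = random.Random(seed)
    n = len(ratios)
    subsets = [[] for _ in range(n)]
    remaining = set(range(len(labels)))
    label_ids = sorted(set().union(*labels))
    # How many examples each subset still "wants", overall and per label.
    want_total = [len(labels) * r for r in ratios]
    want_label = {
        l: [sum(l in s for s in labels) * r for r in ratios] for l in label_ids
    }

    def place(i, label=None):
        # Pick the subset that still needs this label most; break ties by
        # overall need, then at random.
        def need(j):
            per_label = want_label[label][j] if label is not None else 0
            return (per_label, want_total[j])

        best = max(need(j) for j in range(n))
        j = rng.choice([j for j in range(n) if need(j) == best])
        subsets[j].append(i)
        remaining.discard(i)
        want_total[j] -= 1
        for l in labels[i]:
            want_label[l][j] -= 1

    while True:
        # Remaining examples per label, keeping only labels still present.
        pools = {l: [i for i in sorted(remaining) if l in labels[i]] for l in label_ids}
        pools = {l: idx for l, idx in pools.items() if idx}
        if not pools:
            break
        rarest = min(pools, key=lambda l: (len(pools[l]), l))
        for i in pools[rarest]:
            place(i, rarest)
    for i in sorted(remaining):  # examples without any label
        place(i)
    return subsets


# Toy multi-label data: label 2 is rare, labels 0 and 1 sometimes co-occur.
toy = [{0}] * 10 + [{1}] * 6 + [{0, 1}] * 4 + [{2}] * 2
train_idx, test_idx = iterative_split(toy)
for name, idx in (("train", train_idx), ("test", test_idx)):
    print(name, len(idx), sorted(Counter(l for i in idx for l in toy[i]).items()))
```

Starting with the rarest label matters: the two examples of label 2 are placed first, one per subset, before the frequent labels can fill the subsets up.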
After the iterative split, the train and test sets contain 1204 and 1203 data points. The label distribution looks like this:
n label | (0, 0) | (0, 3) | (0, 4) | (0, 5) | (1, 1) | (2, 2) | (2, 3) | (2, 4) | (3, 3) | (3, 4) | (3, 5) | (4, 4) | (4, 5) | (5, 5) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
train | 214 | 0 | 19 | 9 | 182 | 198 | 12 | 7 | 217 | 38 | 3 | 266 | 1 | 215 |
test | 213 | 1 | 19 | 10 | 182 | 198 | 12 | 7 | 216 | 38 | 3 | 267 | 0 | 216 |
This split is clearly balanced, that is, stratified: the counts for every label combination differ by at most one between train and test.
Thank you for your attention.