A metric that works well with highly imbalanced datasets
Found to be a challenging dataset for classification algorithms
It is an 11-dimensional dataset with 25K samples for training and over 1M samples for testing. Each instance is a five-card poker hand, described by two features per card (suit and rank) plus the poker-hand label.
It has two properties that make it particularly challenging for classification algorithms: all of its features are categorical, and it is extremely imbalanced. Categorical features are hard because typical distance (a.k.a. similarity) metrics cannot be naturally applied to them. For example, each card in this dataset has two features, rank and suit, and calculating the Euclidean distance between “spades” and “hearts” simply doesn’t make sense. Imbalanced datasets are hard because machine learning algorithms tend to assume a good class balance; Jason Brownlee from Machine Learning Mastery describes the problem as:
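To make the distance problem concrete, here is a tiny NumPy sketch. The UCI files encode suits as small integers, so treating those codes as coordinates implies an ordering and distances that simply don’t exist:

```python
import numpy as np

# The UCI files encode suits as integers, roughly {1: hearts, 2: spades,
# 3: diamonds, 4: clubs}. Treating these codes as coordinates implies an
# ordering and magnitudes that have no real meaning:
hearts, spades, clubs = np.array([1.0]), np.array([2.0]), np.array([4.0])

print(np.linalg.norm(hearts - spades))  # 1.0
print(np.linalg.norm(hearts - clubs))   # 3.0, but clubs is not "3x farther" from hearts
```

Any permutation of the integer codes would produce different distances, which shows the metric carries no information about the suits themselves.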
Imbalanced classifications pose a challenge for predictive modeling as most of the machine learning algorithms used for classification were designed around the assumption of an equal number of examples for each class
So, if this dataset is supposedly hard, why can a simple neural network achieve over 90% accuracy without any particular tuning or data pre-processing?
In this article, Aditya Bhardwaj shows that his neural network achieved 90.04% accuracy. Below, I show that a simple Multi-layer Perceptron neural network achieves over 99% accuracy. One reason these numbers come out so high is the Class Imbalance Problem. [Ling et al., 2011] explain:
Data are said to suffer the Class Imbalance Problem when the class distributions are highly imbalanced. In this context, many classification learning algorithms have low predictive accuracy for the infrequent class.
The Poker-hand dataset happens to be extremely imbalanced, with the first two classes representing 90% of the samples in both the training and the testing set. A classifier that learns to classify these two classes correctly, but completely misclassifies the remaining classes, will still achieve 90% accuracy. This is not a good classifier! The reason it still receives a good score is simply that the class imbalance is taken into account, i.e. the correct predictions of the dominant classes are given weight proportional to the number of samples. The “low predictive accuracy for the infrequent class” is overshadowed by the better predictions from the classes with lots of samples to learn from.
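This accuracy trap is easy to reproduce. The sketch below uses a hypothetical label distribution that mimics the Poker-hand imbalance (not the real data) and scores a “classifier” that only ever recognizes the two dominant classes:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical label distribution mimicking the Poker-hand imbalance:
# two dominant classes cover ~92% of the samples, 8 rare classes the rest.
rng = np.random.default_rng(0)
y_true = rng.choice(10, size=100_000, p=[0.50, 0.42] + [0.01] * 8)

# A "classifier" that only ever tells the two dominant classes apart and
# dumps every rare-class sample into class 0.
y_pred = np.where(y_true == 1, 1, 0)

print(accuracy_score(y_true, y_pred))  # ~0.92 despite ignoring 8 of 10 classes
```

Eight of the ten classes are never predicted correctly, yet the accuracy score barely notices.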
A metric that doesn’t take class imbalance into account, i.e. one that gives equal weight to all classes regardless of their dominance, can provide more “real” or accurate results. Scikit-learn’s Classification Report includes one such metric. The F1 score combines precision and recall, and the Classification Report includes an F1 macro-average metric, i.e. the unweighted average of the per-label F1 scores. As mentioned in Scikit-learn’s documentation about the F-metrics:
In problems where infrequent classes are nonetheless important, macro-averaging may be a means of highlighting their performance
This is a good example of a metric that can be used to measure the performance of the classifier against this highly imbalanced dataset.
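Continuing the hypothetical majority-only classifier from before, comparing scikit-learn’s weighted and macro F1 averages shows exactly what macro-averaging exposes:

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels mimicking the Poker-hand imbalance; the "classifier"
# only ever tells the two dominant classes apart.
rng = np.random.default_rng(0)
y_true = rng.choice(10, size=100_000, p=[0.50, 0.42] + [0.01] * 8)
y_pred = np.where(y_true == 1, 1, 0)

# Weighted F1 averages per-class F1 weighted by support; macro F1 gives
# every class equal weight, so the 8 ignored classes drag it down.
print(f1_score(y_true, y_pred, average="weighted", zero_division=0))  # high (~0.88)
print(f1_score(y_true, y_pred, average="macro", zero_division=0))     # low (~0.19)
```

The weighted average looks respectable because the two dominant classes carry almost all the support; the macro average reveals that most classes are never predicted at all.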
Many other methods and metrics have been proposed in the machine learning literature to deal with the problems mentioned here. E.g., Boriah et al. discuss some of the existing methods for handling categorical features in their paper “Similarity Measures for Categorical Data: A Comparative Evaluation”. Discussing them is beyond the scope of this post, so I will simply leave you with a link to the paper here.
I went ahead and ran a Multi-layer Perceptron neural network, and here are the results I obtained. The network uses 3 hidden layers of 100 neurons each, with alpha=0.0001 and learning rate=0.01. The following is the confusion matrix. It can be observed that the neural network did a good job overall, correctly classifying most samples of the first 6 classes, with particularly bad results for classes 7 and 9 (Four of a kind and Royal flush).
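For reference, a setup like the one described can be sketched with scikit-learn’s MLPClassifier. The snippet below uses a small synthetic imbalanced dataset as a stand-in so it is self-contained; for the real experiment you would load the UCI training/testing CSVs instead:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.neural_network import MLPClassifier

# Synthetic imbalanced stand-in for the Poker-hand data, so the snippet is
# self-contained; for the real experiment, load the UCI CSV files instead.
X, y = make_classification(n_samples=5000, n_features=10, n_informative=8,
                           n_classes=4, weights=[0.55, 0.40, 0.04, 0.01],
                           random_state=0)

# The architecture described in the text: 3 hidden layers of 100 neurons,
# alpha=0.0001 and a learning rate of 0.01.
clf = MLPClassifier(hidden_layer_sizes=(100, 100, 100), alpha=0.0001,
                    learning_rate_init=0.01, max_iter=50, random_state=0)
clf.fit(X, y)

print(confusion_matrix(y, clf.predict(X)))
```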
The accuracy reported for this classifier is 99%. Even though classes 7 and 9 did very badly, they only contribute 233 samples out of the 1M samples tested. The bad results from a couple of non-dominant classes are completely overshadowed by the other classes. This clearly gives a false impression of success! The neural network misclassified 66% of the Royal flush and 77% of the Four of a kind hands, yet it gets a correct but misleading 99% accuracy result.
The Classification Report shown below includes the previously mentioned macro-average F1 score over all classes. This unweighted mean provides a much better overview of how well the classifier did. It can be seen that most classes actually did quite well, but a couple of classes did particularly badly. More importantly, the macro average reported is 78%. This is a much more appropriate score for the results observed: 2 out of 10 classes did poorly while the others did much better, and that is reflected in the metric, when the metric is chosen carefully.
              precision    recall  f1-score   support

           0       1.00      0.99      0.99    501209
           1       0.99      0.99      0.99    422498
           2       0.96      1.00      0.98     47622
           3       0.99      0.99      0.99     21121
           4       0.85      0.64      0.73      3885
           5       0.97      0.99      0.98      1996
           6       0.77      0.98      0.86      1424
           7       0.70      0.23      0.35       230
           8       1.00      0.83      0.91        12
           9       0.04      0.33      0.07         3

    accuracy                           0.99   1000000
   macro avg       0.83      0.80      0.78   1000000
weighted avg       0.99      0.99      0.99   1000000
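A report in this exact format is produced by scikit-learn’s classification_report. Here is a minimal example with made-up labels (not the Poker-hand predictions) that shows the same macro-vs-weighted effect in miniature:

```python
from sklearn.metrics import classification_report

# Made-up labels for illustration only. Note how class 2, which is never
# predicted correctly, pulls the macro avg down while the weighted avg
# stays dominated by class 0.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

print(classification_report(y_true, y_pred, zero_division=0))
```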
Cattral, R. and Oppacher, F. (2007). Poker Hand Data Set [https://archive.ics.uci.edu/ml/datasets/Poker+Hand]. Carleton University, Department of Computer Science, Intelligent Systems Research Unit.
 Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
 Ling C.X., Sheng V.S. (2011) Class Imbalance Problem. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_110