Overfitting, a common problem in machine learning, occurs when a predictive model learns the noise in the training data instead of the true underlying patterns, so that it performs very well on the training data but poorly on unseen data. Models that overfit cannot be deployed in practice. Regularization is a technique typically used to help a model generalize better. This is usually achieved by adding a penalty term to the loss function that discourages the model from fitting noise, making it more robust to noise in the data and, therefore, more generalizable. One method of regularization is to take some of the probability mass (called the Smoothing Ratio (SR)) from the data sample's ground truth label and distribute it uniformly among all the other labels during training. This method, called label smoothing, is a simple yet effective way to improve generalization. In this work, we explore what happens if we distribute the SR to the non-ground-truth labels based on how closely they are related to the ground truth label, instead of uniformly. We call this approach of distributing the SR according to the relations between labels Preferential Label Smoothing (PLS). PLS represents a more unified approach to performing label smoothing. Ordinary uniform label smoothing becomes ineffective as the number of labels grows large, since the proportion of the SR allotted to each label becomes negligible. PLS is inconsequential in the case of binary classification, since there are only two labels. Therefore, we investigate the effects of PLS when the number of labels in the dataset is high. We also examine the effects of uniform and preferential label smoothing, as well as the absence of label smoothing, on the training dynamics. We demonstrate our study on image classification and text classification.
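The two smoothing schemes described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `similarity` matrix that encodes how closely labels are related is a hypothetical input (in practice it might come from label embeddings or a class hierarchy), and the exact normalization used in the paper is assumed.

```python
import numpy as np

def uniform_label_smoothing(one_hot, sr):
    """Standard label smoothing: remove a fraction `sr` of the ground
    truth label's mass and spread it uniformly over all K labels."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - sr) + sr / k

def preferential_label_smoothing(one_hot, sr, similarity):
    """Hypothetical PLS sketch: distribute the fraction `sr` over the
    non-ground-truth labels in proportion to their similarity to the
    ground truth label, rather than uniformly."""
    true_idx = int(np.argmax(one_hot))
    weights = similarity[true_idx].astype(float).copy()
    weights[true_idx] = 0.0          # only non-ground-truth labels receive mass
    weights /= weights.sum()         # normalize so the redistributed mass sums to sr
    return one_hot * (1.0 - sr) + sr * weights
```

For example, with three labels, `sr = 0.3`, and a true label at index 0, uniform smoothing yields the target `[0.8, 0.1, 0.1]`, while PLS with similarities `[0.6, 0.2]` to the other two labels yields `[0.7, 0.225, 0.075]`, concentrating the smoothed mass on the more closely related label.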
Article ID: 2023L27
Publisher: Canadian Artificial Intelligence Association