Proper calibration of deep-learning models is critical in many high-stakes applications. In this paper, we show that existing calibration metrics ignore miscalibration on individual classes, thereby overlooking minority classes and causing significant problems on imbalanced classification tasks. Using a COVID-19 hate-speech dataset, we first find that in imbalanced datasets, miscalibration error varies greatly across individual classes, and the error on minority classes can be an order of magnitude worse than the overall calibration performance suggests. To address this issue, we propose a new metric based on expected calibration error, named Contraharmonic Expected Calibration Error (CECE), which penalizes severe miscalibration on individual classes. We further devise a novel variant of temperature scaling for imbalanced data that re-weights the loss function by the inverse class count when tuning the scaling parameter, reducing worst-case calibration error on minority classes. A case study on a benchmark COVID-19 hate-speech task demonstrates the effectiveness of both our calibration metric and our temperature-scaling strategy.
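The abstract gives no formulas, so the following is a minimal sketch under stated assumptions: that CECE is the contraharmonic mean (sum of squares over sum) of per-class expected calibration errors computed one-vs-rest, and that the re-weighted temperature scaling selects a temperature minimizing a negative log-likelihood in which each example is weighted by the inverse count of its true class. Function names and the grid-search procedure are illustrative, not the authors' implementation.

```python
import numpy as np

def classwise_ece(probs, labels, n_bins=10):
    """Per-class expected calibration error via one-vs-rest binning.

    probs: (N, C) predicted probabilities; labels: (N,) true class ids.
    Returns an array of length C with one ECE value per class.
    """
    n, c = probs.shape
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    eces = np.zeros(c)
    for k in range(c):
        conf = probs[:, k]
        correct = (labels == k).astype(float)
        for b in range(n_bins):
            mask = (conf > bins[b]) & (conf <= bins[b + 1])
            if b == 0:
                mask |= conf == 0.0  # include exact zeros in the first bin
            if mask.any():
                gap = abs(conf[mask].mean() - correct[mask].mean())
                eces[k] += mask.mean() * gap  # bin weight times |conf - acc|
    return eces

def cece(probs, labels, n_bins=10):
    """Assumed CECE: contraharmonic mean of per-class ECEs, sum(e^2)/sum(e).

    The contraharmonic mean upweights large values, so one badly
    miscalibrated minority class dominates the aggregate score.
    """
    e = classwise_ece(probs, labels, n_bins)
    return float((e ** 2).sum() / e.sum()) if e.sum() > 0 else 0.0

def fit_weighted_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Grid-search a temperature T minimizing an inverse-class-count
    weighted NLL (a hedged reading of the paper's re-weighting scheme)."""
    counts = np.bincount(labels, minlength=logits.shape[1]).astype(float)
    w = 1.0 / counts[labels]  # minority-class examples get larger weight
    best_t, best_loss = 1.0, np.inf
    for t in grid:
        z = logits / t
        z -= z.max(axis=1, keepdims=True)  # stabilize the softmax
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        nll = -(w * logp[np.arange(len(labels)), labels]).sum() / w.sum()
        if nll < best_loss:
            best_t, best_loss = t, nll
    return best_t
```

Because the contraharmonic mean is never below the arithmetic mean for non-negative values, CECE is at least as pessimistic as a plain class-averaged ECE, matching the abstract's goal of punishing severe per-class miscalibration.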
Article ID: 2021L24
Publisher: Canadian Artificial Intelligence Association