Scoring functions for Bayesian network (BN) structure learning can conflict in their rankings and previous work has empirically studied their effectiveness with an aim to provide recommendations on their use. However, previous studies on scoring functions are limited by the small number and scale of the instances used in the evaluation and by a focus on learning a single network. Often, a better alternative to committing to a single network is to learn multiple networks and perform model averaging as this method provides confidence measures for knowledge discovery and improved accuracy for density estimation. In this paper, we empirically study a selection of widely used and also recently proposed scoring functions. We address design limitations of previous empirical studies by scaling our experiments to larger BNs, comparing on an extensive set of both ground truth BNs and real-world datasets, considering alternative performance metrics, and comparing scoring functions on two model averaging frameworks: the bootstrap and the credible set. Contrary to previous recommendations based on finding a single structure, we find that for model averaging the BDeu scoring function is the preferred choice in most scenarios for the bootstrap framework and a recent score called quotient normalized maximum likelihood (qNML) is the preferred choice for the credible set framework.
Article ID: 2022L15
Publisher: Canadian Artificial Intelligence Association