Software Quality Analysis with Distribution Bias in Defect Data
Author | Naeem Seliya
Co-Author(s) | Taghi M. Khoshgoftaar
Abstract | Software quality improvement activities often include building defect prediction models, which enable a more focused allocation of resources toward low-quality program modules. Software metrics and defect data sets are very often biased with respect to class distribution, i.e., the proportion of not-fault-prone program modules is substantially higher than the proportion of fault-prone modules. Class-distribution bias in software defect data leads to poor classification performance. Data Sampling, Boosting, and Bagging are useful techniques for alleviating this problem. In this study, we compare RUSBoost (Random Under-sampling and Boosting) and RBBag (Roughly Balanced Bagging), two techniques proposed by our team for handling class-distribution bias (or class imbalance) in binary classification problems. RUSBoost combines Data Sampling with Boosting, while RBBag combines Data Sampling with Bagging. An empirical software engineering case study consisting of 15 software metrics and defect data sets from several real-world software projects is used to investigate the relative effectiveness of RUSBoost and RBBag for handling software defect data sets with class-distribution bias. The results show that, in the context of our case study, RBBag generally performs better than RUSBoost. Moreover, both RUSBoost and RBBag outperform defect prediction models built without any Data Sampling, Boosting, or Bagging. Our study recommends considering class-distribution bias when building software defect prediction models. Applying effective methods to alleviate this bias is an important step in handling an affected data set properly, and it reduces the negative effect on classification performance.
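The data sampling ingredient that the abstract attributes to both RUSBoost and RBBag can be illustrated with a minimal sketch of random under-sampling. This is only an illustration under stated assumptions, not the authors' implementation of either technique: the function name `random_undersample`, the module names, and the 90/10 class split are all hypothetical, and the boosting and bagging stages are omitted entirely.

```python
import random
from collections import Counter


def random_undersample(modules, labels, seed=0):
    """Randomly drop examples from larger classes until every class
    has as many examples as the smallest class.

    Hypothetical helper for illustration only; it is not taken from
    the RUSBoost or RBBag implementations.
    """
    rng = random.Random(seed)

    # Group the program modules by their class label.
    by_class = {}
    for module, label in zip(modules, labels):
        by_class.setdefault(label, []).append(module)

    # The smallest class determines how many examples each class keeps.
    minority_size = min(len(group) for group in by_class.values())

    sampled = []
    for label, group in by_class.items():
        kept = rng.sample(group, minority_size)  # sampling without replacement
        sampled.extend((module, label) for module in kept)

    rng.shuffle(sampled)
    return sampled


# Assumed toy data: 90 not-fault-prone modules (label 0) and
# 10 fault-prone modules (label 1).
modules = [f"module_{i}" for i in range(100)]
labels = [0] * 90 + [1] * 10

balanced = random_undersample(modules, labels)
class_counts = Counter(label for _, label in balanced)
```

In this sketch the balanced sample keeps all fault-prone modules and an equally sized random subset of the not-fault-prone ones, so a learner trained on it no longer sees the skewed class distribution. An ensemble method would typically repeat such sampling for each base learner rather than once.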
Keywords | defect prediction, software metrics, skewed data, machine learning, data sampling, boosting, bagging
Article #: 19103
August 5-7, 2013 - Honolulu, Hawaii, U.S.A.