A Framework of Combining Data Pre-Processing Methods and Boosting for Software Quality Classification  
Author Taghi M. Khoshgoftaar

 

Co-Author(s) Kehan Gao; Amri Napolitano; Lofton A. Bullard

 

Abstract Feature selection (FS) plays an important role in software quality classification, especially when a training dataset contains too many independent attributes. FS can improve the prediction performance of the predictors, provide faster and more cost-effective predictors, and give a better understanding of the underlying process that generates the model. Class imbalance is a separate problem that is often found in software measurement datasets, wherein the class ratio is skewed. Data sampling, which alters the dataset to change its balance level, has proven to be an effective method for resolving this problem. Boosting is another technique that has been found effective for dealing with the class imbalance problem. In this study, we propose a framework that combines FS and a sampled ensemble learning (boosting) approach for improving software quality classification. There are two different scenarios for this combination: FS performed prior to the boosting process and FS performed inside the boosting process. For FS, we have four options: individual FS, repetitive sampled FS, sampled ensemble FS, and repetitive sampled ensemble FS. To validate the effectiveness of the framework and learn the effects of FS and the boosting approach on classification performance, we conducted a case study, applying the proposed framework to three datasets from a real-world software system. Seventeen feature ranking techniques were examined. We also employed a plain learner to construct classification models on the original datasets (no FS used) and on the altered training datasets (FS used); in addition, we performed the boosting algorithm with no FS. All of these results were used as baselines for further comparison.
The results demonstrate that 1) FS is important and necessary prior to a learning process; 2) FS performed inside the boosting process has better performance than FS performed prior to the boosting process; 3) the repetitive sampled FS technique generally has similar or better performance than the individual FS approach; 4) the boosting algorithm shows the same (or similar) effect on the classification models as the plain learner, especially when a repetitive FS technique is used in data preprocessing; and 5) the ensemble has similar or slightly better performance than the median of the base rankers that make up the ensemble.
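
As an illustration of the "repetitive sampled FS" option named in the abstract, the following is a minimal sketch only, not the authors' implementation. The abstract does not say which sampling method, ranker, or aggregation rule is used, so this sketch assumes random undersampling of the majority class, a toy ranker (absolute difference of class means), and aggregation by summed rank. All function names and the toy data are hypothetical.

```python
import random
from statistics import mean


def undersample(rows, labels, rng):
    """Randomly undersample the majority class to the minority class size (assumed sampler)."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = minority + rng.sample(majority, len(minority))
    return [rows[i] for i in keep], [labels[i] for i in keep]


def mean_difference_scores(rows, labels):
    """Toy ranker (an assumption): |mean of class 1 - mean of class 0| per feature."""
    scores = []
    for j in range(len(rows[0])):
        a = [r[j] for r, y in zip(rows, labels) if y == 1]
        b = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append(abs(mean(a) - mean(b)))
    return scores


def scores_to_ranks(scores):
    """Convert scores to ranks, where rank 1 is the highest score."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0] * len(scores)
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return ranks


def repetitive_sampled_fs(rows, labels, k, repetitions=10, seed=0):
    """Rank features on several sampled copies of the data and keep the k best by summed rank."""
    rng = random.Random(seed)
    total = [0] * len(rows[0])
    for _ in range(repetitions):
        s_rows, s_labels = undersample(rows, labels, rng)
        ranks = scores_to_ranks(mean_difference_scores(s_rows, s_labels))
        total = [t + r for t, r in zip(total, ranks)]
    return sorted(range(len(total)), key=lambda j: total[j])[:k]


# Hypothetical imbalanced toy data: 2 fault-prone modules (label 1), 8 not fault-prone (label 0).
# Feature 0 separates the classes; features 1 and 2 carry little or no signal.
rows = [[10, 1, 0], [10, 1, 1]] + [[0, 1, i % 2] for i in range(8)]
labels = [1, 1] + [0] * 8

print(repetitive_sampled_fs(rows, labels, k=1))  # prints [0]
```

In the "FS prior to boosting" scenario, a selection step like this would run once before the ensemble is trained; in the "FS inside boosting" scenario, it would presumably be repeated on the sampled data of each boosting iteration, though the abstract does not give those details.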

 

Keywords software quality classification, feature selection, data sampling, boosting
   
    Article #:  21104
 
Proceedings of the 21st ISSAT International Conference on Reliability and Quality in Design
August 6-8, 2015 - Philadelphia, Pennsylvania, U.S.A.