Similarity of Wrapper-based Feature Subset Selection Methods on Software Engineering Data  
Author Huanjing Wang

 

Co-Author(s) Taghi M. Khoshgoftaar; Amri Napolitano

 

Abstract Recently, wrapper-based feature (software metric) subset selection techniques have been used as part of the software quality modeling process. However, it is not clear how much the different parameters of wrapper-based feature selection (such as the choice of wrapper learner and wrapper performance metric) affect which features are chosen. To study how these two choices affect the feature selection process, we test five different learners and five different performance metrics within the wrapper and then use our newly proposed Average Pairwise Tanimoto Index (APTI) to evaluate the similarity between techniques that share either a learner or a metric. Three software metric datasets from a real-world software project are used in this study. Results demonstrate that the Best Arithmetic Mean (BAM) and Best Geometric Mean (BGM) metrics exhibit the most similarity regardless of learner; in addition, Overall Accuracy (OA) is the least similar to each of the other metrics (Area Under the ROC Curve (AUC), Area Under the Precision-Recall Curve (PRC), BAM, and BGM) when compared with each individually. The feature subsets chosen by the five learners were also found to show very low similarity to one another. Thus, we show that the choice of both learner and performance metric has a major effect on which features are chosen by wrapper-based feature subset selection.
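The abstract does not give the APTI formula, but the Tanimoto index for two feature subsets is commonly taken to be the Jaccard coefficient |A ∩ B| / |A ∪ B|, and the average pairwise value is then taken over all unordered pairs of subsets. The sketch below follows that assumed reading; the function names and example data are hypothetical and are not the paper's implementation.

```python
from itertools import combinations

def tanimoto_index(subset_a, subset_b):
    """Tanimoto (Jaccard) similarity between two feature subsets:
    |A ∩ B| / |A ∪ B|. Returns 1.0 for two empty subsets by convention."""
    a, b = set(subset_a), set(subset_b)
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)

def average_pairwise_tanimoto(subsets):
    """Average Tanimoto index over all unordered pairs of subsets
    (an assumed reading of APTI; the paper's exact definition may differ)."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two feature subsets")
    return sum(tanimoto_index(a, b) for a, b in pairs) / len(pairs)

# Example: feature subsets chosen by three hypothetical wrapper configurations.
chosen = [
    {"loc", "cyclomatic", "fan_in"},
    {"loc", "cyclomatic", "halstead_volume"},
    {"loc", "fan_out"},
]
print(average_pairwise_tanimoto(chosen))  # higher values = more similar subsets
```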

 

Keywords feature subset selection, software measurements, wrappers, similarity
   
Article #: 20223
 
Proceedings of the 20th ISSAT International Conference on Reliability and Quality in Design
August 7-9, 2014 - Seattle, Washington, U.S.A.