Similarity of Wrapper-based Feature Subset Selection Methods on Software Engineering Data  
Author Huanjing Wang

 

Co-Author(s) Taghi M. Khoshgoftaar; Amri Napolitano

 

Abstract Recently, wrapper-based feature (software metric) subset selection techniques have been used as part of the software quality modeling process. However, it is not clear how much the different parameters of wrapper-based feature selection (such as the choice of wrapper learner and wrapper performance metric) affect which features are chosen. To study how these two choices affect the feature selection process, we test five different learners and five different performance metrics within the wrapper and then use our newly proposed Average Pairwise Tanimoto Index (APTI) to evaluate the similarity between techniques that share either a learner or a metric. Three software metric datasets from a real-world software project are used in this study. Results demonstrate that the Best Arithmetic Mean (BAM) and Best Geometric Mean (BGM) metrics exhibit the most similarity regardless of learner; in addition, Overall Accuracy (OA) is the least similar to each of the other metrics (Area Under the ROC Curve (AUC), Area Under the Precision-Recall Curve (PRC), BAM, and BGM) when compared with each individually. The feature subsets chosen by the five learners were also found to show very low similarity to one another. Thus, we show that the choice of both learner and performance metric has a major effect on which features are chosen by wrapper-based feature subset selection.
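The abstract does not give the APTI formula, but the Tanimoto index for two feature subsets is commonly taken to be the Jaccard coefficient |A ∩ B| / |A ∪ B|, and the average pairwise value is then taken over all unordered pairs of subsets. The sketch below follows that assumed reading; the function names and example data are hypothetical and are not the paper's implementation.

```python
from itertools import combinations

def tanimoto_index(subset_a, subset_b):
    """Tanimoto (Jaccard) similarity between two feature subsets:
    |A ∩ B| / |A ∪ B|. Returns 1.0 for two empty subsets by convention."""
    a, b = set(subset_a), set(subset_b)
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)

def average_pairwise_tanimoto(subsets):
    """Average Tanimoto index over all unordered pairs of subsets
    (an assumed reading of APTI; the paper's exact definition may differ)."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two feature subsets")
    return sum(tanimoto_index(a, b) for a, b in pairs) / len(pairs)

# Example: feature subsets chosen by three hypothetical wrapper configurations.
chosen = [
    {"loc", "cyclomatic", "fan_in"},
    {"loc", "cyclomatic", "halstead_volume"},
    {"loc", "fan_out"},
]
print(average_pairwise_tanimoto(chosen))  # higher values = more similar subsets
```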

 

Keywords feature subset selection, software measurements, wrappers, similarity
   
Article #: 20223
 
Proceedings of the 20th ISSAT International Conference on Reliability and Quality in Design
August 7-9, 2014 - Seattle, Washington, U.S.A.