Evaluating Undersampling and Thresholding with Highly Imbalanced Big Data  
Author: Justin M. Johnson


Co-Author(s): Taghi M. Khoshgoftaar


Abstract: There are a variety of data-level and algorithm-level techniques for addressing the challenges associated with classifying imbalanced data. From the family of data-level techniques, random undersampling (RUS) is a popular choice for big data problems because it simultaneously balances the training distribution and reduces resource requirements. Output thresholding is another popular technique that improves classification performance by tuning the decision threshold that is used to assign class labels to class probabilities. This study explores the interaction between RUS and output thresholding across a range of class distributions using a highly imbalanced big data set from the Centers for Medicare and Medicaid Services. RUS is used to reduce the size of the majority class and create 10 positive class ratios ranging from the original baseline of 0.0004 to a balanced ratio of 0.5. The performance of each distribution is evaluated using Random Forest and Extreme Gradient Boosting learners, multiple complementary performance metrics, and statistical analysis. Our results show that RUS does not improve model performance. Instead, we find that selecting an appropriate decision threshold is all that is required to maximize all performance metrics. We also demonstrate the importance of performance metric selection and highlight the area under the receiver operating characteristic curve as a misleading metric in this highly imbalanced big data context. Our contributions include our evaluation of RUS, output thresholding, and performance metric selection in the context of both big and highly imbalanced data.
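The two techniques the abstract compares can be sketched in a few lines. The following is a minimal illustration, not the authors' pipeline: `random_undersample` discards majority-class (negative) examples until the positives make up a target ratio of the training set, and `best_threshold` sweeps candidate decision thresholds, here scoring each by the geometric mean of true positive rate and true negative rate (the paper's exact metrics and learners are not reproduced). All function and parameter names are hypothetical.

```python
import numpy as np

def random_undersample(X, y, pos_ratio, seed=0):
    """Keep all positives (y == 1) and randomly drop negatives (y == 0)
    until positives form pos_ratio of the returned training set."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # Number of negatives needed so that positives are pos_ratio of the total.
    n_neg = int(len(pos_idx) * (1 - pos_ratio) / pos_ratio)
    keep_neg = rng.choice(neg_idx, size=min(n_neg, len(neg_idx)), replace=False)
    keep = np.concatenate([pos_idx, keep_neg])
    rng.shuffle(keep)
    return X[keep], y[keep]

def best_threshold(y_true, y_prob, grid=None):
    """Output thresholding: choose the cutoff on predicted probabilities
    that maximizes the geometric mean of TPR and TNR."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    best_t, best_g = 0.5, -1.0
    for t in grid:
        y_pred = (y_prob >= t).astype(int)
        tpr = np.mean(y_pred[y_true == 1] == 1)  # true positive rate
        tnr = np.mean(y_pred[y_true == 0] == 0)  # true negative rate
        g = np.sqrt(tpr * tnr)
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g
```

For example, undersampling 10 positives and 990 negatives with `pos_ratio=0.5` yields a balanced 20-example training set, matching the paper's most aggressive ratio; the baseline ratio of 0.0004 would instead leave the data nearly unchanged.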


Keywords: Class Imbalance, Big Data, Ensemble Learners, Healthcare, Medicare, Fraud Detection

Article #: RQD27-72
 

Proceedings of the 27th ISSAT International Conference on Reliability & Quality in Design
Virtual Event

August 4-6, 2022