Quality and Reliability of Data-Driven Business Applications: Methodology of Addressing Imbalanced Data Set Biases  
Author: Andrei Shcheprov


Co-Author(s): Brady McMicken; Mike Sturdevant; Alan Cordell


Abstract: Binary classification algorithms are commonly applied to real-world, data-driven business problems represented by highly imbalanced data sets. Classifiers built on imbalanced data are often viewed as biased toward the majority class. This bias can significantly affect the quality and reliability of data-driven business decisions and outcomes. This paper explores the nature of the bias and its influence on the performance of classification methods by comparing models trained on data with different levels of class imbalance. It is shown that, in practical applications, the bias is usually associated with the model evaluation metric and can be significantly reduced if the selected metric is directly linked to the business requirements. The paper also emphasizes the importance of cost-sensitive learning.
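As an illustration of the point about evaluation metrics, the following minimal sketch compares accuracy with a cost-based metric on a small imbalanced sample. It is not taken from the paper: the labels, scores, cost values (`cost_fp`, `cost_fn`), and function names are hypothetical and were chosen only to show how a decision threshold selected by business cost can differ from a default threshold that looks equally good by accuracy.

```python
# Toy data (assumed for illustration): 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.45, 0.40, 0.60]


def accuracy(y_true, scores, threshold):
    """Fraction of correct predictions when predicting 1 for score >= threshold."""
    correct = sum(
        1 for y, s in zip(y_true, scores) if (1 if s >= threshold else 0) == y
    )
    return correct / len(y_true)


def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Sum of misclassification costs at a given threshold."""
    cost = 0.0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 0:
            cost += cost_fp  # false positive
        elif pred == 0 and y == 1:
            cost += cost_fn  # false negative
    return cost


def best_threshold(y_true, scores, cost_fp, cost_fn):
    """Pick the candidate threshold with the lowest total cost."""
    # Candidates: every observed score, plus "predict nothing as positive".
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: total_cost(y_true, scores, t, cost_fp, cost_fn),
    )


# Hypothetical business costs: a missed positive is 10x worse than a false alarm.
COST_FP, COST_FN = 1.0, 10.0

default_t = 0.5
chosen_t = best_threshold(y_true, scores, COST_FP, COST_FN)

for t in (default_t, chosen_t):
    print(t, accuracy(y_true, scores, t), total_cost(y_true, scores, t, COST_FP, COST_FN))
```

On this toy sample both thresholds give the same accuracy, yet the total cost differs by a factor of ten, which is the kind of gap a business-linked metric is meant to expose.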


Keywords: machine learning (ML) modeling, classification, data computing, performance analysis, optimization
Article #: RQD28-1

Proceedings of 28th ISSAT International Conference on Reliability & Quality in Design
August 3-5, 2023