What is a model?
A model represents a real-world scenario up to some error, epsilon, where epsilon is the part of the outcome that the model cannot explain.
Y = f(X) + epsilon
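As a minimal sketch of this equation, the Python snippet below simulates observations from a hypothetical f(X) = 2X + 1 and adds Gaussian noise as epsilon. The function `f`, its coefficients, the noise level, and the seed are all assumptions chosen purely for illustration; none of them come from the data sets discussed in this post.

```python
import random

def f(x):
    # Hypothetical "true" relationship, assumed for illustration only.
    return 2.0 * x + 1.0

def simulate(xs, noise_sd=0.5, seed=42):
    """Return observed Y values: Y = f(X) + epsilon, epsilon ~ Normal(0, noise_sd)."""
    rng = random.Random(seed)
    return [f(x) + rng.gauss(0.0, noise_sd) for x in xs]

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = simulate(xs)

# The error term is whatever f(X) fails to explain in the observed Y.
residuals = [y - f(x) for x, y in zip(xs, ys)]
print(residuals)
```

With `noise_sd=0.0` the observed values coincide with f(X) exactly, which is the idealised case with no error factor.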
What is an Imbalanced Target Variable?
Let us first go through a few real-world examples:
- Telecom Domain:
In Telecom, subscribers tend to move frequently from one mobile operator to another for better service or offers. This phenomenon, known as Customer Churn, ranges from 5 to 10%. To model it, every customer in the database is coded as 1 = CHURN customer or 0 = ACTIVE customer. Since the Active customers far outnumber the Churn customers, the two classes are far from uniformly distributed, and the data set is called Imbalanced.
Target Variable (Binary) | # of Observations | Target Variable Coding |
---|---|---|
1 | 50,000 | 1 = CHURN customers |
0 | 1,000,000 | 0 = ACTIVE customers |
- Healthcare Domain:
A multi-specialty hospital wanted to predict whether a patient is prone to Diabetes now or in the near future. Modern conveniences have resulted in a more sedentary lifestyle globally, causing an explosion in the rate of Diabetes. Recent studies have shown that close to 92.5% of all the patients were Diabetic or prone to Diabetes, and only 7.5% of the total patients were found to be healthy.
Target Variable (Binary) | # of Observations | Target Variable Coding |
---|---|---|
1 | 925 | 1 = Patients prone to Diabetes |
0 | 75 | 0 = Patients without Diabetes symptoms |
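To make the imbalance in the two examples concrete, here is a small sketch that computes each class's share of the observations and picks out the smaller class. The counts are taken from the two tables above; the helper names `class_shares` and `minority_class` are made up for this illustration.

```python
def class_shares(counts):
    """Return each class's share of the total number of observations."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def minority_class(counts):
    """Return the label with the fewest observations."""
    return min(counts, key=counts.get)

# Counts copied from the Telecom and Healthcare tables above.
telecom = {"CHURN": 50_000, "ACTIVE": 1_000_000}
healthcare = {"Prone to Diabetes": 925, "Without Diabetes symptoms": 75}

for name, counts in [("Telecom", telecom), ("Healthcare", healthcare)]:
    shares = class_shares(counts)
    smallest = minority_class(counts)
    print(f"{name}: minority class = {smallest}, share = {shares[smallest]:.1%}")
```

For these counts the minority class makes up roughly 4.8% of the Telecom data and 7.5% of the Healthcare data.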
What is a Rare Event?
An event is said to be rare if it occurs very infrequently compared with the other possible outcomes.
In both the scenarios mentioned above, Telecom and Healthcare, the management was interested in predicting (modelling) CHURN customers and PATIENTS without Diabetes symptoms, respectively. These two events are called RARE EVENTS, since each occurs relatively seldom when compared with the other level of the TARGET VARIABLE (Y).
How will you statistically evaluate whether the Target Variable is imbalanced / skewed?
Perform a Chi-Square goodness-of-fit test on the class counts (here it is evaluated using R, the open-source statistical software).
Chi-Square Test conducted using R-Software
Patient.Count
Diabetes 925
Without Diabetes 75
Chi-squared test for given probabilities
Null Hypothesis : Data is uniformly distributed
Alternative Hypothesis: Data is not uniformly distributed
data: Clinical.Test[, 1]
X-squared = 722.5, df = 1, p-value < 0.00000000000000022
Chi-Square Test conducted using Minitab
Chi-Square Goodness-of-Fit Test for Observed Counts in Variable: Count
Using category names in Disease
Category | Observed | Test Proportion | Expected | Contribution to Chi-Sq |
---|---|---|---|---|
Y | 925 | 0.5 | 500 | 361.25 |
N | 75 | 0.5 | 500 | 361.25 |
N | DF | Chi-Sq | P-Value |
---|---|---|---|
1000 | 1 | 722.5 | 0.000 |
As the "p-value" < 0.05 (a commonly chosen Alpha value), we can reject the Null Hypothesis and conclude that "Data is not uniformly distributed".
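The statistic reported by both tools can also be reproduced by hand. The sketch below, written in Python as an assumption (the post itself uses R and Minitab), computes the goodness-of-fit statistic against equal expected counts and the p-value for one degree of freedom. The helper names are hypothetical, and the p-value helper relies on the identity that, for df = 1, the upper tail of the chi-square distribution equals erfc(sqrt(x / 2)), so it is only valid for two categories.

```python
import math

def chi_square_uniform(observed):
    """Goodness-of-fit statistic against equal expected counts in every category."""
    total = sum(observed)
    expected = total / len(observed)
    contributions = [(o - expected) ** 2 / expected for o in observed]
    return sum(contributions), contributions

def p_value_df1(statistic):
    """Upper-tail p-value of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))

observed = [925, 75]  # Diabetes, Without Diabetes
statistic, contributions = chi_square_uniform(observed)
p_value = p_value_df1(statistic)

print(statistic)      # 722.5, matching the R and Minitab output
print(contributions)  # [361.25, 361.25], matching the Minitab table
print(p_value < 0.05)
```

For more than two categories a general chi-square survival function is needed; in Python, `scipy.stats.chisquare` performs this test and assumes equal expected frequencies by default.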
How to overcome this problem?
This problem can be overcome by two main methods:
- Sampling methods
  - Over Sampling techniques
  - Under Sampling techniques
- Algorithms
  - Penalized Likelihood Algorithms
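The two sampling methods can be sketched in their simplest random form as follows. This is only an illustration under stated assumptions: the rows are toy tuples standing in for the 75 / 925 Healthcare split, both helpers balance the classes to a 1:1 ratio (a choice made here, not a recommendation from this post), and penalized likelihood algorithms are not covered by the sketch.

```python
import random

def random_oversample(minority, majority, seed=0):
    """Duplicate minority rows (sampling with replacement) until both classes match in size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, majority

def random_undersample(minority, majority, seed=0):
    """Keep a random subset of majority rows (without replacement), sized to the minority class."""
    rng = random.Random(seed)
    return minority, rng.sample(majority, len(minority))

# Toy rows standing in for the Healthcare example; the IDs are made up.
minority = [("healthy", i) for i in range(75)]
majority = [("diabetic", i) for i in range(925)]

over_min, over_maj = random_oversample(minority, majority)
under_min, under_maj = random_undersample(minority, majority)

print(len(over_min), len(over_maj))    # both classes now have 925 rows
print(len(under_min), len(under_maj))  # both classes now have 75 rows
```

Over sampling keeps every original row but repeats minority rows, while under sampling discards majority rows, so the choice trades duplicated information against lost information.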
Disclaimer:
This blog provides a macro-level explanation of Imbalanced Targets (Y). It is very important to employ sound countermeasures against imbalanced targets prior to any modeling activity.
A detailed blog on OVER SAMPLING will be published next.