Modelling an Imbalanced Target Variable

April 16, 2015 by jessica sky

What is a model?

A model represents a real-world scenario up to some error: the target Y is expressed as a function of the inputs X plus an error term, Epsilon.

Y = f(X) + epsilon
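
As a small illustration of the equation above, the following Python sketch simulates observations from a made-up f and random noise. The function f, its coefficients, the noise level and the seed are all assumptions chosen for illustration; none of them come from the data sets discussed in this post.

```python
import random

def f(x):
    # Hypothetical "true" relationship, chosen only for illustration.
    return 2.0 + 3.0 * x

rng = random.Random(42)  # fixed seed so the run is repeatable

xs = [float(i) for i in range(10)]
# epsilon: random error drawn from a normal distribution (mean 0, sd 1)
eps = [rng.gauss(0.0, 1.0) for _ in xs]
ys = [f(x) + e for x, e in zip(xs, eps)]

# The part of Y that f(X) does not explain is epsilon
# (up to floating-point rounding).
residuals = [y - f(x) for x, y in zip(xs, ys)]
for x, y, r in zip(xs, ys, residuals):
    print(f"x={x:4.1f}  y={y:7.3f}  residual={r:6.3f}")
```

In practice f is unknown and has to be estimated from data, so the residuals of a fitted model only approximate Epsilon.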

What is an Imbalanced Target Variable?

Let us first go through a few real-world examples:

Telecom Domain:

In Telecom, subscribers frequently move from one mobile operator to another for better service or offers. This phenomenon, known as Customer Churn, ranges from 5 to 10%. To model it, every customer in the database is coded as 1 = CHURN customer or 0 = ACTIVE customer. Since Active customers far outnumber Churn customers, the two classes are far from evenly distributed, and the data set is called Imbalanced.

Variable	Value	# of Observations	Meaning
Target Variable (Binary)	1	50,000	CHURN customers
Target Variable (Binary)	0	1,000,000	ACTIVE customers
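
To make the degree of imbalance concrete, here is a minimal Python sketch that computes each class's share of the data and the majority-to-minority ratio from the Telecom counts above. It reads the ACTIVE count as one million; the variable names are illustrative.

```python
# 1 = CHURN, 0 = ACTIVE (counts taken from the Telecom table)
counts = {1: 50_000, 0: 1_000_000}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"class {label}: {counts[label]:>9,} rows, {share:.2%} of the data")
print(f"majority : minority = {imbalance_ratio:.0f} : 1")
```

With these counts the CHURN class makes up roughly 4.8% of the rows, a 20 : 1 ratio of ACTIVE to CHURN customers.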

Healthcare Domain:

A multi-specialty hospital wanted to predict whether a patient is prone to Diabetes now or in the near future. Modern conveniences have resulted in a more sedentary lifestyle globally, causing an explosion in the rate of Diabetes. Recent studies have shown that close to 92.5% of all the patients were Diabetic or prone to Diabetes, and only 7.5% of the total patients were found to be healthy.

Variable	Value	# of Observations	Meaning
Target Variable (Binary)	1	925	Patients prone to Diabetes
Target Variable (Binary)	0	75	Patients without Diabetes symptoms

What is a Rare Event?

An event is said to be rare if it occurs very infrequently compared with the other outcomes.

In both scenarios mentioned above, Telecom and Healthcare, the management was interested in predicting (modelling) CHURN customers and PATIENTS without Diabetes symptoms, respectively. These two events are called RARE EVENTS, since each occurs far less often than the other level of the TARGET VARIABLE (Y).
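
The idea of a rare event can be sketched in Python as a simple check on class shares. The function name `rare_classes` and the 10% cut-off are hypothetical choices made for this illustration; the post itself does not define a numeric threshold.

```python
from collections import Counter

def rare_classes(labels, threshold=0.10):
    """Return {class: share} for classes whose share is below `threshold`.

    `threshold` is a hypothetical cut-off chosen for illustration.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items() if n / total < threshold}

# Healthcare example: 925 patients prone to Diabetes (1),
# 75 patients without Diabetes symptoms (0)
patients = [1] * 925 + [0] * 75
print(rare_classes(patients))
```

For the Healthcare counts this flags class 0 (patients without Diabetes symptoms), whose share is 7.5%.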

How will you statistically evaluate whether the Target Variable is imbalanced / skewed?

Perform a Chi-Square goodness-of-fit test on the class counts. The results below were obtained first with R (open-source software) and then with Minitab.

Chi-Square Test conducted using R-Software

	Patient.Count
Diabetes	925
Without Diabetes	75

Chi-squared test for given probabilities

Null Hypothesis : Data is uniformly distributed

Alternative Hypothesis: Data is not uniformly distributed

data: Clinical.Test[, 1]

X-squared = 722.5, df = 1, p-value < 0.00000000000000022

Chi-Square Test conducted using Minitab

Chi-Square Goodness-of-Fit Test for Observed Counts in Variable: Count

Using category names in Disease

Category	Observed	Test Proportion	Expected	Contribution to Chi-Sq
Y	925	0.5	500	361.25
N	75	0.5	500	361.25

N	DF	Chi-Sq	P-Value
1000	1	722.5	0.000

As the p-value < 0.05 (a commonly chosen Alpha value), we can reject the Null Hypothesis and conclude that the data is not uniformly distributed.
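
The same test can be reproduced by hand. The Python sketch below is not the R or Minitab command used above; it recomputes the statistic from the formula, with an expected count of N / k in each category. The function names are illustrative, and the p-value helper is valid only for one degree of freedom.

```python
import math

def chi_square_uniform(observed):
    """Goodness-of-fit test against equal expected counts in every category."""
    n = sum(observed)
    k = len(observed)
    expected = n / k
    # Each category contributes (observed - expected)^2 / expected.
    contributions = [(o - expected) ** 2 / expected for o in observed]
    statistic = sum(contributions)
    df = k - 1
    return statistic, df, contributions

def p_value_df1(statistic):
    # Valid only for df = 1: a chi-square variable with 1 degree of freedom
    # is a squared standard normal, so P(X > s) = erfc(sqrt(s / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))

stat, df, contrib = chi_square_uniform([925, 75])
p = p_value_df1(stat)
print("Chi-Sq:", stat, "DF:", df, "Contributions:", contrib)
print("Reject Null Hypothesis at Alpha = 0.05:", p < 0.05)
```

For the counts 925 and 75 this gives contributions of 361.25 each and a statistic of 722.5 with 1 degree of freedom, matching the output above.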

How to overcome this problem?

This problem can be overcome by two main methods:

Sampling methods

- Over Sampling techniques

- Under Sampling techniques

Algorithms

- Penalized Likelihood Algorithms
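
The two sampling methods listed above can be sketched in Python in their simplest random form. This is a minimal illustration under stated assumptions, not a recommended recipe: the function names, the seed and the toy rows are hypothetical, both functions assume the majority class is at least as large as the minority class, and penalized likelihood algorithms are not covered here.

```python
import random

def random_oversample(majority, minority, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size."""
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return majority, minority + extra

def random_undersample(majority, minority, seed=0):
    """Keep a random subset of majority rows equal in size to the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

# Toy rows sized like the Healthcare example: 925 vs 75
majority = [("prone", i) for i in range(925)]
minority = [("healthy", i) for i in range(75)]

over_maj, over_min = random_oversample(majority, minority)
under_maj, under_min = random_undersample(majority, minority)
print("after over sampling :", len(over_maj), len(over_min))
print("after under sampling:", len(under_maj), len(under_min))
```

Over sampling keeps every original row and repeats minority rows, while under sampling discards majority rows, so the first grows the data set and the second shrinks it.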

Disclaimer:

This blog provides a Macro Level explanation on Imbalanced Targets (Y). It is very important to employ sound countermeasures against imbalanced targets prior to any modeling activity.

A detailed blog on OVER SAMPLING will be published next.
