{"id":202,"date":"2015-04-16T05:49:43","date_gmt":"2015-04-16T05:49:43","guid":{"rendered":"http:\/\/localhost:3000\/serendiosite\/?p=202"},"modified":"2015-09-28T09:23:19","modified_gmt":"2015-09-28T09:23:19","slug":"modelling-imbalanced-target-variable","status":"publish","type":"post","link":"https:\/\/serendio.com\/modelling-imbalanced-target-variable\/","title":{"rendered":"Modelling Imbalanced Target Variable"},"content":{"rendered":"<h2><b>What is a model?<\/b><\/h2>\n<p>A model represents a real-world scenario up to an error term, epsilon.<\/p>\n<p>Y = f(X) + epsilon<\/p>\n<h3><b>What is an Imbalanced Target Variable?<\/b><\/h3>\n<p>Let us first go through a few real-world examples:<\/p>\n<ul>\n<li><b>Telecom Domain:<\/b><\/li>\n<\/ul>\n<p>In Telecom, subscribers frequently move from one mobile operator to another for better service or offers. This phenomenon, known as Customer Churn, typically ranges from 5 to 10%. To model it, every customer in the database is coded as 1 \u2013 CHURN or 0 \u2013 ACTIVE. Since Active customers far outnumber Churn customers, the two classes are far from equally represented, and the data set is called Imbalanced.<\/p>\n<table border=\"0\" cellspacing=\"0\" cellpadding=\"0\" class=\"omsc-custom-table omsc-style-1\">\n<tbody>\n<tr>\n<td width=\"174\"><\/td>\n<td width=\"60\"><\/td>\n<td width=\"142\">\n<p align=\"center\"># of Observations<\/p>\n<\/td>\n<td width=\"240\">\n<p align=\"center\">Target Variable<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td rowspan=\"2\" width=\"174\">\n<p align=\"center\">Target Variable (Binary)<\/p>\n<\/td>\n<td width=\"60\">\n<p align=\"center\">1<\/p>\n<\/td>\n<td width=\"142\">\n<p align=\"center\">50,000<\/p>\n<\/td>\n<td width=\"240\">1 = CHURN customers<\/td>\n<\/tr>\n<tr>\n<td width=\"60\">\n<p align=\"center\">0<\/p>\n<\/td>\n<td width=\"142\">\n<p align=\"center\">1,000,000<\/p>\n<\/td>\n<td width=\"240\">0 = ACTIVE 
customers<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<ul>\n<li><b>Healthcare Domain:<\/b><\/li>\n<\/ul>\n<p>A multi-specialty hospital wanted to predict whether a patient is prone to Diabetes now or in the near future. Modern conveniences have led to a more sedentary lifestyle globally, causing a sharp rise in the incidence of Diabetes. Recent studies have shown that close to 92.5% of all patients were Diabetic or prone to Diabetes, and only 7.5% were found to be healthy.<\/p>\n<table border=\"0\" cellspacing=\"0\" cellpadding=\"0\" class=\"omsc-custom-table omsc-style-1\">\n<tbody>\n<tr>\n<td width=\"174\"><\/td>\n<td width=\"60\"><\/td>\n<td width=\"142\">\n<p align=\"center\"># of Observations<\/p>\n<\/td>\n<td width=\"240\">\n<p align=\"center\">Target Variable<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td rowspan=\"2\" width=\"174\">\n<p align=\"center\">Target Variable (Binary)<\/p>\n<\/td>\n<td width=\"60\">\n<p align=\"center\">1<\/p>\n<\/td>\n<td width=\"142\">\n<p align=\"center\">925<\/p>\n<\/td>\n<td width=\"240\">1 = Patients prone to Diabetes<\/td>\n<\/tr>\n<tr>\n<td width=\"60\">\n<p align=\"center\">0<\/p>\n<\/td>\n<td width=\"142\">\n<p align=\"center\">75<\/p>\n<\/td>\n<td width=\"240\">0 = Patients without Diabetes symptoms<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h3><b>What is a Rare Event?<\/b><\/h3>\n<p>An event is said to be rare if it occurs very infrequently.<\/p>\n<p>In both scenarios above, Telecom and Healthcare, management was interested in predicting (modelling) CHURN customers and PATIENTS without Diabetes symptoms, respectively. These two events are called RARE EVENTS, since each occurs far less often than the other level of the TARGET VARIABLE (Y).<\/p>\n<h3><b>How will you statistically evaluate whether the Target Variable is imbalanced \/ skewed?<\/b><\/h3>\n<p>Perform a Chi-Square Test 
of goodness of fit against equal class proportions, as in the output below (evaluated here using the open-source software R).<\/p>\n<p><b>Chi-Square Test conducted using R-Software<\/b><\/p>\n<p>Null Hypothesis: Data is uniformly distributed<\/p>\n<p>Alternative Hypothesis: Data is not uniformly distributed<\/p>\n<table class=\"omsc-custom-table omsc-style-1\">\n<tbody>\n<tr>\n<th>Category<\/th>\n<th>Patient.Count<\/th>\n<\/tr>\n<tr>\n<td>Diabetes<\/td>\n<td>925<\/td>\n<\/tr>\n<tr>\n<td>Without Diabetes<\/td>\n<td>75<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Chi-squared test for given probabilities<\/p>\n<p>data: Clinical.Test[, 1]<\/p>\n<p>X-squared = 722.5, df = 1, p-value &lt; 2.2e-16<\/p>\n<p>&nbsp;<\/p>\n<p><b>Chi-Square Test conducted using Minitab<\/b><\/p>\n<p><b>Chi-Square Goodness-of-Fit Test for Observed Counts in Variable: Count<\/b><\/p>\n<p>&nbsp;<\/p>\n<p>Using category names in Disease<\/p>\n<table class=\"omsc-custom-table omsc-style-1\">\n<tbody>\n<tr>\n<th>Category<\/th>\n<th>Observed<\/th>\n<th>Test Proportion<\/th>\n<th>Expected<\/th>\n<th>Contribution to Chi-Sq<\/th>\n<\/tr>\n<tr>\n<td>Y<\/td>\n<td>925<\/td>\n<td>0.5<\/td>\n<td>500<\/td>\n<td>361.25<\/td>\n<\/tr>\n<tr>\n<td>N<\/td>\n<td>75<\/td>\n<td>0.5<\/td>\n<td>500<\/td>\n<td>361.25<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table class=\"omsc-custom-table omsc-style-1\">\n<tbody>\n<tr>\n<th>N<\/th>\n<th>DF<\/th>\n<th>Chi-Sq<\/th>\n<th>P-Value<\/th>\n<\/tr>\n<tr>\n<td>1000<\/td>\n<td>1<\/td>\n<td>722.5<\/td>\n<td>0.000<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p>As the p-value is below 0.05 (a commonly chosen alpha value), we can reject the Null Hypothesis and conclude that the data is not uniformly distributed, i.e. the Target Variable is imbalanced.<\/p>\n<p><b>How to overcome this problem?<\/b><\/p>\n<p>This problem can be addressed by two main approaches:<\/p>\n<ul>\n<li>Sampling methods\n<ul>\n<li>Over Sampling techniques<\/li>\n<li>Under Sampling techniques<\/li>\n<\/ul>\n<\/li>\n<li>Algorithms\n<ul>\n<li>Penalized Likelihood Algorithms<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p><b>Disclaimer:<\/b><\/p>\n<p><b>This blog 
provides a Macro Level explanation on Imbalanced Targets (Y).\u00a0 It is very important to employ sound countermeasures against imbalanced targets prior to any modeling activity.<\/b><\/p>\n<p><b>Detailed blog on OVER SAMPLING will be published next.<\/b><\/p>\n","protected":false},"excerpt":{"rendered":"<p>What is a model? Model represents a real world scenario with some Epsilon, where Epsilon represents the Error factor. Y = f(X) + epsilon What&hellip;<\/p>\n<p><a href=\"https:\/\/serendio.com\/modelling-imbalanced-target-variable\/\" class=\"more-link post-excerpt-readmore\">Read more<\/a><\/p>\n","protected":false},"author":31170,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[],"class_list":["post-202","post","type-post","status-publish","format-standard","hentry","category-myblog"],"_links":{"self":[{"href":"https:\/\/serendio.com\/wp-json\/wp\/v2\/posts\/202","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/serendio.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/serendio.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/serendio.com\/wp-json\/wp\/v2\/users\/31170"}],"replies":[{"embeddable":true,"href":"https:\/\/serendio.com\/wp-json\/wp\/v2\/comments?post=202"}],"version-history":[{"count":4,"href":"https:\/\/serendio.com\/wp-json\/wp\/v2\/posts\/202\/revisions"}],"predecessor-version":[{"id":424,"href":"https:\/\/serendio.com\/wp-json\/wp\/v2\/posts\/202\/revisions\/424"}],"wp:attachment":[{"href":"https:\/\/serendio.com\/wp-json\/wp\/v2\/media?parent=202"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/serendio.com\/wp-json\/wp\/v2\/categories?post=202"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/serendio.com\/wp-json\/wp\/v2\/tags?post=202"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}