Implement a Naïve Bayes classifier on the data below.
Please provide a full explanation.
CustID | Gender | SeniorCitizen | Married | AnyDependents | NoOfYrsCustomer | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | ContractType | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | SwitchToCompetitor |
1 | F | 0 | Y | N | 1 | N | No phone | DSL | N | Y | N | N | N | N | Monthly | Y | Electronic check | 29.85 | 29.85 | No |
2 | M | 0 | N | N | 34 | Y | N | DSL | Y | N | Y | N | N | N | One year | N | Mailed check | 56.95 | 1889.5 | No |
3 | M | 0 | N | N | 2 | Y | N | DSL | Y | Y | N | N | N | N | Monthly | Y | Mailed check | 53.85 | 108.15 | Yes |
4 | M | 0 | N | N | 45 | N | No phone | DSL | Y | N | Y | Y | N | N | One year | N | Bank transfer (automatic) | 42.3 | 1840.75 | No |
Naive Bayes Classifier:
This is a machine learning model used for classification tasks. It is a simple probabilistic classifier built on Bayes' theorem. The main assumption of the Naive Bayes classifier is that the predictor variables are independent of one another given the class.
Let us now take a brief look at the underlying concept, Bayes' theorem. It says that the probability of an event can be updated using prior knowledge of conditions related to that event. For example, the probability of death from coronavirus increases given a history of respiratory disease.
The mathematical formula is as follows:
P(A | B) = P(B | A) * P(A) / P(B)
That is, the probability of A given B equals the probability of B given A multiplied by the probability of A, all divided by the probability of B.
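As a quick check of the formula, here is a minimal sketch that applies it to the table above, taking A = "customer switches" and B = "contract type is Monthly". The function name bayes_posterior and the choice of these two events are illustrative assumptions, not part of the original question.

```python
from fractions import Fraction

# (ContractType, SwitchToCompetitor) for the four customers in the table
rows = [
    ("Monthly", "No"),
    ("One year", "No"),
    ("Monthly", "Yes"),
    ("One year", "No"),
]

def bayes_posterior(p_b_given_a, p_a, p_b):
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

n = len(rows)
switchers = [r for r in rows if r[1] == "Yes"]

p_a = Fraction(len(switchers), n)                         # P(switch) = 1/4
p_b = Fraction(sum(r[0] == "Monthly" for r in rows), n)   # P(Monthly) = 2/4
p_b_given_a = Fraction(
    sum(r[0] == "Monthly" for r in switchers), len(switchers)
)                                                         # P(Monthly | switch) = 1/1

posterior = bayes_posterior(p_b_given_a, p_a, p_b)
print(posterior)  # 1/2
```

The result agrees with counting directly: of the two customers on a Monthly contract, one switched.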
Let's look at the given data for a moment. We are expected to classify whether a customer will switch to a competitor or not, given predictors such as gender, senior citizen flag, married flag, any dependents, no_of_years_customer, phone_service, multiple_lines, internet_service, online_security, online_backup, device_protection, tech_support, streaming_tv, streaming_movies, contract_type, paperless_billing, payment_method, monthly_charges, total_charges.
Columns are variables and rows are observations. The dataset has 21 columns (the CustID identifier, 19 predictors, and the target SwitchToCompetitor) and 4 observations. Reading the first row: a customer who is female, not a senior citizen, married, without dependents, with no_of_yrs_customer equal to 1, and so on, did not switch to the competitor. The other rows can be interpreted the same way.
As mentioned at the beginning, the main assumption we have to make is that the predictor variables are independent of one another given the class. Now let's rewrite the formula for our problem.
We have to find the probability of the dependent variable, that is, whether the customer will switch to the competitor, given the values of the independent variables. Let y be the dependent variable and x1, ..., xn the independent variables (features). Under the independence assumption, Bayes' theorem becomes
P(y | x1, ..., xn) = P(y) * P(x1 | y) * ... * P(xn | y) / P(x1, ..., xn)
Here n = 19: we have a total of 19 features that will be used to find the probability of the customer switching to the competitor.
By substituting values from the given data we can find the probability of a customer switching to the competitor. Each probability in the formula has to be estimated from the dataset. For example, the prior probability of switching is P(y = Yes) = 1/4 = 0.25, and the prior probability of not switching is P(y = No) = 3/4 = 0.75. The conditional probabilities P(xi | y) are estimated the same way, by counting how often each feature value occurs within each class. Substituting all of these into the formula above gives a score for each class, and the predicted value of the dependent variable is the class with the higher score.
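The counting procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a definitive implementation: it uses only five of the categorical columns for brevity, the names FEATURES, ROWS, fit and predict are hypothetical, and it adds Laplace smoothing (alpha = 1), which the explanation above does not mention. Without some smoothing, a feature value never seen in a class (very likely with only four rows) would force that class's score to zero.

```python
from collections import Counter, defaultdict

# A subset of the categorical columns from the table (chosen for brevity).
FEATURES = ["Gender", "Married", "OnlineBackup", "ContractType", "PaperlessBilling"]
ROWS = [
    # Gender, Married, OnlineBackup, ContractType, PaperlessBilling, SwitchToCompetitor
    ("F", "Y", "Y", "Monthly", "Y", "No"),
    ("M", "N", "N", "One year", "N", "No"),
    ("M", "N", "Y", "Monthly", "Y", "Yes"),
    ("M", "N", "N", "One year", "N", "No"),
]

def fit(rows, alpha=1.0):
    """Estimate class priors and smoothed conditional probabilities by counting."""
    class_counts = Counter(r[-1] for r in rows)
    priors = {c: k / len(rows) for c, k in class_counts.items()}

    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    levels = defaultdict(set)            # feature index -> distinct values seen
    for r in rows:
        for i, value in enumerate(r[:-1]):
            value_counts[(i, r[-1])][value] += 1
            levels[i].add(value)

    def likelihood(i, value, label):
        # Laplace-smoothed estimate of P(x_i = value | y = label)
        return (value_counts[(i, label)][value] + alpha) / (
            class_counts[label] + alpha * len(levels[i])
        )

    return priors, likelihood

def predict(x, priors, likelihood):
    """Return normalized posterior probabilities P(y | x) for each class."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for i, value in enumerate(x):
            score *= likelihood(i, value, label)  # independence assumption
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

priors, likelihood = fit(ROWS)
print(priors)  # {'No': 0.75, 'Yes': 0.25}

# A hypothetical new customer with the same feature values as customer 3
posterior = predict(("M", "N", "Y", "Monthly", "Y"), priors, likelihood)
print(max(posterior, key=posterior.get))  # Yes
```

Dividing by the total in predict plays the role of the denominator P(x1, ..., xn), which is the same for both classes and so does not change which class wins.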
There are different types of Naive Bayes classifiers, each suited to a different kind of feature: Multinomial Naive Bayes for count data, Bernoulli Naive Bayes for binary features, and Gaussian Naive Bayes for continuous features.
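For numeric columns such as NoOfYrsCustomer, MonthlyCharges and TotalCharges, the Gaussian variant replaces the counted probability P(xi | y) with a normal density fitted per class. The sketch below illustrates this for NoOfYrsCustomer only. The names DATA, gaussian_pdf and fit_gaussian are hypothetical, and the variance floor of 1.0 is an assumption added here because the "Yes" class has a single observation and would otherwise have zero variance.

```python
import math
from statistics import mean, pvariance

# (NoOfYrsCustomer, SwitchToCompetitor) for the four customers in the table
DATA = [(1, "No"), (34, "No"), (2, "Yes"), (45, "No")]

def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gaussian(data, var_floor=1.0):
    """Estimate a (mean, variance) pair for each class.

    var_floor is an assumed lower bound that keeps the variance positive
    when a class has only one observation.
    """
    params = {}
    for label in {lab for _, lab in data}:
        values = [x for x, lab in data if lab == label]
        params[label] = (mean(values), max(pvariance(values), var_floor))
    return params

params = fit_gaussian(DATA)

# Likelihood of a 3-year customer under each class, P(x | y)
for label, (mu, var) in sorted(params.items()):
    print(label, gaussian_pdf(3, mu, var))
```

These densities would then be multiplied into the class score in the same way as the counted probabilities for the categorical columns.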