In: Computer Science
(a) Provide a comprehensive response describing naive Bayes.
(b) Explain how naive Bayes is used to filter spam. Please make sure to explain how this process works.
(c) Explain how naive Bayes is used by insurance companies to
detect potential fraud in the claim process.
A discussion of roughly 700 words is needed.
Your assignment should include at least five (5) reputable sources, be written in APA Style, and run 500 to 650 words.
Answer:
(a)
Naive Bayes classifiers are a collection of classification algorithms based on Bayes' Theorem. It is not a single algorithm but a family of algorithms that share a common principle: every pair of features being classified is assumed to be independent of each other, conditional on the class.
To start with, consider a fictional dataset that describes the weather conditions for playing a game of golf. Given the weather conditions, each tuple classifies the conditions as fit ("Yes") or unfit ("No") for playing golf.
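Under the independence assumption, and letting x_1, ..., x_n denote the observed weather features of a day, the classifier compares

P(Yes) × P(x_1 | Yes) × ... × P(x_n | Yes)   versus   P(No) × P(x_1 | No) × ... × P(x_n | No)

and predicts whichever class gives the larger value. Each factor is estimated by simple counting over the training tuples.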
(b)
One way spam emails are sorted is with a Naive Bayes classifier. The Naive Bayes algorithm relies on Bayes' Rule and classifies each object by looking at all of its features individually. Bayes' Rule, shown below, gives the posterior probability for a single feature. A posterior probability is calculated for each feature, and these probabilities are multiplied together to get a final probability for one class; the same is done for the other class. Whichever class has the greater probability ultimately determines which class the object is in.
Bayes' Rule: P(A | B) = P(B | A) × P(A) / P(B)
For our purposes, the object is an email and the features are the unique words in the email. Thus, a posterior probability is calculated for each unique word in the email. Plugging this into Bayes' Rule, our formula will look something like this:
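P(spam | word) = P(word | spam) × P(spam) / P(word)

The word-level posteriors are then combined: P(spam) is multiplied by P(word | spam) for every unique word in the email, the same product is computed for the ham class, and the email is assigned to whichever class yields the larger value.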
Now that we understand Naive Bayes, we can create our own spam filter.
Creating A Spam Filter Using Python/Scikit-Learn
Creating your own spam filter is surprisingly easy. The first step is to get a dataset of emails; one can be found on Kaggle and will need to be read into a pandas dataframe. Your dataframe should look something like this:
Sample DataFrame containing emails
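For reference, loading such a dataset might look like the following (the filename is a placeholder; any Kaggle spam/ham CSV with 'text' and 'label_num' columns would work):

    import pandas as pd

    # Read the email dataset into a dataframe ('spam_ham.csv' is a placeholder name)
    df = pd.read_csv('spam_ham.csv')
    print(df.head())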
In this case, the 'text' column contains the message within each email, and the 'label_num' column holds the outcome for each email: a 1 represents an email that is 'spam', while a 0 represents an email that is not spam, or 'ham'. Besides pandas, you will also need to import a few scikit-learn modules.
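A typical set, assuming a CountVectorizer/MultinomialNB pipeline (MultinomialNB is the variant suited to word-count features):

    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB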
Now that you have your dataset ready, it’s time to train your classifier with just a few lines of code:
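A sketch of what those lines might look like, assuming the dataframe from above is named df:

    # Hold out some emails so the classifier can be evaluated on unseen data
    X_train, X_test, y_train, y_test = train_test_split(
        df['text'], df['label_num'], test_size=0.2, random_state=42)

    # Turn each email into a vector of word counts
    vectorizer = CountVectorizer()
    X_train_counts = vectorizer.fit_transform(X_train)
    X_test_counts = vectorizer.transform(X_test)

    # Train the Naive Bayes classifier and check its accuracy on the test set
    classifier = MultinomialNB()
    classifier.fit(X_train_counts, y_train)
    print(classifier.score(X_test_counts, y_test))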
In the code above, the first thing I did was create a train-test split, which isn't strictly necessary to build the classifier but lets us evaluate it on unseen emails. In the next step, I use CountVectorizer() to change each email into a vector counting the occurrences of each word.
(c)
Insurance fraud covers the range of improper activities an individual may commit in order to achieve a favorable outcome from the insurance company. Potential situations could include staging the incident, misrepresenting the situation (including the relevant actors and the cause of the incident), and exaggerating the extent of the damage caused.
The insurance industry has grappled with the challenge of insurance claim fraud from the very start. On one hand, there is the challenge of impact to customer satisfaction through delayed payouts or prolonged investigation during a period of stress. Additionally, there are costs of investigation and pressure from insurance industry regulators. On the other hand, improper payouts cause a hit to profitability and encourage similar delinquent behavior from other policy holders.
According to the FBI, the insurance industry in the USA consists of over 7,000 companies that collectively receive over $1 trillion in premiums annually. The FBI also estimates the total cost of insurance fraud (non-health insurance) to be more than $40 billion annually.
It must be noted that insurance fraud is not a victimless crime: the losses due to fraud impact all the involved parties through increased premium costs, a trust deficit during the claims process, and damage to process efficiency and innovation.
Hence the insurance industry has an urgent need to develop capabilities that can help identify potential fraud with a high degree of accuracy, so that legitimate claims can be cleared rapidly while flagged cases are scrutinized in detail.
2.0 Why Machine Learning in Fraud Detection?
The traditional approach to fraud detection is based on developing heuristics around fraud indicators. Based on these heuristics, a decision on fraud would be made in one of two ways. In certain scenarios, rules would be framed that define whether the case needs to be sent for investigation. In other cases, a checklist would be prepared with scores for the various indicators of fraud, and an aggregation of these scores, along with the value of the claim, would determine whether the case needs to be sent for investigation. The criteria for determining indicators and thresholds would be tested statistically and periodically recalibrated; a minimal sketch of the checklist variant appears below.
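The following sketch illustrates the checklist idea; the indicator names, scores, and thresholds are invented purely for illustration:

    # Hypothetical fraud-indicator checklist; scores and thresholds are illustrative only
    INDICATOR_SCORES = {
        'claim_soon_after_policy_start': 3,
        'police_not_notified': 2,
        'repair_amount_near_sum_insured': 4,
    }
    INVESTIGATION_THRESHOLD = 6

    def needs_investigation(claim_flags, claim_value, high_value=100000):
        # Aggregate the scores of all indicators present on the claim
        score = sum(INDICATOR_SCORES[flag] for flag in claim_flags)
        # High-value claims are sent for investigation at a lower score
        if claim_value > high_value:
            return score >= INVESTIGATION_THRESHOLD - 2
        return score >= INVESTIGATION_THRESHOLD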
The challenge with the above approaches is that they rely very heavily on manual intervention, which leads to limitations that are difficult to overcome from a traditional statistics perspective. Hence, insurers have started looking at leveraging machine learning capability. The intent is to present a variety of data to the algorithm without prior judgment about the relevance of the data elements; based on identified frauds, the machine develops a model through a variety of algorithmic techniques, which can then be tested against those known frauds.
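As a minimal illustration of this idea with a Naive Bayes model (the feature names and data below are fabricated purely for illustration; a real exercise would use many more attributes):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    # Hypothetical claims data: numeric indicators plus a known fraud label
    claims = pd.DataFrame({
        'repair_amount':  [1200, 45000, 800, 39000, 1500, 42000],
        'days_to_report': [1, 30, 2, 45, 3, 28],
        'prior_claims':   [0, 4, 1, 3, 0, 5],
        'is_fraud':       [0, 1, 0, 1, 0, 1],
    })

    X = claims.drop(columns='is_fraud')
    y = claims['is_fraud']

    # Train on claims with known outcomes, then score held-out claims
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
    model = GaussianNB().fit(X_train, y_train)
    print(model.predict_proba(X_test))  # probability each held-out claim is fraudulent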
3.0 Exercise Objectives
The exercise explores various machine learning techniques to improve the accuracy of detection on imbalanced samples. The impact of feature engineering, feature selection, and parameter tweaking is explored with the objective of achieving superior predictive performance.
As a procedure, the data will be split into three different segments: training, testing, and cross-validation. The algorithm will be trained on the training set and its parameters tweaked on the testing set, and performance will then be examined on the cross-validation set. The high-performing models will then be tested on various random splits of the data to ensure consistency in results.
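A sketch of this three-way split (the 60/20/20 proportions are an assumption; the text does not state them):

    from sklearn.model_selection import train_test_split

    # First carve out the training set, then split the remainder into a testing
    # set (for parameter tweaking) and a cross-validation set (for evaluation)
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_test, X_cv, y_test, y_cv = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)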
The exercise was conducted on Apollo™, Wipro's Anomaly Detection Platform, which applies a combination of pre-defined rules and predictive machine learning algorithms to identify outliers in data. It is built on open source software with a library of pre-built algorithms that enable rapid deployment, and it can be customized and managed. This big data platform comprises the three layers indicated below.
Three layers of Apollo's architecture: Data Handling, Detection, and Outcomes.
The exercise described above was performed on four different insurance datasets, which cannot be named for reasons of confidentiality. Data descriptions for the datasets are given below.
4.0 Data Set Description
4.1 Introduction to Datasets
Table 1: Features of various datasets
4.2 Detailed Description of Datasets
Overall Features:
The insurance datasets can be classified into different categories of details: policy details, claim details, party details, vehicle details, repair details, and risk details. Some attributes listed in the datasets are: categorical attributes such as Vehicle Style, Gender, Marital Status, License Type, and Injury Type; date attributes such as Loss Date, Claim Date, and Police Notified Date; and numerical attributes such as Repair Amount, Sum Insured, and Market Value.
For better data exploration, the data is divided and explored from the perspectives of both the insured party and the third party. After doing some Exploratory Data Analysis (EDA) on all the datasets, some key insights are listed below.
Dataset – 1:
Dataset – 2:
For Insured:
For Third Party: