Question

In: Computer Science

a) Provide a comprehensive response describing naive Bayes?

(b) Explain how naive Bayes is used to filter spam. Please make sure to explain how this process works.

(c) Explain how naive Bayes is used by insurance companies to detect potential fraud in the claim process.

Need a 700-word discussion.

Your assignment should include at least five (5) reputable sources, be written in APA Style, and run 500 to 650 words.

Solutions

Expert Solution

Answer:

a)

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is not a single algorithm but a family of algorithms that share a common principle: every feature being classified is assumed to be independent of every other feature.

To start with, let us consider a dataset.

Consider a fictional dataset that describes the weather conditions for playing a game of golf. Given the weather conditions, each tuple classifies the conditions as fit (“Yes”) or unfit (“No”) for playing golf.
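To make the principle concrete, here is a minimal, hand-rolled naive Bayes on a toy golf dataset. The rows and feature values are invented for this sketch, not taken from a real dataset; a small amount of Laplace smoothing is added so unseen feature values do not zero out the product.

```python
# A minimal, hand-rolled naive Bayes on a toy golf dataset
# (illustrative data invented for this sketch).
from collections import Counter, defaultdict

# Each row: (outlook, temperature, play_golf)
data = [
    ("sunny", "hot", "No"), ("sunny", "mild", "No"),
    ("overcast", "hot", "Yes"), ("rainy", "mild", "Yes"),
    ("rainy", "cool", "Yes"), ("overcast", "cool", "Yes"),
    ("sunny", "cool", "Yes"), ("rainy", "hot", "No"),
]

labels = [row[-1] for row in data]
priors = Counter(labels)                      # class counts for P(class)

# Count each feature value per class: counts[class][(position, value)]
counts = defaultdict(Counter)
for *features, label in data:
    for i, v in enumerate(features):
        counts[label][(i, v)] += 1

def posterior(features, label):
    """Unnormalised P(label) * prod_i P(feature_i | label), Laplace-smoothed."""
    p = priors[label] / len(data)
    for i, v in enumerate(features):
        p *= (counts[label][(i, v)] + 1) / (priors[label] + 2)
    return p

sample = ("sunny", "cool")
scores = {c: posterior(sample, c) for c in priors}
print(max(scores, key=scores.get))  # → Yes (the class with the larger score)
```

Each class score is the prior times a product of per-feature likelihoods, exactly the "multiply the individual feature probabilities together" step described above; the class with the greater score wins.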

b)

One way spam emails are sorted is with a Naive Bayes classifier. The Naive Bayes algorithm relies on Bayes’ Rule. The algorithm classifies each object by looking at all of its features individually. Bayes’ Rule, shown below, gives the posterior probability for a single feature. The posterior probability of the object is calculated for each feature, and these probabilities are multiplied together to get a final probability for one class. The same probability is calculated for the other class as well. Whichever class has the greater probability ultimately determines the class of the object.

Bayes’ Rule:

P(A | B) = P(B | A) × P(A) / P(B)

For our purposes, the object is an email and the features are the unique words in the email. Thus, a posterior probability is calculated for each unique word in the email. Plugging this into Bayes’ Rule, the formula looks something like this:

P(spam | w1, ..., wn) ∝ P(spam) × P(w1 | spam) × ... × P(wn | spam)

Now that we understand Naive Bayes, we can create our own spam filter.

Creating A Spam Filter Using Python/Scikit-Learn

Creating your own spam filter is surprisingly easy. The first step is to get a dataset of emails; one can be found on Kaggle and read into a pandas DataFrame. Your DataFrame should look something like this:

Sample DataFrame containing emails

In this case, the ‘text’ column contains the message within each email. The ‘label_num’ column holds the outcome for each email: a 1 represents an email that is ‘spam’, while a 0 represents an email that is not spam, or ‘ham’. Besides pandas, you will also need a few scikit-learn modules: CountVectorizer, train_test_split, and MultinomialNB.

Now that you have your dataset ready, it’s time to train your classifier with just a few lines of code:

In the code above, the first thing I did was create a train-test split, which isn’t strictly necessary to build the classifier but lets you evaluate it on held-out data. In the next step I use CountVectorizer() to change each email into a vector counting the occurrences of each word.
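The steps just described can be sketched end to end as follows. Since the Kaggle CSV is not reproduced here, a tiny inline dataset stands in for it; the column names ‘text’ and ‘label_num’ follow the description above, and the sample emails are invented.

```python
# Sketch of the spam filter described above. A tiny inline dataset
# stands in for the Kaggle CSV; sample emails are invented.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.DataFrame({
    "text": [
        "win a free prize now", "free money click now",
        "meeting agenda for monday", "lunch tomorrow at noon",
        "claim your free prize", "project report attached",
    ],
    "label_num": [1, 1, 0, 0, 1, 0],   # 1 = spam, 0 = ham
})

# Train-test split (stratified so both classes appear in training data).
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label_num"], test_size=0.33,
    stratify=df["label_num"], random_state=42)

# Turn each email into a vector of word counts (bag of words).
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# Multinomial naive Bayes works directly on these count vectors.
clf = MultinomialNB()
clf.fit(X_train_counts, y_train)

print(clf.predict(vectorizer.transform(["free prize now"])))  # → [1] (spam)
```

Note that the same fitted vectorizer must be used for training, test, and any new email, so that each word maps to the same vector position everywhere.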

c)

Firms increasingly view each contact with their customers as an opportunity that needs to be managed. The primary purpose of this article is to gain a better understanding of the customers' post-complaint period. Specific focus is placed on the impact of effective complaint handling on actual customer behavior over time, whereas previous research has mainly focused on time-invariant or intentional measures. Survival analysis techniques are used to investigate the longitudinal behavior of complainants after their problem recovery. The proportionality assumption is tested for each explanatory variable under investigation. In addition, the impact of each variable is estimated by means of survival forests. Survival forests enable us to explore the evolution over time of the effects of the covariates under investigation. As such, the impact of each explanatory variable is allowed to change as the experiment evolves over time, in contrast to ‘proportional’ models that restrict these estimates to be stationary. Our research is performed in the context of a financial services provider and analyses the post-complaint periods of 2326 customers. Our findings indicate that (i) it is interesting to consider complainants, since they represent a typical and rather active customer segment; (ii) it is beneficial to invest in complaint handling, since these investments are likely to influence customers' future behavior; and (iii) survival forests are a helpful tool to investigate the impact of complaint handling on future customer behavior, since their components provide evidence of changing effects over time.

Insurance fraud covers the range of improper activities an individual may commit in order to achieve a favorable outcome from the insurance company. This could range from staging the incident to misrepresenting the situation, including the relevant actors, the cause of the incident, and the extent of the damage caused.

Potential situations could include:

  • Covering up a situation that wasn’t covered under insurance (e.g. drunk driving, performing risky acts, illegal activities, etc.)
  • Misrepresenting the context of the incident: this could include transferring blame in incidents where the insured party is at fault, or failure to take agreed-upon safety measures
  • Inflating the impact of the incident: increasing the estimate of loss incurred, either through the addition of unrelated losses (faking losses) or by attributing increased cost to the losses

The insurance industry has grappled with the challenge of insurance claim fraud from the very start. On one hand, there is the challenge of impact to customer satisfaction through delayed payouts or prolonged investigation during a period of stress. Additionally, there are costs of investigation and pressure from insurance industry regulators. On the other hand, improper payouts cause a hit to profitability and encourage similar delinquent behavior from other policy holders.

According to the FBI, the insurance industry in the USA consists of over 7,000 companies that collectively receive over $1 trillion annually in premiums. The FBI also estimates the total cost of insurance fraud (non-health insurance) to be more than $40 billion annually.

It must be noted that insurance fraud is not a victimless crime: the losses due to fraud impact all involved parties through increased premium costs, a trust deficit during the claims process, and impacts to process efficiency and innovation.

Hence the insurance industry has an urgent need to develop capabilities that can help identify potential fraud with a high degree of accuracy, so that legitimate claims can be cleared rapidly while flagged cases are scrutinized in detail.

2.0 Why Machine Learning in Fraud Detection?

The traditional approach to fraud detection is based on developing heuristics around fraud indicators. Based on these heuristics, a decision on fraud would be made in one of two ways. In certain scenarios, rules would be framed to define whether the case needs to be sent for investigation. In other cases, a checklist would be prepared with scores for the various indicators of fraud. An aggregation of these scores, along with the value of the claim, would determine whether the case needs to be sent for investigation. The criteria for determining indicators and thresholds would be tested statistically and periodically recalibrated.
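The score-based checklist approach can be sketched as follows. The indicator names, weights, threshold, and the high-value adjustment are all invented for this illustration; a real insurer's checklist would be calibrated statistically, as noted above.

```python
# Illustrative sketch of a score-based fraud checklist.
# Indicator names, weights, and thresholds are invented for this example.
FRAUD_INDICATORS = {
    "license_type_missing": 3,
    "multiple_parties_involved": 2,
    "claim_soon_after_policy_start": 4,
    "holiday_week_accident": 1,
}

def needs_investigation(claim, threshold=5, high_value=50_000):
    """Aggregate indicator scores; escalate high-value claims at a lower bar."""
    score = sum(w for name, w in FRAUD_INDICATORS.items() if claim.get(name))
    if claim.get("claim_amount", 0) >= high_value:
        threshold -= 2          # scrutinise large claims more readily
    return score >= threshold

claim = {"license_type_missing": True, "multiple_parties_involved": True,
         "claim_amount": 60_000}
print(needs_investigation(claim))  # score 5 ≥ adjusted threshold 3 → True
```

The limitations that follow apply directly to a scheme like this: the indicator set is fixed, the weights and threshold must be recalibrated by hand, and context-specific interactions between indicators are not captured.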

The challenge with the above approaches is that they rely very heavily on manual intervention, which leads to the following limitations:

  • Constrained to operate with a limited set of known parameters based on heuristic knowledge, while being aware that other attributes could also influence decisions
  • Inability to understand context-specific relationships between parameters (geography, customer segment, insurance sales process) that might not reflect the typical picture. Consultations with industry experts indicate that there is no ‘typical model’, hence the challenge of determining a model specific to each context
  • Recalibration of the model is a manual exercise that has to be conducted periodically to reflect changing behavior and to ensure the model adapts to feedback from investigations. The ability to conduct this calibration is challenging
  • The incidence of fraud (as a percentage of overall claims) is low: typically less than 1% of claims are classified as fraudulent. Additionally, new modus operandi for fraud need to be uncovered on a proactive basis

These are challenging from a traditional statistics perspective. Hence, insurers have started looking at leveraging machine learning capabilities. The intent is to present a variety of data to the algorithm without judgement about the relevance of the data elements. Based on identified frauds, the intent is for the machine to develop a model, through a variety of algorithmic techniques, that can be tested on these known frauds.

3.0 Exercise Objectives

Explore various machine learning techniques to improve the accuracy of detection on imbalanced samples. The impact of feature engineering, feature selection and parameter tuning is explored with the objective of achieving superior predictive performance.

As a procedure, the data will be split into three segments: training, testing and cross-validation. The algorithm is trained on one subset of the data and its parameters tuned on the testing set. Performance is then examined on the cross-validation set. The high-performing models are then tested on various random splits of the data to ensure consistency of results.
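A three-way split like the one described can be sketched as below. The 60/20/20 proportions, the synthetic features, and the ~1% positive rate (mirroring the fraud incidence mentioned earlier) are assumptions for illustration; the actual platform's split is not stated.

```python
# Sketch of a three-way train/test/cross-validation split.
# Proportions (60/20/20) and the synthetic data are assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # placeholder features
y = (rng.random(1000) < 0.01).astype(int)      # ~1% positives (imbalanced)

# First carve off the training set, then split the remainder in half.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_test), len(X_val))   # → 600 200 200
```

Repeating this with different `random_state` values gives the "various random splits" used to check that a model's performance is consistent rather than an artifact of one particular partition.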

The exercise was conducted on Apollo™, Wipro’s Anomaly Detection Platform, which applies a combination of pre-defined rules and predictive machine learning algorithms to identify outliers in data. It is built on open source software with a library of pre-built algorithms that enable rapid deployment, and it can be customized and managed. This Big Data Platform comprises three layers, as indicated below.

Three layers of Apollo’s architecture

Data Handling:

  • Data Cleansing
  • Transformation
  • Tokenizing

Detection Layer

  • Business Rules
  • ML Algorithms

Outcomes

  • Dashboards
  • Detailed Reports
  • Case Management

The exercise described above was performed on four different insurance datasets. Their names cannot be disclosed for reasons of confidentiality.

Data descriptions for the datasets are given below.

4.0 Data Set Description

4.1 Introduction to Datasets

Table 1: Features of various datasets

4.2 Detailed Description of Datasets

Overall Features:

The insurance data can be classified into different categories of details: policy details, claim details, party details, vehicle details, repair details, and risk details. Attributes listed in the datasets include categorical attributes (Vehicle Style, Gender, Marital Status, License Type, Injury Type, etc.), date attributes (Loss Date, Claim Date, Police Notified Date, etc.), and numerical attributes (Repair Amount, Sum Insured, Market Value, etc.).
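The attribute types above can be represented in a pandas DataFrame roughly as follows. The column names come from the text; the sample values are invented for this sketch.

```python
# Illustrative sketch of the attribute types described above,
# using invented sample values (column names taken from the text).
import pandas as pd

claims = pd.DataFrame({
    "Vehicle Style": ["sedan", "suv"],              # categorical
    "License Type":  ["full", None],                # categorical, may be blank
    "Loss Date":     ["2020-01-05", "2020-02-10"],  # date
    "Claim Date":    ["2020-01-07", "2020-02-11"],  # date
    "Repair Amount": [1200.0, 5400.0],              # numerical
    "Sum Insured":   [20000.0, 35000.0],            # numerical
})

# Parse date attributes and mark categoricals explicitly.
for col in ["Loss Date", "Claim Date"]:
    claims[col] = pd.to_datetime(claims[col])
claims["Vehicle Style"] = claims["Vehicle Style"].astype("category")

print(claims.dtypes)
```

Typing the columns this way matters for the EDA that follows: blank categorical values (such as a missing License Type) surface as nulls that can be counted per class, which is exactly the kind of signal the insights below rely on.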

For better data exploration, the data is divided and explored from the perspectives of both the insured party and the third party. After some Exploratory Data Analysis (EDA) on all the datasets, key insights are listed below:

Dataset – 1:

  • Of all fraudulent claims, 20% involve multiple parties, and when multiple parties are involved there is a 73% chance that the claim is fraudulent
  • 11% of fraudulent claims occurred during holiday weeks; moreover, an accident that happens during a holiday week is 80% more likely to be fraudulent

Dataset – 2:

For Insured:

  • 72% of the claimants’ vehicles have an unknown drivability status, whereas in non-fraud claims most vehicles have a drivability status of yes or no
  • 75% of the claimants have a blank license type, which is suspicious because non-fraud claims have their license types recorded

For Third Party:

  • 97% of third-party vehicles involved in fraud are drivable, yet the claim amount is very high (i.e. the accident is not serious but the claim amount is high)
  • 97% of the claimants have a blank license type, which is again suspicious because non-fraud claims have their license types recorded
