Question

In: Computer Science

Use the Titanic dataset and perform EDA on various columns. Without using any modeling algorithms, and only using basic methods such as frequency distributions, describe the most important predictors of survival of Titanic passengers: e.g., were males or females more likely to survive? Were young, rich females more likely to survive than old, poor males?

Submit the response in a fully "knit" R Markdown file.

Solutions

Expert Solution

Source Code:

# linear algebra
import numpy as np

# data processing
import pandas as pd 

# data visualization
import seaborn as sns
# Jupyter magic for inline plots; omit this line outside a notebook
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB

test_df = pd.read_csv("test.csv")
train_df = pd.read_csv("train.csv")

train_df.info()


# count and percentage of missing values per column
total = train_df.isnull().sum().sort_values(ascending=False)
percent_1 = train_df.isnull().sum() / train_df.isnull().count() * 100
percent_2 = round(percent_1, 1).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])
missing_data.head(5)
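
The same table can be computed more compactly if you prefer; this sketch is equivalent, since isnull().mean() gives the fraction of missing values per column:

# percentage of missing values per column, sorted, as a one-liner
print((train_df.isnull().mean() * 100).round(1).sort_values(ascending=False))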

# point plot of survival by class and sex, one row of panels per port of
# embarkation (catplot replaces the older FacetGrid + pointplot pattern,
# whose size= argument is deprecated in newer seaborn versions)
sns.catplot(data=train_df, x='Pclass', y='Survived', hue='Sex',
            row='Embarked', kind='point', height=4.5, aspect=1.6)


# average survival rate per passenger class
sns.barplot(x='Pclass', y='Survived', data=train_df)
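
Since the question asks for conclusions from plain frequency distributions, it is worth printing the survival rates by Sex and Pclass directly at this point, while Sex is still a string column. A minimal sketch using only columns already in train_df:

# survival rate by sex: roughly 74% of females vs. 19% of males survived
print(train_df.groupby('Sex')['Survived'].mean())

# survival rate broken down by sex and class together: first-class women
# fare best, third-class men worst
print(pd.crosstab([train_df['Sex'], train_df['Pclass']],
                  train_df['Survived'], normalize='index'))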


# combine SibSp and Parch into a single 'relatives' count;
# not_alone is 1 when the passenger traveled with no relatives
data = [train_df, test_df]
for dataset in data:
    dataset['relatives'] = dataset['SibSp'] + dataset['Parch']
    dataset.loc[dataset['relatives'] > 0, 'not_alone'] = 0
    dataset.loc[dataset['relatives'] == 0, 'not_alone'] = 1
    dataset['not_alone'] = dataset['not_alone'].astype(int)
train_df['not_alone'].value_counts()
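
A quick frequency check of the new flag against survival (a small sketch; recall that not_alone == 1 means the passenger traveled with no relatives under the encoding above):

# survival rate for passengers traveling alone vs. with family
print(train_df.groupby('not_alone')['Survived'].mean())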


# data processing: PassengerId carries no information about survival
train_df = train_df.drop(['PassengerId'], axis=1)

import re

# extract the deck letter from the Cabin string; 'U' marks unknown
deck = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "U": 8}
data = [train_df, test_df]

for dataset in data:
    dataset['Cabin'] = dataset['Cabin'].fillna("U0")
    dataset['Deck'] = dataset['Cabin'].map(lambda x: re.compile(r"([a-zA-Z]+)").search(x).group())
    dataset['Deck'] = dataset['Deck'].map(deck)
    dataset['Deck'] = dataset['Deck'].fillna(0)
    dataset['Deck'] = dataset['Deck'].astype(int)
# we can now drop the cabin feature
train_df = train_df.drop(['Cabin'], axis=1)
test_df = test_df.drop(['Cabin'], axis=1)
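
Deck is a rough proxy for wealth and location on the ship, so a frequency view of survival per deck is worth a look. A minimal sketch:

# survival rate per extracted deck number (0 = cabin unknown)
print(train_df.groupby('Deck')['Survived'].mean())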

# Ticket has too many distinct string values to be useful, so drop it
# (leaving it in would also break model fitting later, since scikit-learn
# expects numeric inputs)
train_df = train_df.drop(['Ticket'], axis=1)
test_df = test_df.drop(['Ticket'], axis=1)

# Fare: fill the single missing value in test.csv and cast to int
data = [train_df, test_df]

for dataset in data:
    dataset['Fare'] = dataset['Fare'].fillna(0)
    dataset['Fare'] = dataset['Fare'].astype(int)


data = [train_df, test_df]
titles = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

for dataset in data:
    # extract titles
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
    # replace titles with a more common title or as Rare
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr',\
                                            'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    # convert titles into numbers
    dataset['Title'] = dataset['Title'].map(titles)
    # fill NaN with 0 to be safe
    dataset['Title'] = dataset['Title'].fillna(0)
train_df = train_df.drop(['Name'], axis=1)
test_df = test_df.drop(['Name'], axis=1)


# convert Sex into a numeric value
genders = {"male": 0, "female": 1}
data = [train_df, test_df]

for dataset in data:
    dataset['Sex'] = dataset['Sex'].map(genders)

# convert the Embarked feature into numeric; first fill the two missing
# values in train.csv with 'S', the most common port
ports = {"S": 0, "C": 1, "Q": 2}
data = [train_df, test_df]

for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
    dataset['Embarked'] = dataset['Embarked'].map(ports)


data = [train_df, test_df]
for dataset in data:
    # Age has missing values; fill them with the median before the cast,
    # otherwise astype(int) fails on NaN
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())
    dataset['Age'] = dataset['Age'].astype(int)
    # bin ages into 7 ordinal groups
    dataset.loc[ dataset['Age'] <= 11, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 11) & (dataset['Age'] <= 18), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 18) & (dataset['Age'] <= 22), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 22) & (dataset['Age'] <= 27), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 27) & (dataset['Age'] <= 33), 'Age'] = 4
    dataset.loc[(dataset['Age'] > 33) & (dataset['Age'] <= 40), 'Age'] = 5
    dataset.loc[ dataset['Age'] > 40, 'Age'] = 6
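
To verify the bins and read the age effect off a frequency distribution, tabulate the groups (a quick sanity-check sketch):

# passengers per age group, and survival rate within each group
print(train_df['Age'].value_counts().sort_index())
print(train_df.groupby('Age')['Survived'].mean())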



# interaction feature: age group times passenger class
data = [train_df, test_df]
for dataset in data:
    dataset['Age_Class'] = dataset['Age'] * dataset['Pclass']

# fare per person, dividing the (group) fare by the party size
for dataset in data:
    dataset['Fare_Per_Person'] = dataset['Fare'] / (dataset['relatives'] + 1)
    dataset['Fare_Per_Person'] = dataset['Fare_Per_Person'].astype(int)

# a last look at the training set before training the models
train_df.head(10)
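
Every remaining column is numeric at this point, so a simple correlation scan against Survived summarizes which features track survival most strongly (a rough linear measure, not a model):

# correlation of each feature with Survived, strongest first
print(train_df.corr()['Survived'].sort_values(ascending=False))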


# building models: test.csv has no Survived column, so all accuracies
# below are measured on the training set
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId", axis=1).copy()
# Stochastic Gradient Descent (SGD):
sgd = linear_model.SGDClassifier(max_iter=5, tol=None)
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)

sgd.score(X_train, Y_train)

acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
#Random Forest:
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)

Y_prediction = random_forest.predict(X_test)

random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
# Logistic Regression:
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)

Y_pred = logreg.predict(X_test)

acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
# K Nearest Neighbors:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, Y_train)

Y_pred = knn.predict(X_test)

acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
# Gaussian Naive Bayes:
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)

Y_pred = gaussian.predict(X_test)

acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
#Perceptron:
perceptron = Perceptron(max_iter=5)
perceptron.fit(X_train, Y_train)

Y_pred = perceptron.predict(X_test)

acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
#Linear Support Vector Machine:
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)

Y_pred = linear_svc.predict(X_test)

acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
# Decision Tree:
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)

Y_pred = decision_tree.predict(X_test)

acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
# Which is the best model?
results = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Random Forest', 'Naive Bayes', 'Perceptron',
              'Stochastic Gradient Descent',
              'Decision Tree'],
    'Score': [acc_linear_svc, acc_knn, acc_log,
              acc_random_forest, acc_gaussian, acc_perceptron,
              acc_sgd, acc_decision_tree]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
result_df.head(9)
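
The scores above are training-set accuracies, so flexible models such as the decision tree and random forest look deceptively strong. For a fairer ranking you could cross-validate instead; a short sketch using scikit-learn's cross_val_score (the cv=10 choice is an assumption, not part of the original solution):

from sklearn.model_selection import cross_val_score

# 10-fold cross-validation gives a less optimistic accuracy estimate
# than scoring a model on the same data it was fitted to
rf = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(rf, X_train, Y_train, cv=10, scoring='accuracy')
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))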

Let me know if you have any doubts or if you need anything changed.

If you are satisfied with the solution, please leave positive feedback. :) Let me know if you need help with any other questions.

Thank You!