Question 1:
Please fetch the ‘Price’ column of the ford_escort.csv dataset.
Count the number of rows, and calculate the maximum, minimum, standard deviation, and average of this column.
Show the cars whose price is above the average.
Here is the link to download ford_escort.csv: https://www.dropbox.com/s/qczaguno5hfdico/ford_escort.csv?dl=0
Question 2:
The following are the attributes of the wine quality dataset:
Input variables (based on physicochemical tests):
1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol
Output variable (based on sensory data):
12. quality (score between 0 and 10)
Please split the dataset into 80% for training and 20% for testing. Use two different machine learning algorithms to predict the labels for the testing segment (the 20% that you have set aside for testing).
Calculate the accuracy and see which algorithm performs better. You can use any ML algorithm, such as a decision tree, KNN, or any other.
Here is the link to download wine.csv:
https://www.dropbox.com/s/v0t9tb4gq8tiqh8/wine.csv?dl=0
Answer for Question 1:
import pandas as pd

# Load the dataset and compute summary statistics for the Price column
df = pd.read_csv('ford_escort.csv')
count = df['Price'].count()
max_price = df['Price'].max()
min_price = df['Price'].min()
std_price = df['Price'].std()
mean_price = df['Price'].mean()

print('No. of Rows : ' + str(count))
print('Max Price : ' + str(max_price))
print('Min Price : ' + str(min_price))
print('Stdv Price : ' + str(std_price))
print('Average of Price : ' + str(mean_price))

# Select the cars priced above the average
print("\nCars that are above average price are:\n")
cars = df[['Year', 'Mileage', 'Price']][df['Price'] > mean_price]
print(cars)
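For reference, pandas can produce most of these statistics in one call; a minimal sketch, assuming the same column names as above:

import pandas as pd

df = pd.read_csv('ford_escort.csv')
# describe() reports count, mean, std, min, max, and quartiles for the column
print(df['Price'].describe())
# boolean-mask selection of the above-average cars, equivalent to the answer above
print(df.loc[df['Price'] > df['Price'].mean(), ['Year', 'Mileage', 'Price']])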
Answer for Question 2:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the wine quality dataset
df = pd.read_csv("wine.csv")
print("Rows, columns: " + str(df.shape))
Output: Rows, columns: (1599, 12)
df.head()
Output:
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076
1            7.8              0.88         0.00             2.6      0.098
2            7.8              0.76         0.04             2.3      0.092
3           11.2              0.28         0.56             1.9      0.075
4            7.4              0.70         0.00             1.9      0.076

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                   11                    34   0.9978  3.51       0.56
1                   25                    67   0.9968  3.20       0.68
2                   15                    54   0.9970  3.26       0.65
3                   17                    60   0.9980  3.16       0.58
4                   11                    34   0.9978  3.51       0.56

   alcohol  quality
0      9.4        5
1      9.8        5
2      9.8        5
3      9.8        6
4      9.4        5
In [41]:
print(df.isna().sum())
fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
dtype: int64
In [42]:
# Create Classification version of target variable
df['goodquality'] = [1 if x >= 7 else 0 for x in df['quality']]
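An equivalent vectorized form of the same thresholding (just a sketch) avoids the Python-level loop:

df['goodquality'] = (df['quality'] >= 7).astype(int)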
In [43]:
# Separate feature variables and target variable
X = df.drop(['quality', 'goodquality'], axis=1)
y = df['goodquality']
In [44]:
# See proportion of good vs bad wines
df['goodquality'].value_counts()
Out[44]:
0 1382
1 217
Name: goodquality, dtype: int64
In [45]:
# Standardize feature variables (StandardScaler: zero mean, unit variance)
from sklearn.preprocessing import StandardScaler
X_features = X  # keep the original DataFrame (with column names) for reference
X = StandardScaler().fit_transform(X)
In [46]:
# Splitting the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
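Note that the scaler above was fit on all of X before the split, so test-set statistics leak into the transformation. A stricter variant (a sketch, reusing X_features and y from above) splits first and fits the scaler on the training rows only; stratify=y also preserves the 1382/217 class ratio in both splits:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first, then derive scaling statistics from the training rows only
X_train, X_test, y_train, y_test = train_test_split(
    X_features, y, test_size=0.2, random_state=0, stratify=y)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)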
In [47]:
# Decision Tree
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
model1 = DecisionTreeClassifier(random_state=1)
model1.fit(X_train, y_train)
y_pred1 = model1.predict(X_test)
print(classification_report(y_test, y_pred1))
              precision    recall  f1-score   support

           0       0.97      0.91      0.94       290
           1       0.48      0.77      0.59        30

    accuracy                           0.90       320
   macro avg       0.73      0.84      0.77       320
weighted avg       0.93      0.90      0.91       320
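The question also names KNN as an option; a minimal sketch on the same split (scikit-learn's KNeighborsClassifier with its default of five neighbours) would be:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()  # default n_neighbors=5
knn.fit(X_train, y_train)
print(classification_report(y_test, knn.predict(X_test)))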
In [48]:
#Random Forest
from sklearn.ensemble import RandomForestClassifier
model2 = RandomForestClassifier(random_state=1)
model2.fit(X_train, y_train)
y_pred2 = model2.predict(X_test)
print(classification_report(y_test, y_pred2))
              precision    recall  f1-score   support

           0       0.95      0.97      0.96       290
           1       0.65      0.50      0.57        30

    accuracy                           0.93       320
   macro avg       0.80      0.74      0.76       320
weighted avg       0.92      0.93      0.92       320
In [49]:
#AdaBoost
from sklearn.ensemble import AdaBoostClassifier
model3 = AdaBoostClassifier(random_state=1)
model3.fit(X_train, y_train)
y_pred3 = model3.predict(X_test)
print(classification_report(y_test, y_pred3))
              precision    recall  f1-score   support

           0       0.94      0.96      0.95       290
           1       0.52      0.43      0.47        30

    accuracy                           0.91       320
   macro avg       0.73      0.70      0.71       320
weighted avg       0.90      0.91      0.91       320
In [50]:
#Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
model4 = GradientBoostingClassifier(random_state=1)
model4.fit(X_train, y_train)
y_pred4 = model4.predict(X_test)
print(classification_report(y_test, y_pred4))
              precision    recall  f1-score   support

           0       0.95      0.95      0.95       290
           1       0.53      0.57      0.55        30

    accuracy                           0.91       320
   macro avg       0.74      0.76      0.75       320
weighted avg       0.92      0.91      0.91       320
In [51]:
#XGBoost
import xgboost as xgb
model5 = xgb.XGBClassifier(random_state=1)
model5.fit(X_train, y_train)
y_pred5 = model5.predict(X_test)
print(classification_report(y_test, y_pred5))
              precision    recall  f1-score   support

           0       0.97      0.93      0.95       290
           1       0.52      0.73      0.61        30

    accuracy                           0.91       320
   macro avg       0.75      0.83      0.78       320
weighted avg       0.93      0.91      0.92       320
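Since the question asks to compare accuracies explicitly, a short loop over the stored predictions (a sketch reusing y_pred1 through y_pred5 from above) prints them side by side:

from sklearn.metrics import accuracy_score

# Test-set accuracy of each fitted model, for a direct comparison
for name, y_pred in [('Decision tree', y_pred1), ('Random forest', y_pred2),
                     ('AdaBoost', y_pred3), ('Gradient boosting', y_pred4),
                     ('XGBoost', y_pred5)]:
    print(f'{name}: {accuracy_score(y_test, y_pred):.2f}')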
Comparing the five models, the random forest yields the highest overall accuracy (0.93), with XGBoost close behind (0.91). However, since XGBoost has a better f1-score on the minority "good quality" class (0.61 vs. 0.57), I conclude that XGBoost is the best of the five models for identifying good wines.