Question

Question 1:

Please fetch the ‘Price’ column of the ford_escort.csv dataset.

Count the number of rows, and calculate the max, min, stdv and average of this column.

Show the cars whose price is above the average.

Here is the link to download ford_escort.csv: https://www.dropbox.com/s/qczaguno5hfdico/ford_escort.csv?dl=0

Question 2:

The following are the attributes of the wine quality dataset:

Input variables (based on physicochemical tests):

  1. fixed acidity
  2. volatile acidity
  3. citric acid
  4. residual sugar
  5. chlorides
  6. free sulfur dioxide
  7. total sulfur dioxide
  8. density
  9. pH
  10. sulphates
  11. alcohol

Output variable (based on sensory data):

  12. quality (score between 0 and 10)

Please split the dataset into 80% for training and 20% for testing. Use two different machine learning algorithms to predict the labels for the testing segment (the 20% that you have separated for the testing).

Calculate the accuracy and see which algorithm provides better accuracy. You can use any ML algorithm, such as the decision tree, KNN, or any others.

Here is the link to download wine.csv:

https://www.dropbox.com/s/v0t9tb4gq8tiqh8/wine.csv?dl=0

Solutions

Expert Solution

Answer for Question 1:

import pandas as pd

# Load the dataset and select the 'Price' column
df = pd.read_csv('ford_escort.csv')
price = df['Price']

# Summary statistics: count() ignores missing values, std() is the sample standard deviation
count = price.count()
max_price = price.max()
min_price = price.min()
std_price = price.std()
mean_price = price.mean()

print('No. of Rows : ' + str(count))
print('Max Price : ' + str(max_price))
print('Min Price : ' + str(min_price))
print('Stdv Price : ' + str(std_price))
print('Average of Price : ' + str(mean_price))

# Keep only the rows whose price is above the average
print("\nCars that are above average price are :\n")
cars = df.loc[price > mean_price, ['Year', 'Mileage', 'Price']]
print(cars)
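
One detail worth checking in the answer above is that `count()` only counts non-missing values, so it equals the number of rows only when the 'Price' column has no gaps. The sketch below illustrates the difference, and also the default sample standard deviation, on a small made-up frame. The prices are invented for illustration and are not taken from ford_escort.csv.

```python
import pandas as pd

# Made-up prices for illustration only; one value is missing on purpose.
toy = pd.DataFrame({"Price": [9991.0, 9925.0, None, 10491.0]})

n_rows = len(toy)                          # every row, including the missing price
n_non_null = toy["Price"].count()          # only rows with a price
sample_std = toy["Price"].std()            # pandas default: sample std (ddof=1)
population_std = toy["Price"].std(ddof=0)  # population std, for comparison

print("rows:", n_rows, "non-null prices:", n_non_null)
print("sample std:", sample_std, "population std:", population_std)
```

If the real column has no missing values, `len(df)` and `df['Price'].count()` agree and either one answers the question.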

Answer for Question 2:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("wine.csv")
print("Rows, columns: " + str(df.shape))

Output: Rows, columns: (1599, 12)

df.head()

Output:
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70         0.00             1.9      0.076                   11                    34   0.9978  3.51       0.56      9.4        5
1            7.8              0.88         0.00             2.6      0.098                   25                    67   0.9968  3.20       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                   15                    54   0.9970  3.26       0.65      9.8        5
3           11.2              0.28         0.56             1.9      0.075                   17                    60   0.9980  3.16       0.58      9.8        6
4            7.4              0.70         0.00             1.9      0.076                   11                    34   0.9978  3.51       0.56      9.4        5
In [41]:
print(df.isna().sum())
fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
dtype: int64
In [42]:
# Create a binary classification target: 1 = good wine (quality >= 7), 0 = otherwise
df['goodquality'] = [1 if x >= 7 else 0 for x in df['quality']]
In [43]:
# Separate feature variables and target variable
X = df.drop(['quality','goodquality'], axis = 1)
y = df['goodquality']
In [44]:
# See proportion of good vs bad wines
df['goodquality'].value_counts()
Out[44]:
0 1382
1 217
Name: goodquality, dtype: int64
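
Because the two classes are this unbalanced, a model that always predicts class 0 already scores a high accuracy, which is a useful yardstick for the reports further down. A minimal sketch of that baseline, using only the counts printed above and the test-set supports (290 and 30) shown in the classification reports:

```python
# Class counts from the value_counts() output above
n_bad, n_good = 1382, 217
# Test-set supports as printed in the classification reports of this answer
test_bad, test_good = 290, 30

# Accuracy of a trivial classifier that always predicts class 0
baseline_full = n_bad / (n_bad + n_good)
baseline_test = test_bad / (test_bad + test_good)

print(f"Majority-class accuracy, full data: {baseline_full:.3f}")
print(f"Majority-class accuracy, test split: {baseline_test:.3f}")
```

Any model whose test accuracy is close to the second number is doing little better than this trivial rule, which is one reason to look at per-class precision and recall as well.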
In [45]:
# Normalize feature variables
from sklearn.preprocessing import StandardScaler
X_features = X
X = StandardScaler().fit_transform(X)
In [46]:
# Splitting the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
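
Note that the cells above fit the scaler on the full dataset before splitting, so the test rows influence the scaling statistics. A common alternative, sketched here under assumptions, is to split first and let a pipeline fit the scaler on the training part only. The data below is synthetic (generated with `make_classification`), the names `X_demo`, `y_demo` and `pipe` are hypothetical, and KNN is used only because the question mentions it; `stratify` is an optional extra that keeps the class ratio the same in both parts.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine features: 11 columns, roughly 86% / 14% classes
X_demo, y_demo = make_classification(
    n_samples=500, n_features=11, weights=[0.86, 0.14], random_state=0
)

# Split first: 80% training, 20% testing, preserving the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# The pipeline fits the scaler on the training data only, then the classifier
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```

For the tree-based models used in this answer, scaling has little effect on the result, so this matters mainly for distance-based models such as KNN.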
In [47]:
# Decision Tree
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier

model1 = DecisionTreeClassifier(random_state=1)
model1.fit(X_train, y_train)
y_pred1 = model1.predict(X_test)

print(classification_report(y_test, y_pred1))
              precision    recall  f1-score   support

           0       0.97      0.91      0.94       290
           1       0.48      0.77      0.59        30

    accuracy                           0.90       320
   macro avg       0.73      0.84      0.77       320
weighted avg       0.93      0.90      0.91       320
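
The question asks specifically for accuracy, which appears as one row of the report above. If only that single number is wanted, `accuracy_score` from `sklearn.metrics` returns it directly. A minimal sketch with hand-made labels (the lists are invented for illustration; in the notebook the same call would take `y_test` and a prediction array such as `y_pred1`):

```python
from sklearn.metrics import accuracy_score

# Hand-made labels for illustration: 6 of the 8 predictions match
y_true_demo = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 1, 1, 0, 0, 1, 0]

# Fraction of predictions that equal the true label
demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)
print("Accuracy:", demo_accuracy)
```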

In [48]:
#Random Forest
from sklearn.ensemble import RandomForestClassifier

model2 = RandomForestClassifier(random_state=1)
model2.fit(X_train, y_train)
y_pred2 = model2.predict(X_test)

print(classification_report(y_test, y_pred2))
              precision    recall  f1-score   support

           0       0.95      0.97      0.96       290
           1       0.65      0.50      0.57        30

    accuracy                           0.93       320
   macro avg       0.80      0.74      0.76       320
weighted avg       0.92      0.93      0.92       320

C:\Users\Dell\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
"10 in version 0.20 to 100 in 0.22.", FutureWarning)
In [49]:
#AdaBoost
from sklearn.ensemble import AdaBoostClassifier

model3 = AdaBoostClassifier(random_state=1)
model3.fit(X_train, y_train)
y_pred3 = model3.predict(X_test)

print(classification_report(y_test, y_pred3))
              precision    recall  f1-score   support

           0       0.94      0.96      0.95       290
           1       0.52      0.43      0.47        30

    accuracy                           0.91       320
   macro avg       0.73      0.70      0.71       320
weighted avg       0.90      0.91      0.91       320

In [50]:
#Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier

model4 = GradientBoostingClassifier(random_state=1)
model4.fit(X_train, y_train)
y_pred4 = model4.predict(X_test)

print(classification_report(y_test, y_pred4))
              precision    recall  f1-score   support

           0       0.95      0.95      0.95       290
           1       0.53      0.57      0.55        30

    accuracy                           0.91       320
   macro avg       0.74      0.76      0.75       320
weighted avg       0.92      0.91      0.91       320

In [51]:
#XGBoost
import xgboost as xgb

model5 = xgb.XGBClassifier(random_state=1)
model5.fit(X_train, y_train)
y_pred5 = model5.predict(X_test)

print(classification_report(y_test, y_pred5))
              precision    recall  f1-score   support

           0       0.97      0.93      0.95       290
           1       0.52      0.73      0.61        30

    accuracy                           0.91       320
   macro avg       0.75      0.83      0.78       320
weighted avg       0.93      0.91      0.92       320

Comparing the five models, the random forest yields the highest accuracy (0.93), followed by XGBoost, AdaBoost, and gradient boosting (0.91 each) and the decision tree (0.90).

However, since XGBoost has the best f1-score (0.61) for predicting good-quality wines, I conclude that XGBoost is the best of the five models.
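
The f1-score comparison can be checked by hand, since F1 is the harmonic mean of precision and recall. The sketch below recomputes it for class 1 (good wines) from the rounded precision and recall values printed in the five reports above. Because the inputs are rounded to two decimals, the results are approximate, although here they agree with the reported f1-scores.

```python
# (precision, recall) for class 1, copied from the reports above
class1_scores = {
    "Decision Tree": (0.48, 0.77),
    "Random Forest": (0.65, 0.50),
    "AdaBoost": (0.52, 0.43),
    "Gradient Boosting": (0.53, 0.57),
    "XGBoost": (0.52, 0.73),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_by_model = {name: f1(p, r) for name, (p, r) in class1_scores.items()}
best_model = max(f1_by_model, key=f1_by_model.get)

for name, score in f1_by_model.items():
    print(f"{name}: {score:.2f}")
print("Highest class-1 F1:", best_model)
```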

