Question

This is a machine learning question.

Using the Kaggle diamonds dataset, build a KNN-based estimator for the price of a diamond and propose an appropriate value of K.

Please use Python in Google Colab format. Thank you!

Solutions

Expert Solution

import numpy as np # linear algebra

import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt

import seaborn as sns

import sklearn.utils # provides sklearn.utils.shuffle, used below

from sklearn import preprocessing

df = pd.read_csv('../input/diamonds.csv') # Kaggle kernel path; in Colab, point this at wherever diamonds.csv was uploaded

df.head()

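# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable histogram plus density curve
sns.FacetGrid(df, hue = 'cut', height = 6).map(sns.histplot, 'price', kde = True).add_legend()

plt.show()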

# Encode the ordered categories as integers, with the worst grade mapped to 1
cut_dict = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}

clarity_dict = {'I1': 1, 'SI2': 2, 'SI1': 3, 'VS2': 4, 'VS1': 5, 'VVS2': 6, 'VVS1': 7, 'IF': 8}

color_dict = {'D': 7, 'E': 6, 'F': 5, 'G': 4, 'H': 3, 'I': 2, 'J': 1}

df['cut'] = df['cut'].map(cut_dict)

df['clarity'] = df['clarity'].map(clarity_dict)

df['color'] = df['color'].map(color_dict)
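Series.map returns NaN for any label that is missing from the dictionary, so a misspelled or unexpected category would silently turn into a missing value. A minimal sketch of a sanity check follows, using a small made-up frame; the toy values and the helper name encode_ordinal are illustrative assumptions and are not part of the solution above.

```python
import pandas as pd

# Ordinal encoding copied from the solution above.
cut_dict = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}


def encode_ordinal(series, mapping):
    """Map labels to ordinal codes and fail loudly on unknown labels.

    Hypothetical helper, for illustration only.
    """
    encoded = series.map(mapping)
    # Labels that did not appear in the mapping come back as NaN.
    unknown = series[encoded.isna()].unique()
    if len(unknown) > 0:
        raise ValueError("Unmapped labels: {}".format(list(unknown)))
    return encoded.astype(int)


# Toy data standing in for df['cut'].
toy = pd.DataFrame({'cut': ['Ideal', 'Fair', 'Very Good']})
toy['cut'] = encode_ordinal(toy['cut'], cut_dict)
print(toy['cut'].tolist())
```

The same helper could be applied to the clarity and color columns with their own dictionaries.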

df = df.drop('Unnamed: 0', axis = 1) # drop the row-index column that was saved into the CSV

df.head()

df.isnull().any()

df = sklearn.utils.shuffle(df, random_state = 42)

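X = df.drop(['price'], axis = 1).values

y = df['price'].values # price left in dollars; R^2 is unaffected by rescaling the target

test_size = 200

X_train = X[: -test_size]

y_train = y[: -test_size]

X_test = X[-test_size :]

y_test = y[-test_size :]

scaler = preprocessing.StandardScaler().fit(X_train) # learn mean/std from the training rows only, so the test rows do not leak into the scaling

X_train = scaler.transform(X_train)

X_test = scaler.transform(X_test)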
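KNN relies on distances, so a feature measured on a much larger scale dominates the neighbor search unless the features are standardized. One way to keep the scaling statistics tied to the training rows is a scikit-learn pipeline. The sketch below uses synthetic data from make_regression as a stand-in for the diamonds features; the sample sizes, n_neighbors=5, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diamonds features and price.
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
X_demo[:, 0] *= 1000.0  # exaggerate the scale of one feature

test_size = 100
X_tr, X_te = X_demo[:-test_size], X_demo[-test_size:]
y_tr, y_te = y_demo[:-test_size], y_demo[-test_size:]

# The scaler is fitted on X_tr only, at the moment the pipeline is fitted.
model = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=5, weights='distance', p=1),
)
model.fit(X_tr, y_tr)
r2_scaled = model.score(X_te, y_te)

# Same estimator without scaling, for comparison.
raw = KNeighborsRegressor(n_neighbors=5, weights='distance', p=1).fit(X_tr, y_tr)
r2_raw = raw.score(X_te, y_te)

print("R^2 with scaling: {:.3f}, without: {:.3f}".format(r2_scaled, r2_raw))
```

Because the pipeline bundles the scaler with the regressor, calling model.predict on new rows applies the training-set scaling automatically.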

from sklearn.neighbors import KNeighborsRegressor

score = []

for k in range(1, 20): # try K = 1..19 and record the R^2 score on the test rows for each

    clf = KNeighborsRegressor(n_neighbors = k, weights = 'distance', p = 1)

    clf.fit(X_train, y_train)

    score.append(clf.score(X_test, y_test))

k_max = score.index(max(score)) + 1 # list index 0 corresponds to K = 1

print("At K = {}, max R^2 = {:.4f}".format(k_max, max(score)))

clf = KNeighborsRegressor(n_neighbors = k_max, weights = 'distance', p=1)

clf.fit(X_train, y_train)

print(clf.score(X_test, y_test ))

y_pred = clf.predict(X_test)
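Picking K by its score on the same 200 test rows that are used for the final report tends to give an optimistic estimate. A common alternative is to choose K by cross-validation on the training rows only and keep the test rows for a single final check. A minimal sketch, again on synthetic data: the grid 1 to 19 mirrors the loop above, while cv=5, the step names 'scale' and 'knn', and the data itself are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diamonds features and price.
X_demo, y_demo = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)
X_tr, X_te = X_demo[:-100], X_demo[-100:]
y_tr, y_te = y_demo[:-100], y_demo[-100:]

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsRegressor(weights='distance', p=1)),
])

# 5-fold cross-validation over K = 1..19, scored by R^2 on each validation fold.
search = GridSearchCV(
    pipe,
    param_grid={'knn__n_neighbors': list(range(1, 20))},
    cv=5,
    scoring='r2',
)
search.fit(X_tr, y_tr)

best_k = search.best_params_['knn__n_neighbors']
print("Best K by 5-fold CV:", best_k)
print("Mean CV R^2 at best K: {:.3f}".format(search.best_score_))
print("Held-out R^2: {:.3f}".format(search.score(X_te, y_te)))
```

On the real data, X_train and y_train would take the place of X_tr and y_tr, and the reported best_k would be the proposed K value.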

