The dataset for this assignment contains house prices as well as 19 other features for each property. Those features are detailed below and include information about the house (number of bedrooms, bathrooms, ...), the lot (square footage, ...), and the sale conditions (period of the year, ...). The overall goal of the assignment is to predict the sale price of a house using linear regression. For this assignment, the training set is in the file "house_prices_train.csv" and the test set is in the file "house_prices_test.csv".
Here is a brief description of each feature in the dataset:
I completed the code correctly for question 1a (open the training dataset, remove all rows that contain at least one missing value (NA), and return the new clean dataset and the number of rows in that dataset), but I need help with the rest of the question. This is my code:
def clean_data():
    import pandas as pd
    # Read the training data; the first column is the row index, not a variable
    data = pd.read_csv('house_prices_train.csv', index_col=0)
    # Drop every row that contains at least one missing value (NA)
    data_train = data.dropna()
    nb_rows = data_train.shape[0]
    return [nb_rows, data_train]
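For what it's worth, a quick sanity check of this function (assuming "house_prices_train.csv" sits in the working directory) could look like:

nb_rows, data_train = clean_data()
print(nb_rows)            # number of rows left after dropping missing values
print(data_train.head())  # first few rows of the cleaned training set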
Question 1b:
For the training dataset, print a summary of the variables "LotArea", "YearBuilt", "GarageArea", "BedroomAbvGr", and "SalePrice". Return the whole summary and a list containing, in that order, the five quantities described in the starter code comments below.
Hint: Use the built-in method describe() for a pandas.DataFrame
Here's the sample code I was given to start off:
def summary(data_train):
    # Code goes here
    # max_sale = maximum sale price in the training dataset
    # min_garea = minimum garage area
    # fstq_lotarea = first quartile of lot area
    # scd_ybuilt = second most common year built
    # mean_bed = mean number of bedrooms above ground
    ### YOUR CODE HERE
    return [max_sale, min_garea, fstq_lotarea, scd_ybuilt, mean_bed]
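As the hint suggests, describe() gives most of these quantities directly; a rough sketch of how the summary table maps to them (column names taken from the assignment):

cols = ['LotArea', 'YearBuilt', 'GarageArea', 'BedroomAbvGr', 'SalePrice']
stats = data_train[cols].describe()
print(stats)  # the "whole summary" asked for in the question

# Individual values can be read straight off the describe() table, e.g.:
max_sale = stats.loc['max', 'SalePrice']      # maximum sale price
fstq_lotarea = stats.loc['25%', 'LotArea']    # first quartile of lot area
# The second most common year built is not in describe(); value_counts() handles it (see below).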
Question 1c:
Run a linear regression on "SalePrice" using the variables "LotArea", "YearBuilt", "GarageArea", and "BedroomAbvGr". For each variable, return the coefficient estimated by the regression in a dictionary similar to this: {"LotArea": 1.888, "YearBuilt": -0.06, ...} (this is only an example, not the right answer).
Compute the Root Mean Squared Error (RMSE) on the test set in "house_prices_test.csv" to measure the out-of-sample performance of the model.
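For reference, the RMSE is just the square root of the average squared prediction error on the test set; a minimal sketch of that formula in NumPy (equivalent to what sklearn.metrics computes):

import numpy as np

def rmse(y_true, y_pred):
    # RMSE = sqrt( mean( (y_true - y_pred)^2 ) )
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))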
################# Function to fit your Linear Regression Model ###################
def linear_regression_all_variables(data_train):
    from sklearn import linear_model
    # Code goes here
    # dict_coeff = dictionary (key = name of the variable,
    #              value = coefficient in the linear regression model)
    # lreg = your linear regression model
    ###
    ### YOUR CODE HERE
    ###
    return [dict_coeff, lreg]
Question 1d:
Refit the model on the training set using all the variables and return the RMSE on the test set.
(The first column "unnamed: 0" is not a variable)
################# Function to compute the Root Mean Squared Error ###################
def compute_mse_test(data_train, data_test):
    from sklearn import linear_model, metrics
    dict_coeff, lreg = linear_regression_all_variables(data_train)
    ###
    ### YOUR CODE HERE
    ###
    # rmse = Root Mean Squared Error
    return rmse
def linear_regression_all(data_train, data_test):
    import numpy as np
    from sklearn import linear_model, metrics
    # Code goes here
    # rmse = root mean squared error of the second linear regression
    #        on the test dataset
    ###
    ### YOUR CODE HERE
    ###
    rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
    return rmse
Since the dataset itself cannot be shared, here is my best approximation of the approach based on the information above:
def summary(data_train):
    # Print the summary asked for in the question
    print(data_train[['LotArea', 'YearBuilt', 'GarageArea', 'BedroomAbvGr', 'SalePrice']].describe())
    max_sale = data_train.SalePrice.max()
    min_garea = data_train.GarageArea.min()
    fstq_lotarea = data_train.LotArea.quantile(.25)
    # value_counts() sorts years by frequency, so its second index entry is the
    # second most common year built; mode() only returns tied most-frequent
    # values, so mode()[1] would usually fail here
    scd_ybuilt = data_train.YearBuilt.value_counts().index[1]
    mean_bed = data_train.BedroomAbvGr.mean()
    return [max_sale, min_garea, fstq_lotarea, scd_ybuilt, mean_bed]
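A quick way to check this is to chain it with the function from 1a (purely illustrative, the output depends on the actual file):

nb_rows, data_train = clean_data()
print(summary(data_train))  # [max_sale, min_garea, fstq_lotarea, scd_ybuilt, mean_bed]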
################# Function to fit your Linear Regression Model ###################
def linear_regression_all_variables(data_train):
    from sklearn import linear_model
    lreg = linear_model.LinearRegression()
    # Use only the four variables asked for in the question
    X_train = data_train[['LotArea', 'YearBuilt', 'GarageArea', 'BedroomAbvGr']]
    y_train = data_train['SalePrice']
    lreg.fit(X_train, y_train)
    # lreg.coef_ holds the coefficients in the same order as the columns of X_train
    dict_coeff = {'LotArea': lreg.coef_[0],
                  'YearBuilt': lreg.coef_[1],
                  'GarageArea': lreg.coef_[2],
                  'BedroomAbvGr': lreg.coef_[3]}
    return [dict_coeff, lreg]
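A slightly more robust way to build dict_coeff, should the list of variables ever change, is to zip the column names with the coefficients (same result as the explicit dictionary above):

dict_coeff = dict(zip(X_train.columns, lreg.coef_))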
################# Function to compute the Root Mean Squared Error ###################
def compute_mse_test(data_train, data_test):
    import numpy as np
    from sklearn import linear_model, metrics
    # Reuse the model fitted on the training set in question 1c
    dict_coeff, lreg = linear_regression_all_variables(data_train)
    X_test = data_test[['LotArea', 'YearBuilt', 'GarageArea', 'BedroomAbvGr']]
    y_test = data_test['SalePrice']
    y_pred = lreg.predict(X_test)
    rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
    return rmse
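A possible way to call it, assuming the test file has the same layout as the training file and should be cleaned the same way as in 1a:

import pandas as pd

nb_rows, data_train = clean_data()
data_test = pd.read_csv('house_prices_test.csv', index_col=0).dropna()  # assumption: same NA cleaning as 1a
print(compute_mse_test(data_train, data_test))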
def linear_regression_all(data_train, data_test):
    import numpy as np
    from sklearn import linear_model, metrics
    lreg = linear_model.LinearRegression()
    # Use every column except 'SalePrice', which is the target we predict
    X_train = data_train.loc[:, data_train.columns != 'SalePrice']
    y_train = data_train['SalePrice']
    lreg.fit(X_train, y_train)
    X_test = data_test.loc[:, data_test.columns != 'SalePrice']
    y_test = data_test['SalePrice']
    y_pred = lreg.predict(X_test)
    rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
    return rmse
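One caveat, since the actual file cannot be inspected: if any of the remaining columns are non-numeric, LinearRegression will raise an error when fitting on all variables. A possible workaround, only needed if that is actually the case, is to restrict the feature matrix to numeric columns before fitting:

# Assumption: the file may contain non-numeric columns; keep numeric ones only
X_train = data_train.drop(columns=['SalePrice']).select_dtypes(include='number')
X_test = data_test[X_train.columns]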