Question

The dataset for this assignment contains house prices as well as 19 other features for each property. Those features are detailed below and include information about the house (number of bedrooms, bathrooms…), the lot (square footage…) and the sale conditions (period of the year…). The overall goal of the assignment is to predict the sale price of a house using linear regression. For this assignment, the training set is in the file "house_prices_train.csv" and the test set is in the file "house_prices_test.csv".

Here is a brief description of each feature in the dataset:

  • SalePrice: the property's sale price in dollars. This is the target variable that you're trying to predict.
  • LotFrontage: Linear feet of street connected to property
  • LotArea: Lot size in square feet
  • YearBuilt: Original construction date
  • BsmtUnfSF: Unfinished square feet of basement area
  • TotalBsmtSF: Total square feet of basement area
  • 1stFlrSF: First Floor square feet
  • 2ndFlrSF: Second floor square feet
  • LowQualFinSF: Low quality finished square feet (all floors)
  • GrLivArea: Above grade (ground) living area square feet
  • FullBath: Full bathrooms above grade
  • HalfBath: Half baths above grade
  • BedroomAbvGr: Number of bedrooms above basement level
  • KitchenAbvGr: Number of kitchens
  • TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
  • GarageCars: Size of garage in car capacity
  • GarageArea: Size of garage in square feet
  • PoolArea: Pool area in square feet
  • MoSold: Month Sold
  • YrSold: Year Sold

I completed the code for question 1a correctly (open the training dataset, remove all rows that contain at least one missing value (NA), and return the clean dataset together with its number of rows), but I need help with the rest of the question. This is my code:

def clean_data():
  import pandas as pd
  data = pd.read_csv('house_prices_train.csv', index_col=0)
  data_train = data.dropna()
  nb_rows = data_train.shape[0]

  return([nb_rows, data_train])

Question 1b:

For the training dataset, print a summary of the variables “LotArea”, “YearBuilt”, “GarageArea”, “BedroomAbvGr” and “SalePrice”. Return the whole summary and a list containing (in that order):

  • The maximum sale price
  • The minimum garage area
  • The first quartile of lot area
  • The second most common year built
  • The mean of BedroomAbvGr

Hint: Use the built-in method describe() for a pandas.DataFrame

Here's the sample code I was given to start with:

def summary(data_train):
  # Code goes here
  # max_sale = maximum sale price in the training dataset
  # min_garea = minimum garage area
  # fstq_lotarea = first quartile of lot area
  # scd_ybuilt = second most common year built
  # mean_bed = mean number of bedrooms above ground
  ### YOUR CODE HERE
  return([max_sale, min_garea, fstq_lotarea, scd_ybuilt, mean_bed])

Question 1c:

Run a linear regression on "SalePrice" using the variables “LotArea”, “YearBuilt”, “GarageArea”, and “BedroomAbvGr”. For each variable, return its regression coefficient in a dictionary similar to this: {“LotArea”: 1.888, “YearBuilt”: -0.06, ...} (this is only an example, not the right answer).

Compute the Root Mean Squared Error (RMSE) using the file "house_prices_test.csv" to measure the out-of-sample performance of the model.

################# Function to fit your Linear Regression Model ###################
def linear_regression_all_variables(data_train):
  from sklearn import linear_model

  # Code goes here
  # dict_coeff = dictionary (key = name of the variable, value = coefficient in the linear
  # regression model)
  # lreg = your linear regression model
  ###
  ### YOUR CODE HERE
  ###

  return([dict_coeff, lreg])

Question 1d:

Refit the model on the training set using all the variables and return the RMSE on the test set.

(The first column "unnamed: 0" is not a variable)

################# Function to compute the Root Mean Squared Error ###################
def compute_mse_test(data_train, data_test):
  from sklearn import linear_model, metrics

  dict_coeff, lreg = linear_regression_all_variables(data_train)
  ###
  ### YOUR CODE HERE
  ###
  # rmse = Root Mean Squared Error
  return(rmse)

def linear_regression_all(data_train, data_test):
  import numpy as np
  from sklearn import linear_model, metrics

  # Code goes here

  # rmse = root mean squared error of the second linear regression on the test dataset
  ###
  ### YOUR CODE HERE
  ###
  rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

  return (rmse)

Solutions

Expert Solution

Since the dataset itself is not available, the code below is my best approximation of the intended approach, based on the description given.

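def summary(data_train):
  # print the describe() summary of the five requested variables
  cols = ['LotArea', 'YearBuilt', 'GarageArea', 'BedroomAbvGr', 'SalePrice']
  print(data_train[cols].describe())
  max_sale = data_train.SalePrice.max()
  min_garea = data_train.GarageArea.min()
  fstq_lotarea = data_train.LotArea.quantile(.25)
  # value_counts() sorts the years by frequency, so index[1] is the second most common;
  # mode() would only list the year(s) tied for most frequent
  scd_ybuilt = data_train.YearBuilt.value_counts().index[1]
  mean_bed = data_train.BedroomAbvGr.mean()

  return([max_sale, min_garea, fstq_lotarea, scd_ybuilt, mean_bed])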
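A quick check on toy data shows how describe(), value_counts() and mode() behave. This is only a sketch: the values below are invented for illustration and are not taken from the assignment's dataset.

```python
import pandas as pd

# Toy data, invented for illustration; not the assignment's dataset.
toy = pd.DataFrame({
    "YearBuilt": [2005, 2005, 2005, 1999, 1999, 2010],
    "LotArea": [8000, 9500, 7200, 11000, 10400, 8800],
})

# describe() reports count, mean, std, min, the quartiles and max per column.
stats = toy[["LotArea"]].describe()
first_quartile = stats.loc["25%", "LotArea"]

# value_counts() sorts by frequency, so position 1 is the second most common value.
second_most_common = toy["YearBuilt"].value_counts().index[1]

# mode() only lists the value(s) tied for most frequent.
modes = toy["YearBuilt"].mode().tolist()

print(first_quartile, second_most_common, modes)
```

Here 2005 appears three times and 1999 twice, so mode() contains only 2005, while value_counts() also ranks 1999 in second place.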
################# Function to fit your Linear Regression Model ###################
def linear_regression_all_variables(data_train):
  from sklearn import linear_model
  lreg = linear_model.LinearRegression()
  X_train=data_train[['LotArea','YearBuilt','GarageArea','BedroomAbvGr']] # using the variables asked for training
  y_train=data_train['SalePrice']
  lreg.fit(X_train, y_train)
  dict_coeff = dict(zip(X_train.columns, lreg.coef_)) # lreg.coef_ holds one coefficient per feature, in the column order of X_train

  return([dict_coeff, lreg])
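To check that coefficient names and values line up, one option is to fit the model on synthetic data whose true coefficients are known. The sketch below assumes made-up column names "a" and "b" and a made-up relationship y = 2*a + 3*b + 1; none of this comes from the assignment.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic data, invented for illustration: y = 2*a + 3*b + 1 with no noise.
rng = np.random.default_rng(0)
X = pd.DataFrame({"a": rng.normal(size=50), "b": rng.normal(size=50)})
y = 2 * X["a"] + 3 * X["b"] + 1

model = linear_model.LinearRegression().fit(X, y)

# coef_ follows the column order of X, so zip keeps names and values aligned.
coeffs = dict(zip(X.columns, model.coef_))
print(coeffs, model.intercept_)
```

Because the data is noise-free, the fitted coefficients should match 2 and 3 up to floating-point error, with an intercept close to 1.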
################# Function to compute the Root Mean Squared Error ###################
def compute_mse_test(data_train, data_test):
  import numpy as np
  from sklearn import metrics

  dict_coeff, lreg = linear_regression_all_variables(data_train)
  X_test=data_test[['LotArea','YearBuilt','GarageArea','BedroomAbvGr']] # same features, in the same order, as in training
  y_test=data_test['SalePrice']
  y_pred = lreg.predict(X_test) # assumes data_test has no missing values in these columns
  rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred)) # RMSE is the square root of the mean squared error

  return(rmse)
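For reference, the RMSE can also be computed by hand, which makes the formula explicit. The prices below are made up for illustration only.

```python
import numpy as np

# Made-up prices, for illustration only.
y_true = np.array([200000.0, 150000.0, 320000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0])

# RMSE = sqrt(mean((y_true - y_pred)^2)), expressed in the same units as the target.
errors = y_true - y_pred
rmse = np.sqrt(np.mean(errors ** 2))
print(rmse)
```

The errors are -10000, 10000 and 20000 dollars, so the mean squared error is 2e8 and the RMSE is roughly 14142 dollars.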
def linear_regression_all(data_train, data_test):
  import numpy as np
  from sklearn import linear_model, metrics
  lreg = linear_model.LinearRegression()
  # use every column except the target; also drop "Unnamed: 0" in case the file was read without index_col=0
  X_train=data_train.drop(columns=['SalePrice', 'Unnamed: 0'], errors='ignore')
  y_train=data_train['SalePrice']
  lreg.fit(X_train, y_train)
  X_test=data_test[X_train.columns] # same columns, in the same order, as in training
  y_test=data_test['SalePrice']
  y_pred = lreg.predict(X_test) # assumes data_test has no missing values

  rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

  return (rmse)
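The note that the first column is not a variable matters when the CSV is loaded. The sketch below uses a hypothetical two-row CSV (the real files are not available here) to show how pandas treats an unnamed first column with and without index_col=0.

```python
import io
import pandas as pd

# Hypothetical two-row CSV whose first column is an unnamed row index.
csv_text = ",LotArea,SalePrice\n0,8450,208500\n1,9600,181500\n"

# Without index_col, the first column becomes a feature called "Unnamed: 0".
with_index_column = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, it becomes the row index and stays out of the features.
without_index_column = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_column.columns))
print(list(without_index_column.columns))
```

If the test file is read the same way as the training file in clean_data(), both datasets end up with the same feature columns.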
