In: Computer Science
Explain whether the lines from "import pandas as pd" to "#predicting the value" answer question 1.
Also explain what is supposed to be found using the lines from "#Income: Avg. area income" to "#Address."
What does this mean "#Also try to see if the model performance can be improved with feature selection."
#1. Fit a linear regression model on the data USA_Housing.csv to predict the Price of the house.
import pandas as pd
from sklearn.linear_model import LinearRegression
#loading the dataset
housing_df = pd.read_csv('USA_Housing.csv')
#dropping the Address feature as it is not important in building the model
housing_df = housing_df.drop(['Address'], axis=1)
#separating input features and the output feature
x = housing_df.drop(['Price'], axis=1)
y = housing_df['Price']
#training the model
reg = LinearRegression()
reg.fit(x,y)
#predicting the value
predicted_values = reg.predict(x)
print(predicted_values)
#Income: Avg. area income
#Age: Avg age of the houses
#Bedrooms: Avg No of bedrooms
#Rooms: Avg No of rooms
#Population: Population of the area
#Price: Average price in the area
#Address: think of them as different ZIP codes
#Also try to see if the model performance can be improved with feature selection.
Explanation of the lines from import pandas as pd to #predicting the value
#this imports the Python pandas library and renames it pd. pandas is used to manipulate tabular data.
import pandas as pd
#this imports the LinearRegression model from the scikit-learn (sklearn) library
from sklearn.linear_model import LinearRegression
#loading the dataset - this loads the dataset and stores it in the variable housing_df.
#since the dataset is a CSV file, the pd.read_csv function is used to load the data.
housing_df = pd.read_csv('USA_Housing.csv')
#dropping the Address feature as it is not important in building the model
#since the Address column is not a numerical feature, it is not required for building the Linear Regression model,
#so it is dropped.
housing_df = housing_df.drop(['Address'], axis=1)
#separating input features and the output feature
#to train a machine learning model, the input features and the output feature must be given to the model separately.
#the input features are stored in the variable x and the output feature is stored in the variable y.
x = housing_df.drop(['Price'], axis=1)
#since the price needs to be predicted, it is made the output feature
y = housing_df['Price']
#training the model
#creating a LinearRegression object
#and giving x and y as inputs to the model
#the fit function figures out the relation between x and y
reg = LinearRegression()
reg.fit(x,y)
#predicting the values
#once the model is trained, we can give it input to get predictions.
#the predicted values are stored in the variable predicted_values
predicted_values = reg.predict(x)
#printing the predicted values
print(predicted_values)
Nothing needs to be found using the lines from "#Income" to "#Address." These are the column names of the dataset, so those lines simply describe what each column contains.
housing_df.head()
Type the above code in your Python IDE to see the dataset and its columns; a screenshot is also given for reference. The head() function prints the top five rows of the dataset.
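For readers without the CSV file at hand, here is a minimal sketch of what head() displays. The column names follow the descriptions above; the values (and the Address strings) are invented for illustration, not the real rows of USA_Housing.csv.

```python
import pandas as pd

# Illustrative stand-in for USA_Housing.csv: column names match the
# descriptions above, but these three rows are made up.
demo_df = pd.DataFrame({
    'Avg. Area Income': [65000.0, 72000.0, 58000.0],
    'Avg. Area House Age': [5.5, 6.1, 4.8],
    'Area Population': [31000.0, 24500.0, 40200.0],
    'Price': [1150000.0, 1320000.0, 980000.0],
    'Address': ['ZIP 10001', 'ZIP 20002', 'ZIP 30003'],
})

# head() returns the first five rows (here, all three)
print(demo_df.head())
```

With the real file, the same call shows the first five of its several thousand rows.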
Also try to see if the model performance can be improved with feature selection
The model's score (mean squared error) can be computed with the code given below.
from sklearn.metrics import mean_squared_error
score = mean_squared_error(y, predicted_values)
print(score)
Feature selection can be done as follows.
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import SelectKBest
#calculating feature scores
model = SelectKBest(score_func=f_regression, k='all')
model.fit(x, y)
#printing feature scores
for i in range(len(model.scores_)):
    print('Feature %d: %f' % (i, model.scores_[i]))
x_selected = housing_df[['Avg. Area Income', 'Avg. Area House Age','Area Population' ]]
reg = LinearRegression()
reg.fit(x_selected, y)
#predicting the values against the best features
pred_new = reg.predict(x_selected)
#score calculation
score1 = mean_squared_error(y, pred_new)
print(score1)
Execute the code in the Python IDE to print score and score1, which are the scores with all the features and with the selected best features, respectively.
The values are as follows:
score = 10219734313.253006
score1 = 25175920206.028957
The final conclusion is that performance does not improve after this feature selection: the mean squared error with the selected features (score1) is higher than with all the features (score).
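Note that both scores above are computed on the same data the model was trained on; to judge whether feature selection really helps, performance should be compared on data the model has not seen. A minimal sketch of that comparison, using synthetic stand-in data (since USA_Housing.csv is not included here) where one feature is pure noise:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in for the housing data: Price depends on income and age,
# while rooms is pure noise, so dropping it should not hurt the model.
income = rng.normal(65000, 10000, n)
age = rng.normal(6, 1, n)
rooms = rng.normal(7, 1, n)
price = 20 * income + 100000 * age + rng.normal(0, 50000, n)

X_all = np.column_stack([income, age, rooms])   # all features
X_sel = np.column_stack([income, age])          # after "feature selection"

# split both feature sets with the same row split
Xa_tr, Xa_te, Xs_tr, Xs_te, y_tr, y_te = train_test_split(
    X_all, X_sel, price, random_state=0)

mse_all = mean_squared_error(y_te, LinearRegression().fit(Xa_tr, y_tr).predict(Xa_te))
mse_sel = mean_squared_error(y_te, LinearRegression().fit(Xs_tr, y_tr).predict(Xs_te))
print(mse_all, mse_sel)
```

Whichever feature set gives the lower test-set MSE generalizes better; running the same comparison on the real dataset is the proper way to decide whether dropping features improves performance.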