Question

Explain whether the lines from "import pandas as pd" to "#predicting the value" answer question 1.

Also explain what is supposed to be found using the lines from "#Income: Avg. area income" to "#Address".

What does "#Also try to see if the model performance can be improved with feature selection." mean?

#1. Fit a linear regression model on data: USA_Housing.csv to predict the Price of the house.
import pandas as pd
from sklearn.linear_model import LinearRegression

#loading the dataset
housing_df = pd.read_csv('USA_Housing.csv')

#dropping the Address feature as it is not important in building the model
housing_df = housing_df.drop(['Address'], axis=1)

#separating input features and output feature
x = housing_df.drop(['Price'], axis=1)
y = housing_df['Price']

#training the model
reg = LinearRegression()
reg.fit(x,y)

#predicting the value
predicted_values = reg.predict(x)
print(predicted_values)

#Income: Avg. area income
#Age: Avg age of the houses
#Bedrooms: Avg No of bedrooms
#Rooms: Avg No of rooms
#Population: Population of the area
#Price: Average price in the area
#Address: Think of them as different ZIP codes

#Also try to see if the model performance can be improved with feature selection.

Solutions

Expert Solution

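Explanation of the lines from "import pandas as pd" to "#predicting the value"

Yes, these lines answer question 1: they load USA_Housing.csv, fit a linear regression model on it, and predict Price. Each step is explained in the comments below.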

#this imports the pandas library under the alias pd. pandas is used to manipulate tabular data.
import pandas as pd

#this will import the LinearRegression model from the sklearn framework
from sklearn.linear_model import LinearRegression

#loading the dataset - this loads the dataset and stores it in the variable housing_df.
#since the dataset is a CSV file, the pd.read_csv function is used to load it.
housing_df = pd.read_csv('USA_Housing.csv')

#dropping the Address feature as it is not important in building the model
#the Address column is text rather than a numerical feature, so LinearRegression cannot use it directly.
#So, it is dropped.
housing_df = housing_df.drop(['Address'], axis=1)

#separating input features and output feature
#to train the machine learning model, both the input features and the output feature must be passed to it.
#the input features are stored in the variable x and the output feature in the variable y.
x = housing_df.drop(['Price'], axis=1)

#since the price needs to be predicted, it is used as the output feature
y = housing_df['Price']

#training the model
#creating a LinearRegression object
#and passing x and y to its fit function,
#which estimates one coefficient per input feature plus an intercept that best relate x to y
reg = LinearRegression()
reg.fit(x,y)

#predicting the value
#once the model is trained, inputs can be passed to it to get predictions.
#here the training inputs x are reused, and the predictions are stored in the variable predicted_values
predicted_values = reg.predict(x)

#printing the predicted values
print(predicted_values)
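
To make concrete what fit() works out, here is a minimal sketch on made-up, noise-free data rather than USA_Housing.csv. The names features, target and demo_reg are illustrative assumptions; only LinearRegression, fit, coef_ and intercept_ come from scikit-learn. Because the target is built as 3*a + 2*b + 5, the fitted model should recover weights close to 3 and 2 and an intercept close to 5.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, noise-free data (an assumption for illustration): target = 3*a + 2*b + 5
rng = np.random.default_rng(42)
features = rng.uniform(0, 10, size=(50, 2))
target = 3 * features[:, 0] + 2 * features[:, 1] + 5

demo_reg = LinearRegression()
demo_reg.fit(features, target)

# coef_ holds one learned weight per input column, intercept_ the constant term.
print(demo_reg.coef_)
print(demo_reg.intercept_)
```

On the housing data, the same two attributes of reg would show how much the predicted Price changes per unit of each input column.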

Nothing needs to be computed from the lines "#Income" to "#Address". They are descriptions of the columns in the dataset, given so that you know what each feature means.

housing_df.head()

Run the line above in your Python IDE to see the dataset and its columns. The head() function returns the first 5 rows of the dataset by default; wrap it in print() when running it as a script.

Also try to see if the model performance can be improved with feature selection

  • This is a regression problem because a continuous value (Price) is predicted, so model performance is measured with metrics such as MSE (Mean Squared Error) or RMSE (Root Mean Squared Error); lower is better.
  • Columns are also called features, so feature selection means choosing the subset of columns that is most useful for predicting the target.
  • Hence, refit the model on the selected features and compare the metric to see whether performance improves or degrades.
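
As a small worked example of the two metrics named above, the sketch below computes MSE and RMSE by hand. The four actual and predicted values are made up for illustration and are not taken from the housing data.

```python
import math

# Made-up values (an assumption for illustration only).
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [130.0, 160.0, 300.0, 400.0]

# MSE: the average of the squared differences between actual and predicted values.
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / len(squared_errors)

# RMSE: the square root of MSE, expressed in the same units as the target.
rmse = math.sqrt(mse)

print(mse, rmse)  # prints 625.0 25.0
```

RMSE is often easier to read than MSE because it is on the same scale as Price.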

The baseline score with all features can be computed with the code below.

from sklearn.metrics import mean_squared_error
score = mean_squared_error(y, predicted_values)
print(score)
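
The score above is measured on the same rows the model was trained on. A common alternative, sketched here under assumptions, is to hold out part of the data and score on that. The DataFrame demo_df and its values are a synthetic stand-in for USA_Housing.csv, and the 80/20 split and random_state are arbitrary choices, not part of the original solution.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (all values are made up).
rng = np.random.default_rng(0)
n_rows = 500
demo_df = pd.DataFrame({
    "Income": rng.normal(68000, 10000, n_rows),
    "Age": rng.normal(6, 1, n_rows),
    "Population": rng.normal(36000, 10000, n_rows),
})
demo_df["Price"] = (
    20 * demo_df["Income"]
    + 160000 * demo_df["Age"]
    + 15 * demo_df["Population"]
    + rng.normal(0, 100000, n_rows)
)

X = demo_df.drop(columns=["Price"])
y = demo_df["Price"]

# Hold out 20% of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

reg = LinearRegression()
reg.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, reg.predict(X_train))
test_mse = mean_squared_error(y_test, reg.predict(X_test))
print("train MSE:", train_mse)
print("test MSE:", test_mse)
```

With the real dataset, X and y would be built from housing_df exactly as in the solution, and only the split and the two scoring lines would be new.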

Feature selection can be done as follows.

from sklearn.feature_selection import f_regression
from sklearn.feature_selection import SelectKBest

#calculating feature scores
model = SelectKBest(score_func=f_regression, k='all')
model.fit(x, y)


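#printing feature scores (a higher F-score means a stronger linear relationship with Price)
#the loop variable is named i so that it does not overwrite the input features stored in x
for i in range(len(model.scores_)):
    print('Feature %d: %f' % (i, model.scores_[i]))

#keeping only the features chosen as the best ones
x_selected = housing_df[['Avg. Area Income', 'Avg. Area House Age', 'Area Population']]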

reg = LinearRegression()
reg.fit(x_selected, y)

#predicting the values against the best features
pred_new = reg.predict(x_selected)

#score calculation
score1 = mean_squared_error(y, pred_new)
print(score1)

Run the code in your Python IDE to print score and score1, which are the MSE with all the features and with the selected features respectively.

The values are as follows:

score = 10219734313.253006

score1 = 25175920206.028957

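The MSE rises from about 1.02e10 with all features to about 2.52e10 with the three selected features (an RMSE of roughly 101,000 versus 159,000), so this feature selection does not improve performance; it makes it worse. Note that both scores are computed on the training data, where dropping columns from a least-squares fit can never lower the MSE, so a held-out test set gives a fairer comparison.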
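
One way to make this comparison less dependent on the training rows is to let SelectKBest pick the columns and compare cross-validated MSE. The sketch below is an illustration under assumptions, not the original solution: demo_X, demo_y, the column names, k=2 and the helper cv_mse are all hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: two informative columns and one pure-noise column.
rng = np.random.default_rng(1)
n_rows = 300
demo_X = pd.DataFrame({
    "signal_a": rng.normal(size=n_rows),
    "signal_b": rng.normal(size=n_rows),
    "noise_c": rng.normal(size=n_rows),
})
demo_y = (
    5 * demo_X["signal_a"]
    + 3 * demo_X["signal_b"]
    + rng.normal(scale=0.5, size=n_rows)
)

# Keep the 2 columns with the highest F-scores instead of picking them by hand.
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(demo_X, demo_y)
kept_columns = list(demo_X.columns[selector.get_support()])
print("kept:", kept_columns)


def cv_mse(columns):
    """Return the mean 5-fold cross-validated MSE using only the given columns."""
    scores = cross_val_score(
        LinearRegression(), demo_X[columns], demo_y,
        cv=5, scoring="neg_mean_squared_error",
    )
    return -scores.mean()  # scikit-learn reports negated MSE, so flip the sign


mse_all = cv_mse(list(demo_X.columns))
mse_kept = cv_mse(kept_columns)
print("all features:", mse_all)
print("selected features:", mse_kept)
```

For simplicity the selector here is fitted on all rows before cross-validation; a stricter setup would place the selector and the model in a scikit-learn Pipeline so that selection is repeated inside each fold.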

