Dataset
The scikit-learn sklearn.datasets module includes some small datasets for experimentation. In this project we will use the Boston house prices dataset to try to predict the median value of a home from several features of its neighborhood.
The section on scikit-learn in Sergiy Kolesnikov’s blog article Datasets in Python shows how to load this dataset and examine it using pandas DataFrames.
Reminder: while you will use scikit-learn to obtain the dataset, your linear regression implementation must use NumPy directly.
Experiments
Run the following experiments in a Jupyter notebook, performing each action in a code cell and answering each question in a Markdown cell.
(1) Load and examine the Boston dataset’s features, target values, and description.
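A minimal loading sketch. The variable names X and t, and the use of OpenML, are this sketch's assumptions; sklearn.datasets.load_boston() itself was deprecated in scikit-learn 1.0 and removed in 1.2, so newer versions must fetch the same data another way:

```python
# Sketch: fetch the Boston data from OpenML, since
# sklearn.datasets.load_boston() was removed in scikit-learn 1.2.
from sklearn.datasets import fetch_openml

boston = fetch_openml(name="boston", version=1, as_frame=True)
X = boston.data.astype(float)    # 13 features (a few arrive as categorical)
t = boston.target.astype(float)  # MEDV: median home value in $1000s

print(boston.DESCR)  # dataset description
X.describe()         # per-feature summary statistics
```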
(2) Use sklearn.model_selection.train_test_split() to split the features and values into separate training and test sets. Use 80% of the original data as a training set, and 20% for testing.
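A sketch of the split, reusing X and t from the cell above; the random_state value is an arbitrary assumption added only to make the split reproducible:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing (80/20 split)
X_train, X_test, t_train, t_test = train_test_split(
    X, t, test_size=0.2, random_state=0)
```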
(3) Create a scatterplot of the training set showing the relationship between the feature LSTAT and the target value MEDV. Does the relationship appear to be linear?
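One way to draw the plot with matplotlib, continuing the same notebook:

```python
import matplotlib.pyplot as plt

plt.scatter(X_train["LSTAT"], t_train, s=10)
plt.xlabel("LSTAT (% lower status of the population)")
plt.ylabel("MEDV (median home value, $1000s)")
plt.title("Training set: MEDV vs. LSTAT")
plt.show()
```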
(4) With LSTAT as X and MEDV as t, use np.linalg.inv() to compute w for the training set. What is the equation for MEDV as a linear function of LSTAT?
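A sketch of the closed-form fit via the normal equations, w = (XᵀX)⁻¹Xᵀt, with a column of ones standing in for the bias term:

```python
import numpy as np

x = X_train["LSTAT"].to_numpy()
Xmat = np.column_stack([np.ones_like(x), x])  # bias column + LSTAT column

# Normal equations: w = (X^T X)^{-1} X^T t
w = np.linalg.inv(Xmat.T @ Xmat) @ Xmat.T @ t_train.to_numpy()
print(f"MEDV = {w[0]:.2f} + ({w[1]:.2f}) * LSTAT")
```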
(5) Use w to add a line to your scatterplot from experiment (3). How well does the model appear to fit the training set?
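Overlaying the fitted line on the training scatterplot, reusing x and w from the previous cells:

```python
xs = np.linspace(x.min(), x.max(), 100)

plt.scatter(x, t_train, s=10)
plt.plot(xs, w[0] + w[1] * xs, color="red", label="linear fit")
plt.xlabel("LSTAT")
plt.ylabel("MEDV")
plt.legend()
plt.show()
```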
(6) Use w to find the response for each value of the LSTAT attribute in the test set, then compute the test MSE for the model.
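A sketch of the test-set evaluation; x_test and y_pred are names introduced here:

```python
x_test = X_test["LSTAT"].to_numpy()
y_pred = w[0] + w[1] * x_test  # model response on the test set

# Mean squared error: average of (t_n - y_n)^2 over the test points
mse = np.mean((t_test.to_numpy() - y_pred) ** 2)
print(f"Test MSE (linear in LSTAT): {mse:.2f}")
```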
(7) Now add an x² column to LSTAT’s x column in the training set, then repeat experiments (4), (5), and (6) for MEDV as a quadratic function of LSTAT. Does the quadratic polynomial do a better job of predicting the values in the test set?
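The same normal-equation recipe applies once the design matrix gains an x² column; a sketch reusing x and x_test:

```python
# Design matrix with bias, x, and x^2 columns
Xmat2 = np.column_stack([np.ones_like(x), x, x ** 2])
w2 = np.linalg.inv(Xmat2.T @ Xmat2) @ Xmat2.T @ t_train.to_numpy()

# Evaluate the quadratic model on the test set
y2 = w2[0] + w2[1] * x_test + w2[2] * x_test ** 2
mse2 = np.mean((t_test.to_numpy() - y2) ** 2)
print(f"Test MSE (quadratic in LSTAT): {mse2:.2f}")
```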
(8) Repeat experiment (4) with all 13 input features as X, this time using np.linalg.solve(). (See the Appendix to Linear regression in vector and matrix format for details.) Does adding additional features improve the performance on the test set compared to using only LSTAT?
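With all 13 features, a bias column is stacked onto the full feature matrix, and np.linalg.solve() solves (AᵀA)w = Aᵀt directly without forming the inverse, which is cheaper and numerically safer. A sketch:

```python
# Bias column + all 13 features
A_train = np.column_stack([np.ones(len(X_train)), X_train.to_numpy()])
A_test = np.column_stack([np.ones(len(X_test)), X_test.to_numpy()])

# Solve (A^T A) w = A^T t rather than inverting A^T A
w_all = np.linalg.solve(A_train.T @ A_train, A_train.T @ t_train.to_numpy())
y_all = A_test @ w_all
print(f"Test MSE (13 linear features): {np.mean((t_test.to_numpy() - y_all) ** 2):.2f}")
```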
(9) Now add x² columns for all 13 features, and repeat experiment (8). Does adding quadratic features improve the performance on the test set compared to using only linear features?
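One way to square every feature; the design() helper is this sketch's own convenience, not a library function:

```python
def design(X, degree=2):
    """Bias column plus x, x^2, ..., x^degree for every feature."""
    M = X.to_numpy()
    return np.column_stack([np.ones(len(M))] + [M ** d for d in range(1, degree + 1)])

A2_train, A2_test = design(X_train), design(X_test)
w_quad = np.linalg.solve(A2_train.T @ A2_train, A2_train.T @ t_train.to_numpy())
y_quad = A2_test @ w_quad
print(f"Test MSE (quadratic features): {np.mean((t_test.to_numpy() - y_quad) ** 2):.2f}")
```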
(10) Compute the training MSE for experiments (8) and (9) and compare it to the test MSE. What explains the difference?
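Training error for the two models above, reusing their design matrices and weights; comparing it with the test MSE shows how much each model overfits:

```python
# Training error for the 13-feature linear and quadratic models
train_mse_lin = np.mean((t_train.to_numpy() - A_train @ w_all) ** 2)
train_mse_quad = np.mean((t_train.to_numpy() - A2_train @ w_quad) ** 2)
print(f"Train MSE, linear features:    {train_mse_lin:.2f}")
print(f"Train MSE, quadratic features: {train_mse_quad:.2f}")
```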
(11) Repeat experiments (9) and (10), adding x³ columns in addition to the existing x and x² columns for each feature. Does the cubic polynomial do a better job of predicting the values in the training set? Does it do a better job of predicting the values in the test set?
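The degree parameter of the design() helper above makes the cubic case a one-line change; a sketch:

```python
# degree=3 adds an x^3 column per feature on top of x and x^2
A3_train, A3_test = design(X_train, degree=3), design(X_test, degree=3)
w_cubic = np.linalg.solve(A3_train.T @ A3_train, A3_train.T @ t_train.to_numpy())

print(f"Train MSE (cubic): {np.mean((t_train.to_numpy() - A3_train @ w_cubic) ** 2):.2f}")
print(f"Test MSE (cubic):  {np.mean((t_test.to_numpy() - A3_test @ w_cubic) ** 2):.2f}")
```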