Question

In: Statistics and Probability

Consider a relational dataset and specify your input and output variables, then:

    1. Train the model using 80% of this dataset and suggest an appropriate GLM relating the output variable to the input variables.
    2. Identify the variables that have a significant effect on the output variable at the α = 0.05 level and explore the related hypothesis tests. Estimate the parameters of your model.
    3. Predict the output of the test dataset using the trained model. Provide the functional form of the optimal predictive model.
    4. Provide the confusion matrix and obtain the probability of correctness of the predictions.

Solutions

Expert Solution

Predicting Miles-per-Gallon Performance of Various Cars
Libraries
Import the numpy, pandas, matplotlib, and seaborn libraries, then set %matplotlib inline.

NumPy is a linear algebra library for Python. Almost all of the libraries in the PyData ecosystem rely on NumPy. NumPy is also incredibly fast, as it has bindings to C libraries.

Pandas is an open-source library used for fast analysis, data cleaning, and data preparation. With it we can work with a wide variety of data.

Matplotlib is a Python 2D (and 3D) plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.

Seaborn is a statistical plotting library with attractive default styles, designed to work with pandas DataFrames.

%matplotlib inline renders plots directly in the Jupyter notebook.

In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Reading Data
Read the auto-mpg CSV file into a DataFrame called df.

In [10]:
df = pd.read_csv('auto-mpg.csv')
In [11]:
df.head(6)
Out[11]:
mpg   cylinders   displacement   horsepower   weight   acceleration   model_year   origin   name
0   18.0   8   307.0   130   3504   12.0   70   1   chevrolet chevelle malibu
1   15.0   8   350.0   165   3693   11.5   70   1   buick skylark 320
2   18.0   8   318.0   150   3436   11.0   70   1   plymouth satellite
3   16.0   8   304.0   150   3433   12.0   70   1   amc rebel sst
4   17.0   8   302.0   140   3449   10.5   70   1   ford torino
5   15.0   8   429.0   198   4341   10.0   70   1   ford galaxie 500
The data consists of technical specifications of cars and was downloaded from the UCI Machine Learning Repository.

The aim is to develop a machine learning model to analyze the miles-per-gallon performance of various cars.

The data has three multivalued discrete attributes (cylinders, model_year, and origin), five continuous attributes (mpg, displacement, horsepower, weight, and acceleration), and one string attribute (the car name).

Some samples contain missing values, represented by '?', in the horsepower column. We handle these by re-reading the auto-mpg CSV file as a new DataFrame, df1, telling pandas to treat '?' as an NA value.

In [12]:
df1 = pd.read_csv('auto-mpg.csv', na_values='?')
Pandas describe() shows basic statistics (count, mean, standard deviation, percentiles, etc.) for the numeric columns of a DataFrame or Series.

In [13]:
df1.describe()
Out[13]:
mpg   cylinders   displacement   horsepower   weight   acceleration   model_year   origin
count   398.000000   398.000000   398.000000   392.000000   398.000000   398.000000   398.000000   398.000000
mean   23.514573   5.454774   193.425879   104.469388   2970.424623   15.568090   76.010050   1.572864
std   7.815984   1.701004   104.269838   38.491160   846.841774   2.757689   3.697627   0.802055
min   9.000000   3.000000   68.000000   46.000000   1613.000000   8.000000   70.000000   1.000000
25%   17.500000   4.000000   104.250000   75.000000   2223.750000   13.825000   73.000000   1.000000
50%   23.000000   4.000000   148.500000   93.500000   2803.500000   15.500000   76.000000   1.000000
75%   29.000000   8.000000   262.000000   126.000000   3608.000000   17.175000   79.000000   2.000000
max   46.600000   8.000000   455.000000   230.000000   5140.000000   24.800000   82.000000   3.000000
The pandas DataFrame.info() function gives a concise summary of the DataFrame.

In [14]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg 398 non-null float64
cylinders 398 non-null int64
displacement 398 non-null float64
horsepower 392 non-null float64
weight 398 non-null int64
acceleration 398 non-null float64
model_year 398 non-null int64
origin 398 non-null int64
name 398 non-null object
dtypes: float64(4), int64(4), object(1)
memory usage: 28.1+ KB
From info() we see that the DataFrame has 398 entries in total. The horsepower attribute has only 392 non-null entries, so the remaining 6 are missing. The name attribute is of type object, while all other attributes are numeric.
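As a quick sanity check (a sketch added here, not part of the original notebook), the per-column missing-value counts confirm the six missing horsepower entries:

In [ ]:
# count NaN entries per column; horsepower should show 6
df1.isnull().sum()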

Exploratory Data Analysis
Using seaborn to create a scatter plot of the horsepower values against the row index.

In [15]:
plt.figure(figsize=(10,10))
sns.scatterplot(x=df1.index,y=df1.horsepower)
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0xca5535e048>
[Scatter plot: horsepower vs. row index, showing several high outliers]

We use the median to replace the NA values because the scatter plot above shows that the data contains outliers; in that case the median is a more robust measure of central tendency than the mean.

Since the horsepower column contains missing values, we replace them with the median of the remaining values in that column.

In [16]:
median = df1.horsepower.median()
In [17]:
median
Out[17]:
93.5
In [18]:
df1['horsepower'] = df1['horsepower'].fillna(median)
In [19]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg 398 non-null float64
cylinders 398 non-null int64
displacement 398 non-null float64
horsepower 398 non-null float64
weight 398 non-null int64
acceleration 398 non-null float64
model_year 398 non-null int64
origin 398 non-null int64
name 398 non-null object
dtypes: float64(4), int64(4), object(1)
memory usage: 28.1+ KB
The name attribute contains non-numeric values. It is useful for data analysis but not for data modeling, so we remove it from the DataFrame and create a new one.

In [20]:
df2 = df1.drop('name', axis=1)
After the data-cleaning step we check the DataFrame again. It no longer contains any missing values or non-numeric attributes, so it is in the proper format for modeling.

In [21]:
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 8 columns):
mpg 398 non-null float64
cylinders 398 non-null int64
displacement 398 non-null float64
horsepower 398 non-null float64
weight 398 non-null int64
acceleration 398 non-null float64
model_year 398 non-null int64
origin 398 non-null int64
dtypes: float64(4), int64(4)
memory usage: 25.0 KB
A heatmap of the correlation matrix shows how the attributes in the data relate to one another.

In [22]:
sns.heatmap(data=df2.corr(), annot=True)
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0xca5545f4a8>
[Annotated heatmap of the pairwise correlation matrix]

From the correlation matrix above, we can see strong correlations between the attribute pairs mpg & displacement, mpg & weight, cylinders & displacement, cylinders & horsepower, cylinders & weight, displacement & horsepower, displacement & weight, and horsepower & weight.
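As a supplementary sketch (not in the original notebook, and the 0.8 cutoff is an assumption), the strongly correlated pairs can also be listed programmatically:

In [ ]:
# list attribute pairs whose absolute correlation exceeds 0.8 (threshold is an assumption)
corr = df2.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep upper triangle only
upper.stack().loc[lambda s: s > 0.8].sort_values(ascending=False)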

In [23]:
df2.head()
Out[23]:
mpg   cylinders   displacement   horsepower   weight   acceleration   model_year   origin
0   18.0   8   307.0   130.0   3504   12.0   70   1
1   15.0   8   350.0   165.0   3693   11.5   70   1
2   18.0   8   318.0   150.0   3436   11.0   70   1
3   16.0   8   304.0   150.0   3433   12.0   70   1
4   17.0   8   302.0   140.0   3449   10.5   70   1
The heatmap shows that multicollinearity exists among the attributes of the data.

We need to convert categorical features into dummy variables; otherwise our machine learning algorithm cannot take those features directly as inputs. The model_year and origin columns are categorical, so we add dummy variables for them. The functions and arguments used below are:

pd.get_dummies(): turns a categorical variable into a set of zero/one indicator columns, which makes the categories much easier to quantify and compare.

drop_first=True: one of the dummy columns can be reconstructed completely from the others, so keeping it adds no information to the modeling process; it is good practice to drop the first column (see the toy sketch after this list).

pd.concat(): performs concatenation operations along an axis.

DataFrame.drop(): removes one or more columns (or rows) from a DataFrame.

inplace=True: changes the data directly instead of returning a modified copy.

DataFrame.column_name.unique(): returns the unique values in the column.
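The following toy example (a hypothetical frame, not part of the original analysis) shows what get_dummies with drop_first=True produces for a three-level variable:

In [ ]:
# toy illustration: three origin levels become two indicator columns once the first is dropped
toy = pd.DataFrame({'origin': [1, 2, 3, 1]})
pd.get_dummies(toy['origin'], drop_first=True)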

In [24]:
modelyear = pd.get_dummies(df2['model_year'], drop_first=True)
In [25]:
origin1 = pd.get_dummies(df2['origin'], drop_first=True)
In [26]:
df3 = pd.concat([df2, modelyear, origin1], axis=1)
In [27]:
len(list(df3.model_year.unique()))
Out[27]:
13
In [28]:
len(list(df3.origin.unique()))
Out[28]:
3
In [29]:
df3.drop(['origin', 'model_year'], axis=1, inplace=True)
In [30]:
df3.columns
Out[30]:
Index([ 'mpg', 'cylinders', 'displacement', 'horsepower',
'weight', 'acceleration', 71, 72,
73, 74, 75, 76,
77, 78, 79, 80,
81, 82, 2, 3],
dtype='object')
StandardScaler performs standardization. A dataset usually contains variables that differ in scale.

It transforms each feature so that its distribution has mean 0 and standard deviation 1: each value has the feature's mean subtracted and is then divided by that feature's standard deviation.
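As a quick check (a sketch, not part of the original notebook), the transformation z = (x - mean) / std can be reproduced by hand for a single column; ddof=0 matches StandardScaler's population standard deviation:

In [ ]:
# standardize 'weight' manually; should match StandardScaler's output for that column
z = (df3['weight'] - df3['weight'].mean()) / df3['weight'].std(ddof=0)
z.head()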

In [31]:
from sklearn.preprocessing import StandardScaler
In [32]:
sc = StandardScaler()
Transforming the data onto a standard scale.

The two steps, fitting and transforming the dataset, can be performed together using fit_transform.

The fit method learns the parameters from the dataset (here, the mean and standard deviation of each feature); the transform method then applies them to produce the transformed (scaled) dataset.

In [33]:
sc.fit_transform(df3)
C:\Users\shiv\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:645: DataConversionWarning: Data with input dtype uint8, int64, float64 were all converted to float64 by StandardScaler.
return self.partial_fit(X, y)
C:\Users\shiv\Anaconda3\lib\site-packages\sklearn\base.py:464: DataConversionWarning: Data with input dtype uint8, int64, float64 were all converted to float64 by StandardScaler.
return self.fit(X, **fit_params).transform(X)
Out[33]:
array([[-0.7064387 , 1.49819126, 1.0906037 , ..., -0.29063493,
-0.46196822, -0.49764335],
[-1.09075062, 1.49819126, 1.5035143 , ..., -0.29063493,
-0.46196822, -0.49764335],
[-0.7064387 , 1.49819126, 1.19623199, ..., -0.29063493,
-0.46196822, -0.49764335],
...,
[ 1.08701694, -0.85632057, -0.56103873, ..., 3.44074261,
-0.46196822, -0.49764335],
[ 0.57460104, -0.85632057, -0.70507731, ..., 3.44074261,
-0.46196822, -0.49764335],
[ 0.95891297, -0.85632057, -0.71467988, ..., 3.44074261,
-0.46196822, -0.49764335]])
In [34]:
df3.head()
Out[34]:
mpg   cylinders   displacement   horsepower   weight   acceleration   71   72   73   74   75   76   77   78   79   80   81   82   2   3
0   18.0   8   307.0   130.0   3504   12.0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
1   15.0   8   350.0   165.0   3693   11.5   0   0   0   0   0   0   0   0   0   0   0   0   0   0
2   18.0   8   318.0   150.0   3436   11.0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
3   16.0   8   304.0   150.0   3433   12.0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
4   17.0   8   302.0   140.0   3449   10.5   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Training and Testing Data
In [35]:
y = df3['mpg']
In [36]:
X = df3.drop(['mpg'], axis=1)
In [37]:
X = sc.fit_transform(X)
C:\Users\shiv\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:645: DataConversionWarning: Data with input dtype uint8, int64, float64 were all converted to float64 by StandardScaler.
return self.partial_fit(X, y)
C:\Users\shiv\Anaconda3\lib\site-packages\sklearn\base.py:464: DataConversionWarning: Data with input dtype uint8, int64, float64 were all converted to float64 by StandardScaler.
return self.fit(X, **fit_params).transform(X)
In [38]:
X
Out[38]:
array([[ 1.49819126, 1.0906037 , 0.67311762, ..., -0.29063493,
-0.46196822, -0.49764335],
[ 1.49819126, 1.5035143 , 1.58995818, ..., -0.29063493,
-0.46196822, -0.49764335],
[ 1.49819126, 1.19623199, 1.19702651, ..., -0.29063493,
-0.46196822, -0.49764335],
...,
[-0.85632057, -0.56103873, -0.53187283, ..., 3.44074261,
-0.46196822, -0.49764335],
[-0.85632057, -0.70507731, -0.66285006, ..., 3.44074261,
-0.46196822, -0.49764335],
[-0.85632057, -0.71467988, -0.58426372, ..., 3.44074261,
-0.46196822, -0.49764335]])
In [39]:
y = sc.fit_transform(np.array(y).reshape(-1, 1))
Train Test Split
Using model_selection.train_test_split from sklearn to split the data into training and testing sets.

In [40]:
from sklearn.model_selection import train_test_split
scikit-learn provides a helpful function for partitioning data, train_test_split, which splits our data into a training set and a test set. Its main parameters are:

test_size : float, int or None, optional (default=None)

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.

train_size : float, int, or None, (default=None)

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator. If a RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.

shuffle : boolean, optional (default=True)

Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.

stratify : array-like or None (default=None)

If not None, data is split in a stratified fashion, using this as the class labels.

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
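As a quick sanity check (not in the original notebook), the shapes confirm the 80/20 split of the 398 samples:

In [ ]:
# expect roughly 318 training rows and 80 test rows
X_train.shape, X_test.shape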
Training the Model
Ridge regression is a remedial measure taken to alleviate multicollinearity among the predictor variables in a regression model. It adds a small amount of bias, an L2 penalty that shrinks the coefficients, in order to stabilize the estimates.
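Concretely, ridge minimizes ||y - Xb||^2 + alpha * ||b||^2 over the coefficients b. As an optional sketch (not part of the original solution, and the alpha grid is an assumption), RidgeCV could choose the penalty strength by cross-validation instead of using the default alpha=1.0:

In [ ]:
# optional: choose the ridge penalty alpha by cross-validation (alpha grid is an assumption)
from sklearn.linear_model import RidgeCV
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0])
ridge_cv.fit(X_train, y_train)
ridge_cv.alpha_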

In [42]:
from sklearn.linear_model import Ridge
Creating an instance of a Ridge() model named model.

In [43]:
model = Ridge()
Fitting the training data set

In [44]:
model.fit(X_train,y_train)
Out[44]:
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, random_state=None, solver='auto', tol=0.001)
Predicting Test Data
In [45]:
predictions = model.predict(X_test)
Evaluating Model
We evaluate the model by checking how well it fits the test data.

R-squared is a statistical measure of how close the data are to the fitted regression line; it is also known as the coefficient of determination. The best possible score is 1.0, and lower values are worse.
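R-squared is defined as 1 - SS_res / SS_tot. As a quick check (a sketch, not part of the original notebook), it can be computed by hand; the r2_score call below should agree:

In [ ]:
# manual R-squared: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_test - predictions) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
1 - ss_res / ss_tot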

In [46]:
from sklearn.metrics import r2_score
In [47]:
r2_score(y_test, predictions)
Out[47]:
0.8417934448295403
Visualization
In [56]:
plt.scatter(y_test,predictions)
Out[56]:
<matplotlib.collections.PathCollection at 0xca5893df60>
[Scatter plot of predictions against the observed test values]

The predictions track the observed values approximately linearly.

Conclusion
After analyzing the auto-mpg data, we can conclude that the ridge regression model achieves an R² of about 0.84 on the test set; that is, the model explains roughly 84% of the variability of the response around its mean. (An R² of exactly 1 would mean the fitted values equal the observed values, with every point falling on the fitted regression line.)
