Consider a relational dataset and specify your input and output variables, then:
To Predict the Miles-per-Gallon (MPG) Performance of Various Cars
Libraries
Importing the numpy, pandas, matplotlib, and seaborn libraries, and then setting %matplotlib inline.
NumPy is a linear algebra library for Python. Almost all of the libraries in the PyData ecosystem rely on NumPy. NumPy is also incredibly fast, as it has bindings to C libraries.
Pandas is an open-source library used for fast analysis, data cleaning, and data preparation. With it we can work with a wide variety of data.
Matplotlib is a Python 2D and 3D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.
Seaborn is a statistical plotting library with beautiful default styles. It is designed to work with pandas DataFrames.
%matplotlib inline makes plots appear directly in the Jupyter notebook.
In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Reading Data
Reading the auto-mpg CSV file into a DataFrame called df.
In [10]:
df=pd.read_csv('auto-mpg.csv')
In [11]:
df.head(6)
Out[11]:
    mpg  cylinders  displacement horsepower  weight  acceleration  model_year  origin                       name
0  18.0          8         307.0        130    3504          12.0          70       1  chevrolet chevelle malibu
1  15.0          8         350.0        165    3693          11.5          70       1          buick skylark 320
2  18.0          8         318.0        150    3436          11.0          70       1         plymouth satellite
3  16.0          8         304.0        150    3433          12.0          70       1              amc rebel sst
4  17.0          8         302.0        140    3449          10.5          70       1                ford torino
5  15.0          8         429.0        198    4341          10.0          70       1           ford galaxie 500
The data consists of technical specifications of cars and was downloaded from the UCI Machine Learning Repository.
My aim is to develop a model, using a machine learning algorithm, to analyze the miles-per-gallon performance of various cars.
The data has 3 multi-valued discrete attributes (cylinders, model_year, and origin), 5 continuous attributes (mpg, displacement, horsepower, weight, and acceleration), and one string attribute, the car name.
Some samples in the auto-mpg data contain missing values, represented by '?', in the horsepower column. We replace those '?' entries with NA values by reading the auto-mpg CSV file again, this time as a DataFrame called df1.
In [12]:
df1=pd.read_csv('auto-mpg.csv',na_values='?')
Pandas' describe() is used to view basic statistical details of a DataFrame or a Series of numeric values, such as the count, mean, standard deviation, and percentiles.
In [13]:
df1.describe()
Out[13]:
              mpg   cylinders  displacement  horsepower       weight  acceleration  model_year      origin
count  398.000000  398.000000    398.000000  392.000000   398.000000    398.000000  398.000000  398.000000
mean    23.514573    5.454774    193.425879  104.469388  2970.424623     15.568090   76.010050    1.572864
std      7.815984    1.701004    104.269838   38.491160   846.841774      2.757689    3.697627    0.802055
min      9.000000    3.000000     68.000000   46.000000  1613.000000      8.000000   70.000000    1.000000
25%     17.500000    4.000000    104.250000   75.000000  2223.750000     13.825000   73.000000    1.000000
50%     23.000000    4.000000    148.500000   93.500000  2803.500000     15.500000   76.000000    1.000000
75%     29.000000    8.000000    262.000000  126.000000  3608.000000     17.175000   79.000000    2.000000
max     46.600000    8.000000    455.000000  230.000000  5140.000000     24.800000   82.000000    3.000000
Pandas' dataframe.info() function is used to get a concise summary of the DataFrame.
In [14]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg 398 non-null float64
cylinders 398 non-null int64
displacement 398 non-null float64
horsepower 392 non-null float64
weight 398 non-null int64
acceleration 398 non-null float64
model_year 398 non-null int64
origin 398 non-null int64
name 398 non-null object
dtypes: float64(4), int64(4), object(1)
memory usage: 28.1+ KB
From info() we learn that the DataFrame has 398 entries in total. The attribute 'horsepower' has only 392 non-null entries, so the remaining 6 entries are missing. Also, the attribute 'name' is of type object, while the rest of the attributes contain numeric values.
Exploratory Data Analysis
Using seaborn to create a scatter plot of the attribute 'horsepower' against the row index, to get a feel for its values.
In [15]:
plt.figure(figsize=(10,10))
sns.scatterplot(x=df1.index,y=df1.horsepower)
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0xca5535e048>
We use the median to replace the NA values because the scatter plot above shows that the data contains outliers, and in that case the median is a better measure of central tendency than the mean.
Since the column 'horsepower' contains missing values, we replace them with the median of the remaining values of that column.
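A quick illustration of that robustness in a scratch cell (toy numbers, not from the dataset): a single extreme value drags the mean but barely moves the median.
sample = pd.Series([90, 95, 100, 105, 230])   # 230 plays the outlier
print(sample.mean())    # 124.0 -- pulled toward the outlier
print(sample.median())  # 100.0 -- unaffected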
In [16]:
median=df1.horsepower.median()
In [17]:
median
Out[17]:
93.5
In [18]:
df1.horsepower.fillna(median,inplace=True)
In [19]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg 398 non-null float64
cylinders 398 non-null int64
displacement 398 non-null float64
horsepower 398 non-null float64
weight 398 non-null int64
acceleration 398 non-null float64
model_year 398 non-null int64
origin 398 non-null int64
name 398 non-null object
dtypes: float64(4), int64(4), object(1)
memory usage: 28.1+ KB
The attribute 'name' contains non-numeric values. It is therefore not relevant for data modeling, although it is still relevant for data analysis. So we remove it from the DataFrame and create a new DataFrame.
In [20]:
df2=df1.drop('name',axis=1)
After the data-cleaning step, we check the DataFrame. It no longer contains any missing values or non-numeric attributes, so it is in the proper format.
In [21]:
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 8 columns):
mpg 398 non-null float64
cylinders 398 non-null int64
displacement 398 non-null float64
horsepower 398 non-null float64
weight 398 non-null int64
acceleration 398 non-null float64
model_year 398 non-null int64
origin 398 non-null int64
dtypes: float64(4), int64(4)
memory usage: 25.0 KB
A heatmap of the correlation matrix is used to show how the attributes in the data behave with respect to one another.
In [22]:
sns.heatmap(data=df2.corr(),annot=True)
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0xca5545f4a8>
From the correlation matrix above, we can see that there are strong correlations between the attribute pairs mpg & displacement, mpg & weight, cylinders & displacement, cylinders & horsepower, cylinders & weight, displacement & horsepower, displacement & weight, and horsepower & weight.
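To make that reading reproducible, here is a small sketch (not part of the original notebook) that lists every attribute pair whose absolute correlation exceeds 0.8:
corr = df2.corr()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # upper triangle: each pair reported once
strong = corr.where(mask).stack()                     # long format; NaNs are dropped
print(strong[strong.abs() > 0.8].sort_values())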
In [23]:
df2.head()
Out[23]:
mpg cylinders displacement
horsepower weight acceleration
model_year origin
0 18.0 8 307.0
130.0 3504 12.0 70
1
1 15.0 8 350.0
165.0 3693 11.5 70
1
2 18.0 8 318.0
150.0 3436 11.0 70
1
3 16.0 8 304.0
150.0 3433 12.0 70
1
4 17.0 8 302.0
140.0 3449 10.5 70
1
From the heatmap we can see that there is multicollinearity among the attributes of the data.
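A standard way to quantify this multicollinearity is the variance inflation factor, where values above roughly 10 flag a problematic predictor. A minimal sketch, assuming statsmodels is installed (it is not used elsewhere in this notebook):
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

features = sm.add_constant(df2.drop('mpg', axis=1))   # add an intercept column for a fair VIF
for i, col in enumerate(features.columns):
    if col != 'const':                                # skip the intercept itself
        print(col, variance_inflation_factor(features.values, i))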
We'll need to convert categorical features to dummy variables; otherwise our machine learning algorithm won't be able to take those features directly as inputs. The columns 'model_year' and 'origin' are categorical, so we add dummy variables for these attributes.
pd.get_dummies(): turns a categorical variable into a set of zero/one columns, which makes the categories much easier to quantify and compare.
drop_first=True: since one of the dummy columns can be generated completely from the others, retaining it adds no new information for the modeling process, so it is good practice to always drop the first column (see the toy example below).
pd.concat(): performs concatenation operations along an axis.
dataframe.drop(): allows us to drop/remove one or more columns from a DataFrame.
inplace=True: the operation changes the content of the DataFrame directly, without making a copy.
dataframe.column_name.unique(): returns the unique values in the column.
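A tiny illustration of drop_first on toy data (not from the notebook; the dummy columns may print as 0/1 or True/False depending on the pandas version):
toy = pd.Series([1, 2, 3, 1], name='origin')
print(pd.get_dummies(toy, drop_first=True))
# category 1 is dropped; it is implied when both remaining columns are 0:
#    2  3
# 0  0  0
# 1  1  0
# 2  0  1
# 3  0  0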
In [24]:
modelyear=pd.get_dummies(df2['model_year'],drop_first=True)
In [25]:
origin1=pd.get_dummies(df2['origin'],drop_first=True)
In [26]:
df3=pd.concat([df2,modelyear,origin1],axis=1)
In [27]:
len(list(df3.model_year.unique()))
Out[27]:
13
In [28]:
len(list(df3.origin.unique()))
Out[28]:
3
In [29]:
df3.drop(['origin','model_year'],axis=1,inplace=True)
In [30]:
df3.columns
Out[30]:
Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82,
       2, 3],
      dtype='object')
StandardScaler performs the task of standardization. A dataset usually contains variables that differ in scale.
StandardScaler transforms the data so that each variable's distribution has a mean of 0 and a standard deviation of 1: each value has its column's mean subtracted from it and is then divided by that column's standard deviation.
In [31]:
from sklearn.preprocessing import StandardScaler
In [32]:
sc=StandardScaler()
Transforming the data into the standard scale.
We can perform the two steps, fitting and transforming the dataset, in a single step using fit_transform.
The fit method is applied to the dataset to learn the model parameters (for example, the mean and standard deviation). The transform method is then applied to the dataset to get the transformed (scaled) data.
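A small sanity check of that description (a sketch, not in the original notebook; it assumes no column is constant): fitting and transforming separately matches fit_transform, and both match the manual (x - mean) / std computation.
sc2 = StandardScaler()                  # separate instance for the check
sc2.fit(df3)                            # learn per-column mean_ and scale_
scaled = sc2.transform(df3)             # apply (x - mean) / std
manual = (df3.values - df3.values.mean(axis=0)) / df3.values.std(axis=0)
print(np.allclose(scaled, manual))      # True
print(np.allclose(scaled, StandardScaler().fit_transform(df3)))  # True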
In [33]:
sc.fit_transform(df3)
C:\Users\shiv\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:645:
DataConversionWarning: Data with input dtype uint8, int64, float64
were all converted to float64 by StandardScaler.
return self.partial_fit(X, y)
C:\Users\shiv\Anaconda3\lib\site-packages\sklearn\base.py:464:
DataConversionWarning: Data with input dtype uint8, int64, float64
were all converted to float64 by StandardScaler.
return self.fit(X, **fit_params).transform(X)
Out[33]:
array([[-0.7064387 ,  1.49819126,  1.0906037 , ..., -0.29063493,
        -0.46196822, -0.49764335],
       [-1.09075062,  1.49819126,  1.5035143 , ..., -0.29063493,
        -0.46196822, -0.49764335],
       [-0.7064387 ,  1.49819126,  1.19623199, ..., -0.29063493,
        -0.46196822, -0.49764335],
       ...,
       [ 1.08701694, -0.85632057, -0.56103873, ...,  3.44074261,
        -0.46196822, -0.49764335],
       [ 0.57460104, -0.85632057, -0.70507731, ...,  3.44074261,
        -0.46196822, -0.49764335],
       [ 0.95891297, -0.85632057, -0.71467988, ...,  3.44074261,
        -0.46196822, -0.49764335]])
In [34]:
df3.head()
Out[34]:
    mpg  cylinders  displacement  horsepower  weight  acceleration  71  72  73  74  75  76  77  78  79  80  81  82  2  3
0  18.0          8         307.0       130.0    3504          12.0   0   0   0   0   0   0   0   0   0   0   0   0  0  0
1  15.0          8         350.0       165.0    3693          11.5   0   0   0   0   0   0   0   0   0   0   0   0  0  0
2  18.0          8         318.0       150.0    3436          11.0   0   0   0   0   0   0   0   0   0   0   0   0  0  0
3  16.0          8         304.0       150.0    3433          12.0   0   0   0   0   0   0   0   0   0   0   0   0  0  0
4  17.0          8         302.0       140.0    3449          10.5   0   0   0   0   0   0   0   0   0   0   0   0  0  0
Training and Testing Data
In [35]:
y=df3['mpg']
In [36]:
X=df3.drop(['mpg'],axis=1)
In [37]:
X=sc.fit_transform(X)
C:\Users\shiv\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:645:
DataConversionWarning: Data with input dtype uint8, int64, float64
were all converted to float64 by StandardScaler.
return self.partial_fit(X, y)
C:\Users\shiv\Anaconda3\lib\site-packages\sklearn\base.py:464:
DataConversionWarning: Data with input dtype uint8, int64, float64
were all converted to float64 by StandardScaler.
return self.fit(X, **fit_params).transform(X)
In [38]:
X
Out[38]:
array([[ 1.49819126,  1.0906037 ,  0.67311762, ..., -0.29063493,
        -0.46196822, -0.49764335],
       [ 1.49819126,  1.5035143 ,  1.58995818, ..., -0.29063493,
        -0.46196822, -0.49764335],
       [ 1.49819126,  1.19623199,  1.19702651, ..., -0.29063493,
        -0.46196822, -0.49764335],
       ...,
       [-0.85632057, -0.56103873, -0.53187283, ...,  3.44074261,
        -0.46196822, -0.49764335],
       [-0.85632057, -0.70507731, -0.66285006, ...,  3.44074261,
        -0.46196822, -0.49764335],
       [-0.85632057, -0.71467988, -0.58426372, ...,  3.44074261,
        -0.46196822, -0.49764335]])
In [39]:
y=sc.fit_transform(np.array(y).reshape(-1,1))
Train Test Split
Using model_selection.train_test_split from sklearn to split the
data into training and testing sets.
In [40]:
from sklearn.model_selection import train_test_split
scikit-learn provides a helpful function for partitioning data, train_test_split, which splits our data into a training set and a test set.
test_size : float, int or None, optional (default=None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
train_size : float, int, or None, (default=None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator. If RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.
shuffle : boolean, optional (default=True)
Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
stratify : array-like or None (default=None)
If not None, data is split in a stratified fashion, using this as the class labels.
In [41]:
X_train, X_test, y_train, y_test = train_test_split( X, y,
test_size=0.2, random_state=42)
Training the Model
Ridge regression is a remedial measure taken to alleviate multicollinearity among the predictor variables in a regression model. It adds a small amount of bias to the coefficient estimates (an L2 penalty) in order to alleviate this problem.
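Concretely, scikit-learn's Ridge chooses the coefficient vector $w$ by minimizing the penalized least-squares objective

$$\min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2$$

where the regularization strength $\alpha$ (default 1.0, visible in the Out[44] repr below) controls how strongly the coefficients are shrunk; $\alpha = 0$ recovers ordinary least squares.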
In [42]:
from sklearn.linear_model import Ridge
Creating an instance of a Ridge() model, named model
In [43]:
model=Ridge()
Fitting the model to the training data set
In [44]:
model.fit(X_train,y_train)
Out[44]:
Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
max_iter=None,
normalize=False, random_state=None, solver='auto', tol=0.001)
Predicting Test Data
In [45]:
predictions=model.predict(X_test)
Evaluating Model
We evaluate the model by checking its R-squared score.
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination. The best possible score is 1.0; lower values are worse.
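In formula form (this is what sklearn.metrics.r2_score computes), with $\hat{y}_i$ the predictions and $\bar{y}$ the mean of the observed values:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$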
In [46]:
from sklearn.metrics import r2_score
In [47]:
r2_score(y_test,predictions)
Out[47]:
0.8417934448295403
Visualization
In [56]:
plt.scatter(y_test,predictions)
Out[56]:
<matplotlib.collections.PathCollection at 0xca5893df60>
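Optionally (not in the original notebook), axis labels and a 45-degree reference line make the plot easier to judge: points on the dashed line are perfect predictions.
plt.scatter(y_test, predictions)
plt.plot([y_test.min(), y_test.max()],
         [y_test.min(), y_test.max()], 'r--')  # perfect-prediction line
plt.xlabel('Actual mpg (standardized)')
plt.ylabel('Predicted mpg (standardized)')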
An approximately linear relationship between the actual and predicted values has been obtained!
Conclusion
After analyzing the auto-mpg data, we can conclude that the model we created using ridge regression achieves an R-squared of about 0.84, meaning it explains roughly 84% of the variability of the response data around its mean. If the score were a perfect 1.0, the fitted values would equal the observed values and every data point would fall on the fitted regression line.