In: Computer Science
Analyze used car inventory dataset using Python's pandas library - using DataFrame data structure¶
Dataset: UsedCarInventory_Assignment1.txt (available on Canvas)
This dataset shows used cars available for sale at a dealership. Each row represents a car record and columns tell information about each car. The first row in the dataset contains column headers.
You must use Pandas to complete all 10 tasks.
Obtaining the data
To get or load the dataset into the notebook, all I did was one trivial step. In Google Colab at the left-hand side of the notebook, you will find a “>” (greater than symbol). On clicking that you will find a tab with three options, out of which you have to select Files. Then you can easily upload your dataset with the help of the Upload option. No need to mount to the google drive or use any specific libraries just upload the data set and your job is done. This is how I got the dataset into my notebook
Scrubbing and Formatting
Formatting the data into a data frame
Since the data set was already in a CSV format. All I had to do is just format the data into a pandas data frame. This was done by using a pandas data frame method called (read_csv) by importing pandas library. The read_csv data frame method was used by passing the filename as an argument. And then by executing this, it converted the CSV file into a neatly organized pandas data frame format.
# Importing the required librariesimport pandas as pd import numpy as np import seaborn as sns #visualisation import matplotlib.pyplot as plt #visualisation %matplotlib inline sns.set(color_codes=True)# Loading the CSV file into a pandas dataframe.df = pd.read_csv(“CARS.csv”) df.head(5)
Determining instances and the number of features.
This data set has 428 instances and 15 features also called as rows and columns. The instances here represent different car brands such as BMW, Mercedes, Audi, and 35 more, features represent Make, Model, Type, Origin, Drive Train, MSRP, Invoice, Engine Size, Cylinders, Horsepower, MPG-City, MPG-Highway, Weight, Wheelbase, and Length of the car.
Removing irrelevant features.
I will remove some features such as Drive Train, Model, Invoice, Type, and Origin from this dataset. Because these features do not contribute to the prediction of price. As of now, I will remove the Drive Train, the Drive Train will not support for predicting the price of the car because most of the cars in this data set were front-wheel drive (52.8%) and the rest were rear-wheel and all-wheel drive.
Similarly, the model, type and origin are irrelevant and are not needed in this context, it’s the brand which is important not the model of the car, and when it comes to type of the car most of the cars were of type Sedan and I kept the weight and length features of the cars in which case I can easily determine whether if it’s an SUV, sedan or a truck. I will also be removing the Invoice feature of the car because I have the MSRP as the price I don’t need the invoice because having any one type of price of the car makes more sense and it prevents in leading in ambiguous results (because both MSRP and Invoice are very closely related and you cannot predict the MSRP given the invoice). Lastly, the origin of cars has nothing to do with the prediction rate so I had to remove it and most of the cars were originated from Europe.
# Removing irrelevant featuresdf = df.drop([‘Model’,’DriveTrain’,’Invoice’, ‘Origin’, ‘Type’], axis=1) df.head(5)
Exploratory Data Analysis
Identifying the type of data using info()
To identify the data types, I use the info method. The info method prints a summary of the data in the data frame along with its data types. Here, there are 428 entries (0–427 rows). The data frame after removing irrelevant columns comprises 10 columns. Here the Make, MSRP is of an object type whereas Engine size and Cylinders are of float type and Horsepower, MPG_City, MPG_Highway, Weight, Wheelbase and Length are of integer type. Hence there are 2 object types, 2 float types and 6 integer types of data present in the data frame.
# To identify the type of datadf.info()<class ‘pandas.core.frame.DataFrame’> RangeIndex: 428 entries, 0 to 427 Data columns (total 10 columns): Make 428 non-null object MSRP 428 non-null object EngineSize 428 non-null float64 Cylinders 426 non-null float64 Horsepower 428 non-null int64 MPG_City 428 non-null int64 MPG_Highway 428 non-null int64 Weight 428 non-null int64 Wheelbase 428 non-null int64 Length 428 non-null int64 dtypes: float64(2), int64(6), object(2) memory usage: 33.5+ KB
Finding the dimensions of the data frame
To get the number of rows and columns of the data frame, I used the shape method. The shape method gets the number of rows and the number of columns of the data frame. Here, there are 428 rows and 10 columns. Hence the shape method returns (428, 10). And to find the dimensions of the data frame I used ndim (dimension) method. This method prints the dimensions of the data frame. Here, the whole data frame is of 2 dimensional (rows and columns).
# Getting the number of instances and featuresdf.shape(428, 10)# Getting the dimensions of the data frame df.ndim2
Finding duplicate data.
This is a handy thing to perform on a data set because often there might be duplicate or redundant data in the data sets, to remove this I used the MSRP as a reference such that there cannot be more than two same MSRP prices of the car, it shows that few data are redundant because prices of the cars can never match very accurately. So before removing the duplicates, there were 428 rows and after removing there are 410 meaning that there were 18 duplicate data.
df = df.drop_duplicates(subset=’MSRP’, keep=’first’) df.count()Make 410 MSRP 410 EngineSize 410 Cylinders 408 Horsepower 410 MPG_City 410 MPG_Highway 410 Weight 410 Wheelbase 410 Length 410 dtype: int64
Finding the missing or null values.
Many times there might be a lot of missing values in the dataset. There are several approaches to deal with this scenario either we can drop those values or fill those values with the mean of that column. Here, 2 entries were having N/A in the Cylinders feature. This can be found by using the is_null( ) method which returns the null or missing values in the data frame. So rather than deleting those two entries, I filled those values with the mean of the cylinders columns and their value came as 6.0 each. I was able to find this while I was peeking at the first and last few rows of the data set. I think rather than deleting this is a good approach because every entry of data is vital. I found out that there were two values stored a NaN (Not a number) in the Cylinders features. So I printed them by using the slicing technique of mentioning their index.
# Finding the null valuesprint(df.isnull().sum())Make 0 MSRP 0 EngineSize 0 Cylinders 2 Horsepower 0 MPG_City 0 MPG_Highway 0 Weight 0 Wheelbase 0 Length 0 dtype: int64# Printing the null value rowsdf[240:242]
# Filling the rows with the mean of the columnval = df[‘Cylinders’].mean() df[‘Cylinders’][247] = round(val)val = df[‘Cylinders’].mean() df[‘Cylinders’][248]= round(val)
Converting the object values to an integer type.
While having a look at the data, the MSRP was stored as an object type. This is a serious problem because it is impossible to plot those values on a graph because it is a primary requirement that during plotting a graph all the values must be of type integer data. The author has stored, the MSRP in a different format ($36, 000) so I had to remove the formatting and then convert them to an integer.
# Removing the formattingdf[‘MSRP’] = [x.replace(‘$’, ‘’) for x in df[‘MSRP’]] df[‘MSRP’] = [x.replace(‘,’, ‘’) for x in df[‘MSRP’]]df[‘MSRP’]=pd.to_numeric(df[‘MSRP’],errors=’coerce’)
Detecting Outliers
An outlier is a point or set of points different from other points. Sometimes they can be very high or very low. It’s often a good idea to detect and remove the outliers. Because outliers are one of the primary reasons for resulting in a less accurate model. Hence it’s a good idea to remove them. I will perform the IQR score technique to detect and remove the outliers. Often outliers can be seen with visualizations using a box plot. Shown below is the box plot of MSRP. In the plot, you can find some points are outside the box they are none other than outliers. I referred the above outlier technique from towards data science article which can be found in the references section [3].
sns.boxplot(x=df[‘MSRP’])
Q1 = df.quantile(0.25) Q3 = df.quantile(0.75) IQR = Q3 — Q1 print(IQR)MSRP 19086.50 EngineSize 1.55 Cylinders 2.00 Horsepower 85.00 MPG_City 4.00 MPG_Highway 5.00 Weight 872.25 Wheelbase 9.00 Length 16.00 dtype: float6df = df[~((df < (Q1–1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis=1)]
After using the technique now as seen below the MSRP box plot contains no outlier points this is a big improvement. Previously there were over 15 points of outliers now I have removed those outliers.
sns.boxplot(x=df[‘MSRP’])
Performing a 5 number summary (min, lower quartile, median, upper quartile, max
Next step is to perform a 5-number summary for the numeric data. As discussed earlier the numeric data, in this case, are MSRP, Engine Size, Horsepower, Cylinders, Horsepower, MPG_City, MPG_Highway, Weight, Wheelbase, and Length. The five-number summary includes minimum, lower quartile, median, upper quartile, and the maximum values all these values can be obtained by using the describe method.
df.describe()
Plotting different features against one another.
Heat Maps
Heat Maps is a plot which is necessary when we need to find the dependent variables. One of the best ways to find the correlation between the features can be done using heat maps. As shown below the price feature (MSRP) has a strong correlation with Horsepower of 83% this is very important because the more the relationship between the variables the more accurate the model will be. This is how the correlation between the features can be found using heat maps. With the help of heat maps, I can use these related features in building my model.
# Plotting a heat mapplt.figure(figsize=(10,5)) c= df.corr() sns.heatmap(c,cmap=”BrBG”,annot=True)
Scatterplot between two related variables
I know the features especially MSRP and the Horsepower are more related. Since I have two related variables I used a scatter plot to show their relationship. Here the scatter plots are plotted between Horsepower and MSRP are as shown below. With the plot given below, we can easily draw a trend line during modeling. I can easily see a line of best fit in the plot. I have not included the scatter plot between MSRP and Engine Size or Cylinders the reason for this is that these data have comparatively less correlation with the MSRP than that of MSRP and Horsepower which is 83%. Because as seen above the correlation between MSRP and Engine Size is of 54% and that of MSRP and Cylinders is of 64% so there is no reason to plot these features.
# Plotting a scatter plotfig, ax = plt.subplots(figsize=(5,5)) ax.scatter(df[‘Horsepower’], df[‘MSRP’]) plt.title(‘Scatter plot between MSRP and Horsepower’) ax.set_xlabel(‘Horsepower’) ax.set_ylabel(‘MSRP’) plt.show()