Question

In: Computer Science

Analyze a used car inventory dataset using Python's pandas library and the DataFrame data structure

Dataset: UsedCarInventory_Assignment1.txt (available on Canvas)

This dataset shows used cars available for sale at a dealership. Each row represents a car record, and each column describes an attribute of the car. The first row in the dataset contains the column headers.

You must use Pandas to complete all 10 tasks.

Solutions

Expert Solution

Obtaining the data

To load the dataset into the notebook, all I did was one trivial step. In Google Colab, on the left-hand side of the notebook, you will find a ">" (greater-than) symbol. Clicking it opens a panel with three tabs; select Files, then upload the dataset using the Upload option. There is no need to mount Google Drive or import any special library: just upload the dataset and the job is done. This is how I got the dataset into my notebook.
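For completeness, the same step can also be done in code; a minimal sketch, assuming the notebook runs in Google Colab (the files module below is Colab-specific):

# Upload a file through Colab's file picker; it lands in the working directory
from google.colab import files
uploaded = files.upload()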

Scrubbing and Formatting

Formatting the data into a data frame

Since the dataset was already in CSV format, all I had to do was load it into a pandas DataFrame. This was done with pandas' read_csv function, after importing the pandas library. I called read_csv with the filename as an argument, and it converted the CSV file into a neatly organized pandas DataFrame.

# Importing the required libraries
import pandas as pd
import numpy as np
import seaborn as sns             # visualisation
import matplotlib.pyplot as plt   # visualisation
%matplotlib inline
sns.set(color_codes=True)

# Loading the CSV file into a pandas dataframe
df = pd.read_csv("CARS.csv")
df.head(5)

Determining instances and the number of features.

This dataset has 428 instances (rows) and 15 features (columns). The instances represent individual cars from brands such as BMW, Mercedes, Audi, and 35 others; the features are Make, Model, Type, Origin, DriveTrain, MSRP, Invoice, EngineSize, Cylinders, Horsepower, MPG_City, MPG_Highway, Weight, Wheelbase, and Length. The brand count can be sanity-checked with the snippet below.
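A sketch of that check, assuming df has been loaded as shown earlier:

# Number of distinct car brands in the inventory
print(df['Make'].nunique())
# A few example brand names
print(df['Make'].unique()[:5])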

Removing irrelevant features.

I will remove the features DriveTrain, Model, Invoice, Type, and Origin from this dataset, because they do not contribute to predicting the price. Starting with DriveTrain: it will not help in predicting the price of a car, because most of the cars in this dataset are front-wheel drive (52.8%), with the rest split between rear-wheel and all-wheel drive.

Similarly, Model, Type, and Origin are irrelevant in this context. It is the brand that matters, not the model, and most of the cars are of type Sedan; besides, I kept the Weight and Length features, from which I can roughly tell whether a car is an SUV, a sedan, or a truck. I will also remove the Invoice feature: since I already have MSRP as the price, keeping only one price column makes more sense and prevents ambiguous results (MSRP and Invoice are very closely related, so keeping both adds no information). Lastly, the origin of a car has nothing to do with predicting its price, so I removed it as well; most of the cars in this dataset originate from Europe. The shares quoted above can be verified before the columns are dropped, as sketched below.
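A quick sketch of that verification, using the original column names:

# Fraction of cars per drive-train type (front-wheel drive should be ~0.528)
print(df['DriveTrain'].value_counts(normalize=True))
# Fraction of cars per region of origin
print(df['Origin'].value_counts(normalize=True))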

# Removing irrelevant features
df = df.drop(['Model', 'DriveTrain', 'Invoice', 'Origin', 'Type'], axis=1)
df.head(5)

Exploratory Data Analysis

Identifying the type of data using info()

To identify the data types, I use the info method, which prints a summary of the DataFrame along with the data type of each column. There are 428 entries (rows 0-427). After removing the irrelevant columns, the DataFrame has 10 columns. Make and MSRP are of object type, EngineSize and Cylinders are floats, and Horsepower, MPG_City, MPG_Highway, Weight, Wheelbase, and Length are integers. Hence the DataFrame contains 2 object, 2 float, and 6 integer columns.

# To identify the type of data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 428 entries, 0 to 427
Data columns (total 10 columns):
Make           428 non-null object
MSRP           428 non-null object
EngineSize     428 non-null float64
Cylinders      426 non-null float64
Horsepower     428 non-null int64
MPG_City       428 non-null int64
MPG_Highway    428 non-null int64
Weight         428 non-null int64
Wheelbase      428 non-null int64
Length         428 non-null int64
dtypes: float64(2), int64(6), object(2)
memory usage: 33.5+ KB

Finding the dimensions of the data frame

To get the number of rows and columns of the DataFrame, I used the shape attribute, which returns the number of rows and columns as a tuple. Here there are 428 rows and 10 columns, so shape returns (428, 10). To find the dimensionality of the DataFrame, I used the ndim attribute, which returns the number of dimensions; a DataFrame is 2-dimensional (rows and columns).

# Getting the number of instances and features
df.shape
(428, 10)

# Getting the dimensions of the data frame
df.ndim
2

Finding duplicate data.

This is a handy check to perform, because datasets often contain duplicate or redundant records. To remove them, I used MSRP as the reference column, on the assumption that two distinct cars are very unlikely to share exactly the same MSRP; matching prices therefore indicate redundant rows. Before removing duplicates there were 428 rows, and afterwards there are 410, meaning 18 rows were duplicates. The duplicate count can be confirmed before dropping, as sketched below.
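A quick check using the same MSRP subset (it should report 18 here):

# Count rows whose MSRP duplicates an earlier row
print(df.duplicated(subset='MSRP').sum())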

df = df.drop_duplicates(subset='MSRP', keep='first')
df.count()

Make           410
MSRP           410
EngineSize     410
Cylinders      408
Horsepower     410
MPG_City       410
MPG_Highway    410
Weight         410
Wheelbase      410
Length         410
dtype: int64

Finding the missing or null values.

Datasets often contain missing values. There are several ways to deal with this: we can drop the affected rows, or fill the missing values with the mean of the column. Here, 2 entries in the Cylinders feature were missing. This can be found using the isnull() method, which flags the null or missing values in the DataFrame. Rather than deleting those two entries, I filled them with the mean of the Cylinders column, which rounds to 6.0 in each case. I noticed the problem while peeking at the first and last few rows of the dataset: two values in the Cylinders feature were stored as NaN (Not a Number). I think filling is a better approach than deleting, because every data entry is valuable. I then printed the affected rows by slicing on their index.

# Finding the null values
print(df.isnull().sum())

Make           0
MSRP           0
EngineSize     0
Cylinders      2
Horsepower     0
MPG_City       0
MPG_Highway    0
Weight         0
Wheelbase      0
Length         0
dtype: int64

# Printing the null value rows
df[240:242]

# Filling the rows with the mean of the column
# (using .loc to avoid pandas' chained-assignment pitfall)
val = df['Cylinders'].mean()
df.loc[247, 'Cylinders'] = round(val)
df.loc[248, 'Cylinders'] = round(val)
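As an aside, the same fill can be done in one idiomatic line, without addressing the rows by index; a sketch of the alternative:

# Fill every missing Cylinders value with the rounded column mean at once
df['Cylinders'] = df['Cylinders'].fillna(round(df['Cylinders'].mean()))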

Converting the object values to an integer type.

While looking at the data, I noticed that MSRP was stored as an object type. This is a serious problem, because the values cannot be plotted on a graph as-is: plotting requires numeric values. The prices are stored in a formatted style ($36,000), so I had to strip the formatting and then convert the column to a numeric type.

# Removing the formatting
df['MSRP'] = [x.replace('$', '') for x in df['MSRP']]
df['MSRP'] = [x.replace(',', '') for x in df['MSRP']]
df['MSRP'] = pd.to_numeric(df['MSRP'], errors='coerce')
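The two replace passes can also be collapsed into a single vectorized call; a sketch of the equivalent one-step cleanup:

# Strip '$' and ',' in one regex pass, then convert to numbers
df['MSRP'] = pd.to_numeric(df['MSRP'].str.replace(r'[$,]', '', regex=True),
                           errors='coerce')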

Detecting Outliers

An outlier is a point, or set of points, that differs markedly from the other points; outliers can be unusually high or unusually low. It is often a good idea to detect and remove them, because outliers are one of the primary causes of a less accurate model. I will use the IQR-score technique to detect and remove the outliers. Outliers are often easiest to see in a box plot. Shown below is the box plot of MSRP: the points lying outside the box are the outliers. I adapted this outlier technique from a Towards Data Science article, which can be found in the references section [3].

sns.boxplot(x=df['MSRP'])

# Restricting to the numeric columns so the quantile comparison is well-defined
num = df.select_dtypes(include='number')
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

MSRP           19086.50
EngineSize         1.55
Cylinders          2.00
Horsepower        85.00
MPG_City           4.00
MPG_Highway        5.00
Weight           872.25
Wheelbase          9.00
Length            16.00
dtype: float64

# Keep only rows with no feature outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
df = df[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)]

After applying the technique, the MSRP box plot below contains no outlier points, which is a big improvement: previously there were over 15 outlier points, and they have now all been removed.

sns.boxplot(x=df['MSRP'])

Performing a 5-number summary (min, lower quartile, median, upper quartile, max)

The next step is to compute a 5-number summary for the numeric data. As discussed earlier, the numeric features in this case are MSRP, EngineSize, Cylinders, Horsepower, MPG_City, MPG_Highway, Weight, Wheelbase, and Length. The five-number summary consists of the minimum, lower quartile, median, upper quartile, and maximum; all of these values can be obtained with the describe method.

df.describe()
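Note that describe also reports count, mean, and std. If only the five numbers themselves are wanted, quantile returns exactly them; a sketch for the MSRP column:

# The five-number summary of MSRP: min, Q1, median, Q3, max
print(df['MSRP'].quantile([0, 0.25, 0.5, 0.75, 1.0]))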

Plotting different features against one another.

Heat Maps

A heat map is a plot that is useful when we need to identify dependent variables, and it is one of the best ways to find correlations between features. As shown below, the price feature (MSRP) has a strong correlation with Horsepower of 0.83. This matters because the stronger the relationship between the variables, the more accurate the model will be. This is how correlations between features can be found using a heat map; with its help, I can pick these related features for building my model.

# Plotting a heat map
plt.figure(figsize=(10, 5))
c = df.corr(numeric_only=True)  # correlate the numeric columns only
sns.heatmap(c, cmap="BrBG", annot=True)
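To read the MSRP row of the heat map directly, the correlation matrix computed above can be sliced and sorted; Horsepower should appear near the top at about 0.83:

# Ranking features by their correlation with MSRP
print(c['MSRP'].sort_values(ascending=False))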

Scatterplot between two related variables

I know that MSRP and Horsepower are the most closely related features. Since I have two related variables, I used a scatter plot to show their relationship; the plot of Horsepower against MSRP is shown below. From this plot we could easily draw a trend line during modeling, as there is a clear line of best fit. I have not included scatter plots of MSRP against EngineSize or Cylinders, because those features correlate less strongly with MSRP (0.54 and 0.64, respectively) than Horsepower does (0.83), so there is little reason to plot them.

# Plotting a scatter plot
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(df['Horsepower'], df['MSRP'])
plt.title('Scatter plot between MSRP and Horsepower')
ax.set_xlabel('Horsepower')
ax.set_ylabel('MSRP')
plt.show()
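Since a line of best fit is mentioned above, here is a minimal sketch of overlaying one with numpy's polyfit (this overlay is my addition, not part of the original solution):

# Fit a degree-1 polynomial (straight line) to Horsepower vs MSRP
slope, intercept = np.polyfit(df['Horsepower'], df['MSRP'], 1)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(df['Horsepower'], df['MSRP'])
xs = np.linspace(df['Horsepower'].min(), df['Horsepower'].max(), 100)
ax.plot(xs, slope * xs + intercept, color='red')  # the trend line
ax.set_xlabel('Horsepower')
ax.set_ylabel('MSRP')
plt.show()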

