For Questions 1-4, we will load the 1000_Companies.csv dataset, which contains data on 1000 companies, such as their R&D, administration, and marketing spending and their location. We will use this data to build a machine-learning-based decision support system model to predict companies' profit.
Question 1: 10 Points (Load Data)
In [ ]:
Question 2: 15 Points (Manipulate Data)
(A) Extract the independent (Feature Matrix) and dependent (target vector) variables. - 5 points
(B) Encode the categorical data using the following steps:
1) Integer Encoding - 5 points
2) One-Hot Encoding - 5 points
In [1]:
# (A) Extract the independent (Feature Matrix) and dependent (target vector) variables.
# (B) Encode the categorical data using the following steps:
## 1) Integer Encoding
## 2) One-Hot Encoding
Question 3: 35 Points (Modeling)
In [2]:
#(A) Split the dataset into the training and test sets. Hint: Use train_test_split(test_size=0.3, shuffle = False)
#(B) Use Linear Regression Modeling to train your model (Name your model as Model1_LRM)
#(C) Use the trained model (Model1_LRM) and the test dataset for prediction
#(D) Calculate the accuracy of your Model1_LRM model. Hint: Use r2_score from sklearn.metrics
In [28]:
#(E) Use Random Forest Regressor Modeling to train your model (Name your model Model2_RFR)
#(F) Use the trained model (Model2_RFR) and the test dataset for prediction
#(G) Calculate the accuracy of your Model2_RFR model. Hint: Use r2_score from sklearn.metrics
Out[28]:
0.724613670616963
Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
Importing the dataset
(A) Load the "1000_Companies.csv" dataset
(B) Display the first and last 5 rows of this dataset
# (A)
dataset = pd.read_csv('/content/profit_estimation_of_companies/1000_Companies.csv')
# (B)
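# Print both frames explicitly: a notebook cell only auto-displays its last expression
print("First five rows")
print(dataset.head(5))
print("Last five rows")
print(dataset.tail(5))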
Question 2: Manipulate Data
(A) Extract the independent (Feature Matrix) and dependent (target vector) variables.
(B) Encode the categorical data using the following steps:
1) Integer Encoding
2) One-Hot Encoding
X = dataset.iloc[:, :-1].values  # features: every column except the last
y = dataset.iloc[:, 4].values    # target: the fifth column (profit)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
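# 1) Integer encoding of the State column (index 3)
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
# 2) One-hot encoding of the same column; the other columns pass through unchanged
ct = ColumnTransformer([("State", OneHotEncoder(), [3])], remainder='passthrough')
X = ct.fit_transform(X)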
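To make the two encoding steps concrete, here is a minimal NumPy-only sketch of what integer encoding and one-hot encoding produce. The state names are toy values chosen for illustration, not rows taken from the dataset, and the variable names are hypothetical. Like LabelEncoder, np.unique assigns codes in sorted order of the category labels.

```python
import numpy as np

# Toy categorical column (illustrative values, not read from the CSV)
states = np.array(["New York", "California", "Florida", "New York", "Florida"])

# Integer encoding: each category gets an index into the sorted unique labels
categories, codes = np.unique(states, return_inverse=True)

# One-hot encoding: row i of the identity matrix represents category i
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(codes)
print(one_hot)
```

Each row of one_hot contains exactly one 1, so the model sees the categories as separate indicator columns rather than as an ordered numeric scale.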
Question 3: Modeling
(A) Split the dataset into the training and test sets. Hint: Use train_test_split(test_size=0.3, shuffle = False)
(B) Use Linear Regression Modeling to train your model (Name your model as Model1_LRM)
(C) Use the trained model (Model1_LRM) and the test dataset for prediction
(D) Calculate the accuracy of your Model1_LRM model. Hint: Use r2_score from sklearn.metrics
(E) Use Random Forest Regressor Modeling to train your model (Name your model Model2_RFR)
(F) Use the trained model (Model2_RFR) and the test dataset for prediction
(G) Calculate the accuracy of your Model2_RFR model. Hint: Use r2_score from sklearn.metrics
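The hint in part (A) sets shuffle=False, which means the rows are not reordered before splitting: the leading rows become the training set and the trailing rows become the test set. The sketch below imitates that behaviour on toy arrays. The helper name ordered_split is hypothetical, and the rounding (test count rounded up) reflects an assumption about how train_test_split sizes the sets.

```python
import math
import numpy as np

def ordered_split(X, y, test_size=0.3):
    """Split without shuffling: leading rows train, trailing rows test."""
    n_samples = len(X)
    n_test = math.ceil(test_size * n_samples)  # assumed rounding rule
    n_train = n_samples - n_test
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = ordered_split(X_toy, y_toy)
print(y_tr)  # targets of the first 7 samples
print(y_te)  # targets of the last 3 samples
```

Because the split is ordered, any ordering in the file (for example, rows sorted by a column) carries over into the train and test sets.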
Linear Regression:
from sklearn.model_selection import train_test_split
# Ordered split (no shuffling): first 70% of rows for training, last 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)
# Linear Regression Modelling
from sklearn.linear_model import LinearRegression
# train model
Model1_LRM = LinearRegression()
Model1_LRM.fit(X_train,y_train)
# predict results
y_pred = Model1_LRM.predict(X_test)
# "accuracy" of the model, measured as the R^2 score on the test set
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)
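The score used above is the coefficient of determination, R^2 = 1 - SS_res / SS_tot, rather than a classification-style accuracy. As a rough illustration, the sketch below computes it by hand on toy numbers. The function name r2_manual and the toy values are hypothetical and are not results from the dataset.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy targets and predictions (illustrative only)
y_true_toy = [3.0, 5.0, 7.0, 9.0]
y_pred_toy = [2.5, 5.5, 7.0, 9.5]

score = r2_manual(y_true_toy, y_pred_toy)
print(score)  # 1 - 0.75 / 20 = 0.9625
```

A score of 1.0 means perfect predictions, 0.0 matches always predicting the mean of the targets, and negative values indicate predictions worse than that baseline.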
Random Forest Regressor:
# Random Forest Regressor
# train model
from sklearn.ensemble import RandomForestRegressor
Model2_RFR = RandomForestRegressor()
Model2_RFR.fit(X_train, y_train)
# predict results
y_pred = Model2_RFR.predict(X_test)
# "accuracy" of the model, measured as the R^2 score on the test set
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)