Question

For Questions 1-4, we will load the 1000_Companies.csv dataset, which contains data on 1000 companies, such as R&D, administration, and marketing spending and location. We will use this data to build a machine-learning-based decision support system model to predict companies' profit.

Question 1: 10 Points (Load Data)

  • (A) Load the "1000_Companies.csv" dataset - 5 points
  • (B) Display the first and last 5 rows of this dataset - 5 points

In [ ]:

 

Question 2: 15 Points (Manipulate Data)

  • (A) Extract the independent (Feature Matrix) and dependent (target vector) variables. - 5 points

  • (B) Encode the categorical data using the following steps:

    1) Integer Encoding - 5 points

    2) One-Hot Encoding - 5 points

In [1]:

 
#(A)Extract the independent (Feature Matrix) and dependent (target vector) variables.
#(B) Encode the categorical data using the following steps
##1)Integer Encoding
##2) One-Hot Encoding

Question 3: 35 Points (Modeling)

  • (A) Split the dataset into the training and test sets. Hint: Use train_test_split(test_size=0.3, shuffle = False) - 5 points
  • (B) Use Linear Regression Modeling to train your model (Name your model as Model1_LRM) - 5 points
  • (C) Use the trained model (Model1_LRM) and the test dataset for prediction - 5 points
  • (D) Calculate the accuracy of your Model1_LRM model. Hint: Use r2_score from sklearn.metrics - 5 points
  • (E) Use Random Forest Regressor Modeling to train your model (Name your model Model2_RFR) - 5 points
  • (F) Use the trained model(Model2_RFR) and the test dataset for prediction - 5 points
  • (G) Calculate the accuracy of your Model2_RFR model. Hint: Use r2_score from sklearn.metrics - 5 points

In [2]:

 
#(A) Split the dataset into the training and test sets. Hint: Use train_test_split(test_size=0.3, shuffle = False)
#(B) Use Linear Regression Modeling to train your model (Name your model as Model1_LRM)
#(C) Use the trained model (Model1_LRM) and the test dataset for prediction
#(D) Calculate the accuracy of your Model1_LRM model. Hint: Use r2_score from sklearn.metrics

In [28]:

 
#(E) Use Random Forest Regressor Modeling to train your model (Name your model Model2_RFR)
#(F) Use the trained model(Model2_RFR) and the test dataset for prediction
#(G) Calculate the accuracy of your Model2_RFR model. Hint: Use r2_score from sklearn.metrics

Out[28]:

0.724613670616963

Solutions

Expert Solution

Importing the libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Importing the dataset

  • Question 1: Load Data

(A) Load the "1000_Companies.csv" dataset

(B) Display the first and last 5 rows of this dataset

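# (A) Load the dataset (adjust the path to wherever the CSV is stored)
dataset = pd.read_csv('/content/profit_estimation_of_companies/1000_Companies.csv')

# (B) A notebook cell only displays its last expression,
# so print both views explicitly

print("First five rows")
print(dataset.head(5))

print("Last five rows")
print(dataset.tail(5))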

Question 2: Manipulate Data

(A) Extract the independent (Feature Matrix) and dependent (target vector) variables.

(B) Encode the categorical data using the following steps:

1) Integer Encoding

2) One-Hot Encoding

X = dataset.iloc[:, :-1].values  # features: every column except the last
y = dataset.iloc[:, -1].values   # target: the last column (profit)

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# State column
labelencoder = LabelEncoder()
X[:,3] = labelencoder.fit_transform(X[:,3])
ct = ColumnTransformer([("State", OneHotEncoder(), [3])], remainder = 'passthrough')
X = ct.fit_transform(X)
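
To make the two encoding steps concrete, here is a minimal plain-Python sketch of what integer encoding followed by one-hot encoding produces. The state names are hypothetical placeholders, since the actual contents of the State column are not shown here, and the variable names are illustrative only.

```python
# Hypothetical State values; the real column contents are an assumption.
states = ["New York", "California", "Florida", "California"]

# 1) Integer encoding: map each distinct category to an integer.
#    Categories are sorted first, which mirrors how LabelEncoder orders classes.
categories = sorted(set(states))
to_int = {name: i for i, name in enumerate(categories)}
int_encoded = [to_int[s] for s in states]

# 2) One-hot encoding: one 0/1 column per category,
#    with a single 1 marking the row's category.
one_hot = [
    [1 if code == i else 0 for i in range(len(categories))]
    for code in int_encoded
]

print(categories)   # ['California', 'Florida', 'New York']
print(int_encoded)  # [2, 0, 1, 0]
print(one_hot)      # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
```

In the solution code above, ColumnTransformer places the one-hot columns first and the passed-through numeric columns after them. Recent scikit-learn versions of OneHotEncoder also accept string categories directly, so the LabelEncoder step is kept mainly because the assignment asks for it.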

Question 3: Modeling

(A) Split the dataset into the training and test sets. Hint: Use train_test_split(test_size=0.3, shuffle = False)

(B) Use Linear Regression Modeling to train your model (Name your model as Model1_LRM)

(C) Use the trained model (Model1_LRM) and the test dataset for prediction

(D) Calculate the accuracy of your Model1_LRM model. Hint: Use r2_score from sklearn.metrics

(E) Use Random Forest Regressor Modeling to train your model (Name your model Model2_RFR)

(F) Use the trained model(Model2_RFR) and the test dataset for prediction

(G) Calculate the accuracy of your Model2_RFR model. Hint: Use r2_score from sklearn.metrics

Linear Regression:

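from sklearn.model_selection import train_test_split
# shuffle=False keeps the original row order; random_state has no effect
# without shuffling, so it is omitted
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)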
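
With shuffle=False, the split is just a cut through the rows in their original order: the first 70% become the training set and the last 30% the test set. The sketch below shows that idea in plain Python. The helper name ordered_split is hypothetical, and its rounding is an assumption that may differ from scikit-learn's in edge cases.

```python
def ordered_split(rows, test_size=0.3):
    """Hypothetical helper: keep row order, use the tail as the test set."""
    n_test = round(len(rows) * test_size)
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]

# Stand-in for the 1000 company rows (row indices only).
rows = list(range(1000))
train, test = ordered_split(rows)

print(len(train), len(test))  # 700 300
print(train[-1], test[0])     # 699 700
```

Because nothing is shuffled, any ordering in the CSV (for example, rows sorted by profit or grouped by state) carries over into the split, which can affect the scores reported below.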

#Linear Regression Modelling

from sklearn.linear_model import LinearRegression

# train model
Model1_LRM = LinearRegression()
Model1_LRM.fit(X_train,y_train)

# predict results
y_pred = Model1_LRM.predict(X_test)

# R^2 score on the test set (the "accuracy" measure the assignment asks for)
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)

Random Forest Regressor:

#Random Forest Regressor

# train model
# (no random_state is set, so the score can vary slightly between runs)
from sklearn.ensemble import RandomForestRegressor
Model2_RFR = RandomForestRegressor()
Model2_RFR.fit(X_train, y_train)

# predict results
y_pred = Model2_RFR.predict(X_test)

# R^2 score on the test set (the "accuracy" measure the assignment asks for)
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)
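
Both models are scored with r2_score, which the assignment calls "accuracy". Strictly, it is the coefficient of determination, 1 - SS_res / SS_tot, rather than a classification accuracy. A minimal plain-Python sketch of that formula follows; the function name r2 and the sample values are illustrative assumptions, not taken from the dataset.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up target values for illustration.
y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r2(y_true, y_true)       # predictions match exactly
baseline = r2(y_true, [6.0] * 4)   # always predicting the mean of y_true

print(perfect)   # 1.0
print(baseline)  # 0.0
```

A score of 1.0 means the predictions match the targets exactly, 0.0 means the model does no better than always predicting the mean, and negative values are possible when it does worse than that.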
