Question

Submit a processed dataset and Python or SAS script that has been used along with a short description of the steps you have been following.

Solutions

Expert Solution

Link for the unprocessed dataset: https://drive.google.com/file/d/1HQALy5DdsuT8jBNUNQd8Eq0KHgR3JnyG/view?usp=sharing

#Python script for processing the data

import pandas as pd
import numpy as np

#Reading the dataset using pandas
#(a forward slash avoids the invalid '\B' escape sequence in the path)
df = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv')
print(df.head())

#checking how many rows contain null values in a column
print(df['Edition Statement'])
print(df['Edition Statement'].isnull().sum())

#Dropping columns that do not contain any useful information
to_drop = ['Edition Statement',
           'Corporate Author',
           'Corporate Contributors',
           'Former owner',
           'Engraver',
           'Contributors',
           'Issuance type',
           'Shelfmarks']

df.drop(columns=to_drop, inplace=True)

#The DataFrame.info() method gives high-level information about the dataframe, including its size, data types and memory usage.
df.info(memory_usage='deep')

#Setting a unique identifier for each record as the index instead of the default serial number
df = df.set_index('Identifier')

#Cleaning columns using the .apply function
#removing unwanted characters from the date of publication column

unwanted_characters = ['[', ',', '-']  #the date is cut off at the first of these

def clean_dates(item):
    dop = str(item.loc['Date of Publication'])

    #missing dates, and dates that start with a bracket, become NaN
    if dop == 'nan' or dop[0] == '[':
        return np.nan

    #keep only the text before each unwanted character
    for character in unwanted_characters:
        if character in dop:
            character_index = dop.find(character)
            dop = dop[:character_index]

    #strip whitespace left over after truncation, e.g. '1879 ' -> '1879'
    return dop.strip()

df['Date of Publication'] = df.apply(clean_dates, axis=1)

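#Cleaning the title column
def clean_title(title):
    #missing titles (real NaN values or the string 'nan') are returned as 'NaN'
    if pd.isna(title) or title == 'nan':
        return 'NaN'

    #keep only the text inside a leading pair of square brackets
    if title[0] == '[':
        title = title[1: title.find(']')]

    #drop the author credit that follows 'by' or 'By'
    if 'by' in title:
        title = title[:title.find('by')]
    elif 'By' in title:
        title = title[:title.find('By')]

    #drop anything after a remaining opening bracket
    if '[' in title:
        title = title[:title.find('[')]

    #trim the last two characters
    title = title[:-2]

    #capitalize each word
    title = list(map(str.capitalize, title.split()))
    return ' '.join(title)

df['Title'] = df['Title'].apply(clean_title)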

#saving the processed dataframe into a csv
#index=True writes the 'Identifier' index to the file; index=None would drop it
df.to_csv('processed_dataset.csv', index=True, header=True)
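
The date-cleaning rule used in the script can be tried on a few values without downloading the dataset. The following is a minimal sketch, not the script itself: the sample strings are made up for illustration, and the helper name clean_date (which takes a single value instead of a whole row and strips surrounding whitespace from the result) is an assumption of this example.

```python
import numpy as np
import pandas as pd

UNWANTED_CHARACTERS = ['[', ',', '-']

def clean_date(value):
    """Truncate a publication date at the first unwanted character."""
    dop = str(value)
    # Missing values and dates starting with a bracket are treated as unknown
    if dop == 'nan' or dop[0] == '[':
        return np.nan
    # Keep only the text in front of each unwanted character
    for character in UNWANTED_CHARACTERS:
        if character in dop:
            dop = dop[:dop.find(character)]
    return dop.strip()

# Made-up sample values, not rows taken from the real dataset
sample = pd.Series(['1879 [1878]', '1868', '1839, 38-54', '[1897?]', np.nan, '1860-63'])
cleaned = sample.apply(clean_date)
print(cleaned.tolist())
```

Applying the function to a single column with Series.apply avoids passing every full row through the function, which is what df.apply(..., axis = 1) does.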

Link for the processed dataset: https://drive.google.com/file/d/1h10lnhShmMIquQl-MN25KrnmKPdwCak2/view?usp=sharing
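
Because the script sets 'Identifier' as the index, whether that column reaches the CSV file depends on the index argument passed to to_csv. A minimal sketch of the round trip follows, assuming a made-up two-row frame and an in-memory buffer in place of the real file. It shows the identifier surviving when the index is written and then read back with index_col.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the processed data
df = pd.DataFrame(
    {'Identifier': [206, 216], 'Title': ['Walter Forbes', 'All For Greed']}
).set_index('Identifier')

# Write to an in-memory buffer; index=True is the default, so 'Identifier' is kept
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# Read the data back and restore 'Identifier' as the index
restored = pd.read_csv(buffer, index_col='Identifier')
print(restored)
```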

