Question

In: Computer Science

Submit a processed dataset and Python or SAS script that has been used along with a short description of the steps you have been following.

Solutions

Expert Solution

Link for the unprocessed dataset: https://drive.google.com/file/d/1HQALy5DdsuT8jBNUNQd8Eq0KHgR3JnyG/view?usp=sharing

#Python script for processing the data

import pandas as pd
import numpy as np

#Reading the dataset using pandas
df = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv')
print(df.head())

#Checking how many rows contain null values in a column
print(df['Edition Statement'])
print(df['Edition Statement'].isnull().sum())

#Dropping columns that carry little or no useful information
to_drop = ['Edition Statement',
           'Corporate Author',
           'Corporate Contributors',
           'Former owner',
           'Engraver',
           'Contributors',
           'Issuance type',
           'Shelfmarks']

df.drop(to_drop, inplace = True, axis = 1)

#We can use the DataFrame.info() method to give us some high level information about our dataframe, including its size, information about data types and memory usage.
df.info(memory_usage='deep')

#Setting up a unique identifier for each record instead of the default serial number
df = df.set_index('Identifier')

#Cleaning columns using the .apply function
#Removing unwanted characters from the 'Date of Publication' column

unwanted_characters = ['[', ',', '-'] #the date is cut off at the first of these characters

def clean_dates(item):
    dop = str(item.loc['Date of Publication'])

    #Missing dates and dates that start with a bracket become NaN
    if dop == 'nan' or dop.startswith('['):
        return np.nan

    #Keep only the text before each unwanted character
    for character in unwanted_characters:
        if character in dop:
            character_index = dop.find(character)
            dop = dop[:character_index]

    return dop

df['Date of Publication'] = df.apply(clean_dates, axis = 1)

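#Cleaning the Title column
def clean_title(title):
    #Leave missing titles as missing
    if pd.isnull(title):
        return np.nan

    title = str(title)

    #Keep only the text inside a leading pair of square brackets
    if title.startswith('['):
        end = title.find(']')
        title = title[1:end] if end != -1 else title[1:]

    #Drop an author credit such as "... by John Smith"
    if ' by ' in title:
        title = title[:title.find(' by ')]
    elif ' By ' in title:
        title = title[:title.find(' By ')]

    #Drop any trailing bracketed note
    if '[' in title:
        title = title[:title.find('[')]

    #Strip trailing whitespace and punctuation
    title = title.rstrip(' .,;:')

    #Capitalize each word
    title = list(map(str.capitalize, title.split()))
    return ' '.join(title)

df['Title'] = df['Title'].apply(clean_title)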

#Saving the processed dataframe to a csv file, keeping the Identifier index as the first column
df.to_csv('processed_dataset.csv', index = True, header = True)
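
The script above drops a fixed list of columns. One way to check such a choice is to measure the share of missing values in each column first. The following is only a minimal sketch on a small made-up frame: the values are invented, 'Place of Publication' is used as a placeholder column, and the 0.9 cutoff is an assumption rather than something taken from the original solution.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; values are invented for illustration only
demo = pd.DataFrame({
    'Identifier': [206, 216, 218, 472],
    'Edition Statement': [np.nan, np.nan, np.nan, np.nan],
    'Place of Publication': ['London', 'London', np.nan, 'Oxford'],
})

# Fraction of missing values per column (mean of the boolean null mask)
null_share = demo.isnull().mean()

# Columns whose missing share exceeds an assumed cutoff of 90%
mostly_empty = null_share[null_share > 0.9].index.tolist()

print(null_share)
print(mostly_empty)
```

On the real dataset the same two lines could be run on df before calling df.drop, and the resulting list compared with to_drop.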

Link for the processed dataset: https://drive.google.com/file/d/1h10lnhShmMIquQl-MN25KrnmKPdwCak2/view?usp=sharing
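
As a possible alternative to the row-wise date cleaning in the script, the leading four-digit year can be extracted with one vectorized regular expression. This is a sketch under assumptions, not the method used in the solution: the sample strings are invented to imitate entries with brackets, commas and hyphens, and the pattern assumes the year is always the first four characters.

```python
import pandas as pd

# Hypothetical sample values; not taken from the actual dataset
sample = pd.Series(
    ['1879 [1878]', '1868', '[1897?]', '1839, 38-54', '1860-63', None],
    name='Date of Publication',
)

# Capture four digits at the very start of the string;
# entries that do not start with four digits (or are missing) become NaN
extracted = sample.str.extract(r'^(\d{4})', expand=False)

# Convert the extracted text to numbers so the column can be sorted or filtered
years = pd.to_numeric(extracted)

print(years)
```

Like the original function, this leaves entries that begin with a bracket as missing, and it additionally returns a numeric column instead of strings.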

