Question

Instructions: Digital submission in PDF format (you can use Word or Markdown to create these and then convert to PDF). Your submission must include all R code.
I will be testing all of your code by copying and pasting into my R session, so make sure it works. Your solutions should be a narrative that incorporates R code (i.e. tell a story). R code by itself is not sufficient. This must be entirely an original work done by you. Copying someone else’s work (possibly found online) is not acceptable.

Assignment:

1) In homework 2, you gave the name of a site you would like to scrape. Follow through with this (you may change the site if you wish). The point of this exercise will be to ultimately provide an interesting visualization of the data which needs to be included in your solution. For this problem, the data must be scraped. Provide why you chose this site, and how someone would find your visualization useful (in other words, justify why anyone would care about this).

Solutions

Expert Solution

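The data come from IMDb's search results for feature films released in 2016. The fields of interest are listed below; the code that follows scrapes all of them except Metascore and Gross_Earning_in_Mil.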

  • Rank
  • Title
  • Description
  • Runtime
  • Genre
  • Rating
  • Metascore
  • Votes
  • Gross_Earning_in_Mil
  • Director
  • Actor
#Scraping a webpage in R
#Loading the rvest package
library('rvest')

#Specifying the url for desired website to be scraped
url <- 'http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature'

#Reading the HTML code from the website
webpage <- read_html(url)
#Start with the Rank field. Inspect the page to find the CSS selector that contains the rankings ('.text-primary').
#Using CSS selectors to scrape the rankings section
rank_data_html <- html_nodes(webpage,'.text-primary')

#Converting the ranking data to text
rank_data <- html_text(rank_data_html)

#Let's have a look at the rankings
head(rank_data)
#Once you have the data, check that it is in the desired format. Here the rankings are converted from text to numbers.
#Data-Preprocessing: Converting rankings to numerical
rank_data<-as.numeric(rank_data)

#Let's have another look at the rankings
head(rank_data)
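
The select-then-extract pattern used above can be tried offline before touching the live site. The following is a minimal sketch, assuming the rvest package is installed. The HTML fragment is invented for illustration and only reuses the '.text-primary' class name; it is not IMDb's actual markup, and the `sample_*` names are hypothetical.

```r
library(rvest)

# Hypothetical HTML fragment (NOT real IMDb markup) reusing the '.text-primary' class
sample_html <- paste0(
  '<div>',
  '<span class="text-primary">1.</span>',
  '<span class="text-primary">2.</span>',
  '<span class="other">ignored</span>',
  '</div>'
)

# read_html() also accepts a literal HTML string, so no network access is needed
sample_page <- read_html(sample_html)

# Select nodes by CSS class, then extract their text content
sample_rank_text <- html_text(html_nodes(sample_page, '.text-primary'))

# Convert the text ("1.", "2.") to numbers
sample_rank <- as.numeric(sample_rank_text)
print(sample_rank)
```

Only the two spans carrying the selected class are returned; the third span is skipped because its class does not match the selector.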
#Next, clear the selector and select all of the titles. Visually check that every title is selected, adding or removing elements with the cursor as needed.
#Using CSS selectors to scrape the title section
title_data_html <- html_nodes(webpage,'.lister-item-header a')

#Converting the title data to text
title_data <- html_text(title_data_html)

#Let's have a look at the title
head(title_data)
#The code below does the same for the Description, Runtime, Genre, Rating, Votes, Director and Actor fields.
#Using CSS selectors to scrape the description section
description_data_html <- html_nodes(webpage,'.ratings-bar+ .text-muted')

#Converting the description data to text
description_data <- html_text(description_data_html)

#Let's have a look at the description data
head(description_data)
#Data-Preprocessing: removing '\n'
description_data<-gsub("\n","",description_data)

#Let's have another look at the description data 
head(description_data)
#Using CSS selectors to scrape the Movie runtime section
runtime_data_html <- html_nodes(webpage,'.text-muted .runtime')

#Converting the runtime data to text
runtime_data <- html_text(runtime_data_html)

#Let's have a look at the runtime
head(runtime_data)
#Data-Preprocessing: removing the " min" suffix and converting to numerical

runtime_data<-gsub(" min","",runtime_data)
runtime_data<-as.numeric(runtime_data)

#Let's have another look at the runtime data
head(runtime_data)
#Using CSS selectors to scrape the Movie genre section
genre_data_html <- html_nodes(webpage,'.genre')

#Converting the genre data to text
genre_data <- html_text(genre_data_html)

#Let's have a look at the genre
head(genre_data)
#Data-Preprocessing: removing \n
genre_data<-gsub("\n","",genre_data)

#Data-Preprocessing: removing excess spaces
genre_data<-gsub(" ","",genre_data)

#taking only the first genre of each movie
genre_data<-gsub(",.*","",genre_data)

#Converting each genre from text to factor
genre_data<-as.factor(genre_data)

#Let's have another look at the genre data
head(genre_data)
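
To see what each genre clean-up step does, the same three substitutions can be run on a few hand-written strings. This is a small sketch using base R only; the sample strings are made up to resemble raw scraped text (leading newline, trailing spaces, comma-separated genres) and are not taken from the site.

```r
# Hypothetical raw strings imitating scraped genre text (illustrative only)
sample_genre_raw <- c(
  "\nAction, Adventure, Fantasy            ",
  "\nSci-Fi            ",
  "\nComedy, Drama"
)

# Step 1: remove newline characters
step1 <- gsub("\n", "", sample_genre_raw)

# Step 2: remove all spaces
step2 <- gsub(" ", "", step1)

# Step 3: drop everything from the first comma onward, keeping the first genre
step3 <- gsub(",.*", "", step2)

# Step 4: convert to a factor, as in the code above
sample_genre <- as.factor(step3)
print(sample_genre)
```

Keeping only the first genre discards information for multi-genre films, which is a simplification worth mentioning when interpreting any plot grouped by Genre.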
#Using CSS selectors to scrape the IMDB rating section
rating_data_html <- html_nodes(webpage,'.ratings-imdb-rating strong')

#Converting the ratings data to text
rating_data <- html_text(rating_data_html)

#Let's have a look at the ratings
head(rating_data)
#Data-Preprocessing: converting ratings to numerical
rating_data<-as.numeric(rating_data)

#Let's have another look at the ratings data
head(rating_data)
#Using CSS selectors to scrape the votes section
votes_data_html <- html_nodes(webpage,'.sort-num_votes-visible span:nth-child(2)')

#Converting the votes data to text
votes_data <- html_text(votes_data_html)

#Let's have a look at the votes data
head(votes_data)
#Data-Preprocessing: removing commas
votes_data<-gsub(",","",votes_data)

#Data-Preprocessing: converting votes to numerical
votes_data<-as.numeric(votes_data)

#Let's have another look at the votes data
head(votes_data)
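
A quick check for values that fail to convert can catch problems early. The sketch below uses base R only and hypothetical vote strings; the "N/A" entry is an assumed example of unexpected text, not something observed on the site.

```r
# Hypothetical vote strings; the last one simulates an unexpected value
sample_votes <- c("157,608", "9,412", "N/A")

# Strip thousands separators; fixed = TRUE treats "," as a literal character
sample_votes_clean <- gsub(",", "", sample_votes, fixed = TRUE)

# as.numeric() yields NA (with a warning) for non-numeric text;
# the warning is suppressed here because the NA is inspected explicitly
sample_votes_num <- suppressWarnings(as.numeric(sample_votes_clean))
print(sample_votes_num)

# Positions that could not be converted
bad_positions <- which(is.na(sample_votes_num))
print(bad_positions)
```

If `bad_positions` is non-empty on real data, the corresponding raw strings are worth inspecting before plotting.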
#Using CSS selectors to scrape the directors section
directors_data_html <- html_nodes(webpage,'.text-muted+ p a:nth-child(1)')

#Converting the directors data to text
directors_data <- html_text(directors_data_html)

#Let's have a look at the directors data
head(directors_data)
#Data-Preprocessing: converting directors data into factors
directors_data<-as.factor(directors_data)

#Using CSS selectors to scrape the actors section
actors_data_html <- html_nodes(webpage,'.lister-item-content .ghost+ a')

#Converting the actors data to text
actors_data <- html_text(actors_data_html)
#Data-Preprocessing: converting actors data into factors
actors_data<-as.factor(actors_data)
#Combining all the vectors to form a data frame
movies_df<-data.frame(Rank = rank_data, Title = title_data,
                      Description = description_data, Runtime = runtime_data,
                      Genre = genre_data, Rating = rating_data, Votes = votes_data,
                      Director = directors_data, Actor = actors_data)
#Structure of the data frame

str(movies_df)
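
data.frame() stops with an error when its columns have different lengths, which can happen if a selector returns fewer matches than there are films. A minimal sketch of a length check, using base R and stand-in vectors of deliberately unequal length (the values and the `fields` name are illustrative, not scraped data):

```r
# Stand-in vectors (illustrative values only); Rating is one element short
fields <- list(
  Rank   = c(1, 2, 3),
  Title  = c("Film A", "Film B", "Film C"),
  Rating = c(7.1, 6.5)
)

# lengths() returns a named integer vector with the length of each element
field_lengths <- lengths(fields)
print(field_lengths)

# The vectors can be combined safely only if all lengths agree
all_equal <- length(unique(field_lengths)) == 1

if (!all_equal) {
  short <- names(field_lengths)[field_lengths != max(field_lengths)]
  message("Length mismatch in: ", paste(short, collapse = ", "))
}
```

With the real vectors, the same check could be run on `list(Rank = rank_data, Title = title_data, ...)` just before the data.frame() call.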
#Let's have a look at the actors data
head(actors_data)
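#Analysing the data using graphs
library('ggplot2')

#Histogram of Runtime coloured by Genre: which genre had the film with the longest runtime?
#(ggplot() with geom_histogram() is used here because qplot() is deprecated in recent ggplot2 releases)
ggplot(movies_df,aes(x=Runtime,fill=Genre))+
geom_histogram(bins = 30)
#Scatter plot of Rating against Runtime; point size shows Votes and colour shows Genre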
ggplot(movies_df,aes(x=Runtime,y=Rating))+
geom_point(aes(size=Votes,col=Genre))
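
The question of which film and genre had the longest runtime can also be answered numerically to back up the plot. Below is a small base R sketch on a toy data frame; the titles and numbers are invented placeholders, and with real data `toy_df` would be replaced by `movies_df`.

```r
# Toy data frame with invented placeholder values (not scraped data)
toy_df <- data.frame(
  Title   = c("Film A", "Film B", "Film C"),
  Genre   = factor(c("Action", "Drama", "Action")),
  Runtime = c(118, 161, 133)
)

# Row with the single longest runtime
longest <- toy_df[which.max(toy_df$Runtime), c("Title", "Genre", "Runtime")]
print(longest)

# Longest runtime within each genre
max_by_genre <- tapply(toy_df$Runtime, toy_df$Genre, max)
print(max_by_genre)
```

Reporting these numbers alongside the histogram gives the narrative a concrete answer rather than relying on reading values off the chart.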
 
