Instructions: Digital submission in PDF format (you can use Word
or Markdown to create these and then convert to PDF). Your
submission must include all R code.
I will be testing all of your code by copying and pasting into my R
session, so make sure it works. Your solutions should be a
narrative that incorporates R code (i.e. tell a story). R code by
itself is not sufficient. This must be entirely an original work
done by you. Copying someone else’s work (possibly found online) is
not acceptable.
Assignment:
1) In homework 2, you gave the name of a site you would like to scrape. Follow through with this (you may change the site if you wish). The point of this exercise will be to ultimately provide an interesting visualization of the data which needs to be included in your solution. For this problem, the data must be scraped. Provide why you chose this site, and how someone would find your visualization useful (in other words, justify why anyone would care about this).
Now, we’ll be scraping the following data from this website.
#Scraping a webpage in R
#Loading the rvest package
library('rvest')
#Specifying the url for desired website to be scraped
url <- 'http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature'
#Reading the HTML code from the website
webpage <- read_html(url)
#We will start by scraping the Rank field ('.text-primary'), now that we know what we want to scrape and visualize, as described at the start.
#Check for the CSS selector that contains the data to be scraped
#Using CSS selectors to scrape the rankings section
rank_data_html <- html_nodes(webpage,'.text-primary')
#Converting the ranking data to text
rank_data <- html_text(rank_data_html)
#Let's have a look at the rankings
head(rank_data)
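The select-then-extract pattern used in this walkthrough can be tried without a network connection. The sketch below is only an illustration: the HTML fragment is made up (it borrows the class names from the selectors in this script but is not copied from the real page), and the `sample_*` names are hypothetical.

```r
library(rvest)

# Hypothetical HTML fragment that mimics the class names used in this script;
# it is NOT the real page content.
sample_html <- '<html><body>
  <div class="lister-item-content">
    <h3 class="lister-item-header">
      <span class="text-primary">1.</span>
      <a href="#">First Film</a>
    </h3>
  </div>
  <div class="lister-item-content">
    <h3 class="lister-item-header">
      <span class="text-primary">2.</span>
      <a href="#">Second Film</a>
    </h3>
  </div>
</body></html>'

# read_html() also accepts a literal HTML string
sample_page <- read_html(sample_html)

# Same two steps as in the script: select the nodes, then extract their text
sample_rank  <- html_text(html_nodes(sample_page, '.text-primary'))
sample_title <- html_text(html_nodes(sample_page, '.lister-item-header a'))

print(sample_rank)
print(sample_title)
```

If a selector returns an empty vector on the live page, the page layout has probably changed and the selector needs to be picked again.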
#Once you have the data, make sure it is in the desired format. Here I preprocess the rankings to convert them to numeric.
#Data-Preprocessing: Converting rankings to numerical
rank_data<-as.numeric(rank_data)
#Let's have another look at the rankings
head(rank_data)
#Now you can clear the selector field and select all the titles. Visually check that every title is selected, and make any necessary additions and deletions with your cursor. I have done the same here.
#Using CSS selectors to scrape the title section
title_data_html <- html_nodes(webpage,'.lister-item-header a')
#Converting the title data to text
title_data <- html_text(title_data_html)
#Let's have a look at the title
head(title_data)
#In the following code, I have done the same for scraping the Description, Runtime, Genre, Rating, Votes, Director and Actor data.
#Using CSS selectors to scrape the description section
description_data_html <- html_nodes(webpage,'.ratings-bar+ .text-muted')
#Converting the description data to text
description_data <- html_text(description_data_html)
#Let's have a look at the description data
head(description_data)
#Data-Preprocessing: removing '\n'
description_data<-gsub("\n","",description_data)
#Let's have another look at the description data
head(description_data)
#Using CSS selectors to scrape the Movie runtime section
runtime_data_html <- html_nodes(webpage,'.text-muted .runtime')
#Converting the runtime data to text
runtime_data <- html_text(runtime_data_html)
#Let's have a look at the runtime
head(runtime_data)
#Data-Preprocessing: removing mins and converting it to numerical
runtime_data<-gsub(" min","",runtime_data)
runtime_data<-as.numeric(runtime_data)
#Let's have another look at the runtime data
head(runtime_data)
#Using CSS selectors to scrape the Movie genre section
genre_data_html <- html_nodes(webpage,'.genre')
#Converting the genre data to text
genre_data <- html_text(genre_data_html)
#Let's have a look at the genre
head(genre_data)
#Data-Preprocessing: removing \n
genre_data<-gsub("\n","",genre_data)
#Data-Preprocessing: removing excess spaces
genre_data<-gsub(" ","",genre_data)
#Taking only the first genre of each movie
genre_data<-gsub(",.*","",genre_data)
#Converting each genre from text to factor
genre_data<-as.factor(genre_data)
#Let's have another look at the genre data
head(genre_data)
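To see what each genre clean-up step does, here is a small worked example on made-up strings. The raw values are an assumption about the shape of the text the '.genre' nodes return (a leading newline, a comma-separated list, trailing spaces); they are not taken from the site.

```r
# Made-up raw strings in the shape the '.genre' nodes are assumed to return
raw_genre <- c("\nAction, Adventure, Sci-Fi            ",
               "\nComedy            ",
               "\nDrama, Romance            ")

clean_genre <- gsub("\n", "", raw_genre)      # drop newlines
clean_genre <- gsub(" ", "", clean_genre)     # drop every space
clean_genre <- gsub(",.*", "", clean_genre)   # keep only the text before the first comma
clean_genre <- as.factor(clean_genre)         # one factor level per distinct first genre

print(clean_genre)
print(levels(clean_genre))
```

Note that the second step removes every space, not only the padding, so a genre label that contained a space would be squeezed into one word.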
#Using CSS selectors to scrape the IMDB rating section
rating_data_html <- html_nodes(webpage,'.ratings-imdb-rating strong')
#Converting the ratings data to text
rating_data <- html_text(rating_data_html)
#Let's have a look at the ratings
head(rating_data)
#Data-Preprocessing: converting ratings to numerical
rating_data<-as.numeric(rating_data)
#Let's have another look at the ratings data
head(rating_data)
#Using CSS selectors to scrape the votes section
votes_data_html <- html_nodes(webpage,'.sort-num_votes-visible span:nth-child(2)')
#Converting the votes data to text
votes_data <- html_text(votes_data_html)
#Let's have a look at the votes data
head(votes_data)
#Data-Preprocessing: removing commas
votes_data<-gsub(",","",votes_data)
#Data-Preprocessing: converting votes to numerical
votes_data<-as.numeric(votes_data)
#Let's have another look at the votes data
head(votes_data)
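The runtime and votes steps follow the same pattern: strip some text, then call as.numeric(). A minimal sketch of that pattern as a reusable function is shown below. The helper `to_number` and the sample strings are hypothetical and not part of the original script; the helper also reports values that fail to convert, which as.numeric() would otherwise turn into NA with only a warning.

```r
# Hypothetical helper (not in the original script): remove a fixed piece of
# text, convert to numeric, and report how many values failed to parse.
to_number <- function(x, strip = ",") {
  cleaned <- gsub(strip, "", x, fixed = TRUE)
  out <- suppressWarnings(as.numeric(cleaned))
  n_bad <- sum(is.na(out) & !is.na(x))
  if (n_bad > 0) message(n_bad, " value(s) could not be converted")
  out
}

# Made-up sample values
sample_votes   <- to_number(c("543,210", "98,765", "1,234"))
sample_runtime <- to_number(c("107 min", "139 min"), strip = " min")
sample_bad     <- to_number(c("7.9", "n/a"))   # second value becomes NA

print(sample_votes)
print(sample_runtime)
print(sample_bad)
```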
#Using CSS selectors to scrape the directors section
directors_data_html <- html_nodes(webpage,'.text-muted+ p a:nth-child(1)')
#Converting the directors data to text
directors_data <- html_text(directors_data_html)
#Let's have a look at the directors data
head(directors_data)
#Data-Preprocessing: converting directors data into factors
directors_data<-as.factor(directors_data)
#Using CSS selectors to scrape the actors section
actors_data_html <- html_nodes(webpage,'.lister-item-content .ghost+ a')
#Converting the actors data to text
actors_data <- html_text(actors_data_html)
#Data-Preprocessing: converting actors data into factors
actors_data<-as.factor(actors_data)
#Combining all the lists to form a data frame
movies_df<-data.frame(Rank = rank_data, Title = title_data,
                      Description = description_data, Runtime = runtime_data,
                      Genre = genre_data, Rating = rating_data,
                      Votes = votes_data, Director = directors_data,
                      Actor = actors_data)
#Structure of the data frame
str(movies_df)
#Let's have a look at the actors data
head(actors_data)
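Combining the scraped vectors with data.frame() only gives a sensible result when every vector has the same length. If a field is missing for some movies, data.frame() either stops with an error or, when one length divides another, silently recycles the shorter vector. A small check before building the data frame can catch this. The helper `check_lengths` below is hypothetical and the inputs are made-up values.

```r
# Hypothetical helper (not in the original script): stop early if the
# vectors that will become columns differ in length.
check_lengths <- function(...) {
  cols <- list(...)
  lens <- vapply(cols, length, integer(1))
  if (length(unique(lens)) > 1) {
    stop("Column lengths differ: ",
         paste(names(lens), lens, sep = " = ", collapse = ", "))
  }
  invisible(lens)
}

# Equal lengths: passes quietly and returns the lengths
lens_ok <- check_lengths(Rank = 1:3, Title = c("A", "B", "C"))

# Unequal lengths: the error message names each column and its length
msg <- tryCatch(
  { check_lengths(Rank = 1:3, Rating = c(7.1, 6.8)); "no error" },
  error = function(e) conditionMessage(e)
)
print(msg)
```

In the script this would be called with the real vectors, for example check_lengths(Rank = rank_data, Title = title_data, Rating = rating_data), just before the data.frame() call.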
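#Analysing the data using graphs
library('ggplot2')
#Histogram of runtime, filled by genre (ggplot() is used here because qplot() is deprecated in recent ggplot2 releases)
ggplot(movies_df,aes(x=Runtime,fill=Genre))+ geom_histogram(bins=30)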
#Visualize the data: which movie from which genre had the longest runtime?
#Scatter plot of rating against runtime; point size shows votes and colour shows genre
ggplot(movies_df,aes(x=Runtime,y=Rating))+ geom_point(aes(size=Votes,col=Genre))
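The plots show the pattern, but the question of which movie from which genre had the longest runtime can also be answered directly from the data frame. The sketch below uses a toy data frame with made-up values in place of movies_df; `toy_df` and `longest` are hypothetical names.

```r
# Toy data frame with made-up values, shaped like a few columns of movies_df
toy_df <- data.frame(
  Title   = c("Film A", "Film B", "Film C", "Film D"),
  Genre   = factor(c("Action", "Drama", "Action", "Comedy")),
  Runtime = c(118, 161, 139, 95)
)

# which.max() returns the index of the first row with the largest runtime
longest <- toy_df[which.max(toy_df$Runtime), c("Title", "Genre", "Runtime")]
print(longest)
```

Replacing toy_df with movies_df would give the same lookup on the scraped data, assuming the Runtime column contains no missing values in the row of interest.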