Question

Write a Python program to complete a big-data processing task: finding the most frequently used words on Wikipedia pages.

The program generates a list of the distinct words used in the Wikipedia pages, together with the number of occurrences of each word on those pages. The words are sorted by number of occurrences in ascending order. The following is a sample of the output generated for 4 Wikipedia pages.

126 that
128 by
133 as
149 or
160 for
164 is
189 on
191 from
345 to
375 advertising
443 a
473 and
480 in
677 of
1080 the
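
The "count word" format above, sorted in ascending order, can be sketched roughly as follows. This assumes the counts are already available as a dictionary; `format_counts` is a hypothetical helper name, not part of the assignment, and the tie-breaking rule (alphabetical) is also an assumption.

```python
def format_counts(counts):
    """Return 'count word' lines sorted by count in ascending order."""
    # Sort by count first, then alphabetically so that ties are deterministic
    # (the tie-breaking rule is an assumption; the task does not specify one).
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return [f"{count} {word}" for word, count in ordered]


# A few values taken from the sample output above.
sample = {"the": 1080, "of": 677, "in": 480, "and": 473}
for line in format_counts(sample):
    print(line)
```

Sorting on the `(count, word)` pair keeps the most frequent word on the last line, which matches the sample.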

Since Wikipedia has a huge number of pages, it is not realistic to analyze all of them in a short time on one machine. In this project, you need to analyze only the pages for Wikipedia entries whose titles consist of two capital letters. For example, the Wikipedia page for the entry "AC" is https://en.wikipedia.org/wiki/AC . Use the urllib or urllib2 library to download a page.
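
As a rough sketch of this step: the code below builds the URL of every two-capital-letter entry and defines a download function using `urllib.request`, which is the Python 3 counterpart of `urllib2` (that module exists only in Python 2). The names `two_letter_urls` and `download`, the timeout, and the User-Agent string are assumptions made for illustration.

```python
import itertools
import string
import urllib.request

BASE_URL = "https://en.wikipedia.org/wiki/"


def two_letter_urls():
    """Build the URL of every two-capital-letter entry, from AA to ZZ."""
    # product(..., repeat=2) also yields doubled letters such as "AA".
    pairs = itertools.product(string.ascii_uppercase, repeat=2)
    return [BASE_URL + "".join(pair) for pair in pairs]


def download(url, timeout=10):
    """Return the raw HTML bytes of a page (not called below: needs network access)."""
    # The User-Agent value is a placeholder; some servers reject the default one.
    request = urllib.request.Request(url, headers={"User-Agent": "word-count-exercise/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


urls = two_letter_urls()
print(len(urls), urls[0], urls[-1])
```

There are 26 * 26 = 676 such entries. Some of them may not have a page, so a real run would also need to handle HTTP errors raised by `urlopen`.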


An HTML page contains HTML tags, which should be removed before the analysis. Use the BeautifulSoup library to convert a page from HTML to plain text.
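
A minimal sketch of the HTML-to-text conversion, assuming the `beautifulsoup4` package is installed. `html_to_text` and the sample HTML string are made up for illustration; dropping `script` and `style` elements is an extra precaution that the assignment does not ask for.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Strip the tags from an HTML string and return the remaining text."""
    soup = BeautifulSoup(html, "html.parser")
    # Script and style contents are not page text, so drop them first.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join the text fragments with spaces so that adjacent words do not merge.
    return soup.get_text(separator=" ", strip=True)


sample_html = (
    "<html><body><h1>AC</h1>"
    "<p>AC may refer to <b>alternating current</b>.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample_html))
```

Passing `separator=" "` matters for word counting: without it, text from neighbouring tags can be glued together into a single token.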

Solutions

Expert Solution

from bs4 import BeautifulSoup,Comment
import urllib3
import itertools
import requests

# To install these requirements, run: pip3 install beautifulsoup4 urllib3 requests

def textToextract(passableValue):
    # Words to look for; add more keys to this dictionary to search for more words.
    textDict = {"that":0,"by":0,"as":0,"or":0,"for":0,"is":0,"on":0,"from":0,"to":0}

    # Split the lowercased page text into words and strip surrounding punctuation,
    # so that "as" is not also counted inside words such as "was" or "fast".
    words = [word.strip(".,;:!?()[]{}\"'") for word in passableValue.lower().split()]

    # Count how many times each key occurs as a whole word in the page text.
    for key in textDict:
        textDict[key] = words.count(key)
    print(textDict)

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# This silences the warning urllib3 emits for HTTPS requests made without certificate verification.
# It is optional here, because requests verifies certificates by default.

alphabets = ["A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z"]
combinations = list(map("".join, itertools.product(alphabets, repeat=2)))
""" itertools.product with repeat=2 yields every ordered pair of letters,
including doubled letters such as ("A", "A"), which itertools.permutations would skip.
map joins each pair into a single string, and list collects the results.
The resulting list looks like ["AA", "AB", ..., "ZZ"] and has 26 * 26 = 676 entries.
"""

for i in combinations:
    url = "https://en.wikipedia.org/wiki/" + i
    # Send a GET request to download the page.
    html = requests.get(url)
    if html.status_code != 200:
        # Skip entries for which no page could be downloaded.
        continue
    r = html.content
    # BeautifulSoup parses the HTML into a tree that can be searched.
    soup = BeautifulSoup(r, "html.parser")
    # Only the article body is needed; on Wikipedia pages it has the id "bodyContent".
    extracted = soup.find(id="bodyContent")
    if extracted is None:
        continue
    # Remove HTML comments so that their contents cannot be counted as page text.
    for element in extracted(string=lambda text: isinstance(text, Comment)):
        element.extract()
    # get_text() drops the HTML tags and returns only the text.
    extracted = extracted.get_text()
    textToextract(extracted)
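
Note that the solution above reports a fixed set of words separately for each page, while the question asks for every distinct word, counted across all pages. One possible way to accumulate such totals is sketched below with `collections.Counter`. The names `count_words` and `merge_counts`, the regular expression that defines a "word", and the sample sentences are all assumptions for illustration, not part of the original answer.

```python
import re
from collections import Counter

# Assumed definition of a word: a run of letters, optionally with an apostrophe part.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def count_words(text):
    """Count whole words in the text of one page, ignoring case."""
    return Counter(WORD_RE.findall(text.lower()))


def merge_counts(page_texts):
    """Add up the word counts of several pages into a single Counter."""
    total = Counter()
    for text in page_texts:
        total.update(count_words(text))
    return total


# Stand-ins for the text extracted from two pages.
pages = ["The cat was on the mat.", "Was the cat as fast as that?"]
totals = merge_counts(pages)
print(totals["the"], totals["as"])
```

Matching whole words rather than substrings keeps "as" from being counted inside "was" and "fast". The merged counter can then be sorted by count in ascending order to produce the list the question asks for.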

