Question

Write a Python program for a big-data processing task: finding the most frequently used words on Wikipedia pages.

The program generates a list of the distinct words used in the Wikipedia pages, together with the number of occurrences of each word on those pages. The words are sorted by number of occurrences in ascending order. The following is a sample of the output generated for 4 Wikipedia pages.

126 that
128 by
133 as
149 or
160 for
164 is
189 on
191 from
345 to
375 advertising
443 a
473 and
480 in
677 of
1080 the
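
As a rough illustration of the output format above, the following sketch counts the words in a short string and prints them in ascending order of frequency. The function names `count_words` and `format_ascending`, the regular expression used to define a "word", and the sample text are assumptions made for this example; the assignment does not specify them.

```python
import re
from collections import Counter

def count_words(text):
    """Return a Counter mapping each lowercase word in text to its frequency."""
    # Assumption: a "word" is a run of ASCII letters; the assignment does not define one.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def format_ascending(counts):
    """Return 'count word' lines sorted by count, smallest first."""
    # Ties are broken alphabetically so the output is deterministic.
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return [f"{count} {word}" for word, count in ordered]

sample = "The cat and the dog and the bird"
for line in format_ascending(count_words(sample)):
    print(line)
```

Counts from several pages can be combined by calling `update` on one shared `Counter` before formatting.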

Since Wikipedia has a huge number of pages, it is not realistic to analyze all of them in a short time on one machine. In this project, you need to analyze the pages for all Wikipedia entries that consist of two capital letters. For example, the Wikipedia page for the entry "AC" is https://en.wikipedia.org/wiki/AC . Use the urllib or urllib2 library to download a page.
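
One way to enumerate the entries and fetch a page with the standard library is sketched below. It assumes Python 3, where the functionality of `urllib2` lives in `urllib.request`. The helper names `two_letter_entries`, `page_url`, and `download_page`, and the User-Agent string, are placeholders chosen for this example.

```python
import itertools
import string
import urllib.request

BASE_URL = "https://en.wikipedia.org/wiki/"

def two_letter_entries():
    """Return all 26 * 26 two-capital-letter entry names, from 'AA' to 'ZZ'."""
    pairs = itertools.product(string.ascii_uppercase, repeat=2)
    return ["".join(pair) for pair in pairs]

def page_url(entry):
    """Build the Wikipedia URL for an entry such as 'AC'."""
    return BASE_URL + entry

def download_page(entry, timeout=30):
    """Download the page for one entry and return its HTML as a string."""
    # Placeholder User-Agent (an assumption): replace it with a value
    # that identifies your own script.
    request = urllib.request.Request(
        page_url(entry), headers={"User-Agent": "word-count-homework/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")

entries = two_letter_entries()
print(len(entries), entries[:3], page_url("AC"))
```

Calling `download_page("AC")` needs network access, so it is left out of the example run. If an entry has no page, `urlopen` raises `urllib.error.HTTPError`, which the caller can catch to skip that entry.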


An HTML page contains HTML tags, which should be removed before the analysis. Use the BeautifulSoup library to convert the page from HTML to plain text.
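
A minimal sketch of that conversion is shown below, using a small inline HTML string instead of a downloaded page. It assumes the `beautifulsoup4` package is installed; the helper name `html_to_text` and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup

def html_to_text(html):
    """Strip HTML tags and return the remaining text with single spaces."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements so their contents are not treated as page text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join text fragments with a space so words from adjacent tags do not run together,
    # then collapse repeated whitespace.
    return " ".join(soup.get_text(" ").split())

sample_html = (
    "<html><body><h1>AC</h1>"
    "<p>AC may refer to <b>alternating</b> current.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample_html))
```

Passing a separator to `get_text` is a design choice here: without it, text from neighbouring tags can be glued into a single token and miscounted as one word.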

Solutions

Expert Solution

from collections import Counter
import itertools
import re
import string

from bs4 import BeautifulSoup, Comment
import requests

# To install the third-party requirements: pip3 install beautifulsoup4 requests

def textToextract(passableValue):
    # Split the page text into lowercase words and count every distinct word.
    # Counting whole words avoids substring matches such as "is" inside "this".
    words = re.findall(r"[a-z]+", passableValue.lower())
    return Counter(words)

# itertools.product includes doubled letters such as "AA", which
# itertools.permutations would skip. The result has 26 * 26 = 676 entries:
# ["AA", "AB", ..., "ZZ"].
combinations = list(map("".join, itertools.product(string.ascii_uppercase, repeat=2)))

# Placeholder User-Agent; replace it with a value that identifies your script.
headers = {"User-Agent": "word-count-homework/0.1"}

# Running total of word counts across all pages.
totals = Counter()

for i in combinations:
    url = "https://en.wikipedia.org/wiki/" + i
    # Send a GET request for the page; skip entries that fail to load.
    html = requests.get(url, headers=headers, timeout=30)
    if html.status_code != 200:
        continue
    r = html.content
    # BeautifulSoup parses the HTML so that the text can be extracted.
    soup = BeautifulSoup(r, "html.parser")
    # Only the article body is needed; it has the id "bodyContent".
    extracted = soup.find(id="bodyContent")
    if extracted is None:
        continue
    # Remove HTML comments so that their contents are not counted as words.
    for element in extracted(string=lambda text: isinstance(text, Comment)):
        element.extract()
    # get_text() drops the HTML tags and returns only the text.
    extracted = extracted.get_text(" ")
    totals.update(textToextract(extracted))

# Print the words sorted by number of occurrences in ascending order.
for word, count in sorted(totals.items(), key=lambda item: item[1]):
    print(count, word)

