In: Computer Science
Write a Python program to carry out a big-data processing task: finding the most frequently used words on Wikipedia pages.
The program generates a list of the distinct words used on the Wikipedia pages and the number of occurrences of each word on these pages, sorted by the number of occurrences in ascending order. The following is a sample of the output generated for 4 Wikipedia pages; a minimal counting sketch follows the sample.
126 that
128 by
133 as
149 or
160 for
164 is
189 on
191 from
345 to
375 advertising
443 a
473 and
480 in
677 of
1080 the
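For the counting-and-sorting step on its own, here is a minimal sketch using collections.Counter; the function name print_word_counts and the sample text are illustrative, not part of the assignment:

from collections import Counter

def print_word_counts(text):
    # Count every distinct word, then print one "count word" line per
    # word in ascending order of count, matching the sample output above.
    counts = Counter(text.lower().split())
    for word, n in sorted(counts.items(), key=lambda kv: kv[1]):
        print(n, word)

print_word_counts("the cat and the dog and the bird")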
Since there are a huge number of pages on Wikipedia, it is not realistic to analyze all of them in a short time on one machine. In this project, you need to analyze all the pages for the Wikipedia entries whose titles are two capital letters. For example, the Wikipedia page for the entry "AC" is https://en.wikipedia.org/wiki/AC . Use the urllib or urllib2 library to download a page.
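Note that urllib2 exists only in Python 2; in Python 3 it was merged into urllib.request. A minimal download sketch (the User-Agent header value is an assumption, added because Wikipedia may refuse requests that do not send one):

from urllib.request import Request, urlopen

url = "https://en.wikipedia.org/wiki/AC"
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})  # assumed header value
html = urlopen(req).read().decode("utf-8")  # raw HTML of the page as a string
print(html[:80])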
An HTML page contains HTML tags, which should be removed before the analysis. Use the BeautifulSoup library to convert the page from HTML to plain text.
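A minimal sketch of that conversion, using an inline HTML snippet so the example is self-contained:

from bs4 import BeautifulSoup

html = "<html><body><p>AC is an abbreviation.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())  # prints: AC is an abbreviation.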
from bs4 import BeautifulSoup, Comment
import itertools
import requests
import urllib3
# to install these requirements: pip3 install bs4 urllib3 requests

def textToextract(passableValue):
    # Words to count; add as many search strings as needed to this dictionary.
    textDict = {"that": 0, "by": 0, "as": 0, "or": 0, "for": 0,
                "is": 0, "on": 0, "from": 0, "to": 0}
    # Split the extracted text into words so that a key such as "is"
    # is not also counted inside longer words like "this".
    words = passableValue.lower().split()
    # Count each key in the text we got from get_text().
    for key in textDict:
        textDict[key] = words.count(key)
    print(textDict)

# Disable urllib3's InsecureRequestWarning so unverified-HTTPS warnings
# do not interrupt the program's output.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

alphabets = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
             "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"]
# itertools.product generates every ordered pair of capital letters,
# including doubles such as "AA"; "".join merges each pair into one
# two-letter entry, so the resulting list is ["AA", "AB", ..., "ZZ"].
combinations = list(map("".join, itertools.product(alphabets, repeat=2)))

for i in combinations:
    url = "https://en.wikipedia.org/wiki/" + i
    # Send a GET request and keep the raw HTML of the page.
    html = requests.get(url)
    r = html.content
    # Parse the HTML so it can be navigated.
    soup = BeautifulSoup(r, "html.parser")
    # Only the article body of each page is needed; on Wikipedia it
    # carries the id "bodyContent".
    extracted = soup.find(id="bodyContent")
    # Remove HTML comments; this is not strictly required, but without it
    # the counts would be inflated by matches inside comment text.
    for element in extracted.find_all(string=lambda text: isinstance(text, Comment)):
        element.extract()
    # Strip the HTML tags, keeping only the visible text.
    extracted = extracted.get_text()
    textToextract(extracted)
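The function above counts only a fixed set of words, while the task asks for every distinct word sorted in ascending order of occurrences. One way to adapt it, sketched here with collections.Counter as a drop-in variant of textToextract (not the original answer's code):

from collections import Counter

total = Counter()

def textToextract(passableValue):
    # Variant: accumulate counts for every distinct word across pages
    # instead of counting a fixed dictionary of search strings.
    total.update(passableValue.lower().split())

# After the main loop finishes, print the totals in ascending order,
# as in the counting sketch near the top:
for word, n in sorted(total.items(), key=lambda kv: kv[1]):
    print(n, word)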