Subject: Computer Science
Use Python for this assignment.
1. Copy the file web.py from class (or class notes) into your working folder.
2. Include the following imports at the top of your module (these should be sufficient):
from web import LinkCollector # make sure you did 1
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.parse import urljoin
from urllib.error import URLError
3. Implement a class ImageCollector. This will be similar to LinkCollector: given a string containing the HTML for a web page, it collects and is able to supply the (absolute) URLs of the images on that page. They should be collected in a set that can be retrieved with the method getImages (the order of images will vary). Sample usage:
>>> ic = ImageCollector('http://www2.warnerbros.com/spacejam/movie/jam.htm')
>>> ic.feed(urlopen('http://www2.warnerbros.com/spacejam/movie/jam.htm').read().decode())
>>> ic.getImages()
{'http://www2.warnerbros.com/spacejam/movie/img/p-sitemap.gif', …,
 'http://www2.warnerbros.com/spacejam/movie/img/p-jamcentral.gif'}
>>> ic = ImageCollector('http://www.kli.org/')
>>> ic.feed(urlopen('http://www.kli.org/').read().decode())
>>> ic.getImages()
{'http://www.kli.org/wp-content/uploads/2014/03/KLIbutton.gif',
 'http://www.kli.org/wp-content/uploads/2014/03/KLIlogo.gif'}
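Since img src attributes on a page are often relative, urljoin (already in the imports above) is the natural way to turn them into absolute URLs. A quick illustration of its behavior, using the assignment's sample page as the base:

```python
from urllib.parse import urljoin

# A relative src is resolved against the page's own url...
print(urljoin('http://www2.warnerbros.com/spacejam/movie/jam.htm',
              'img/p-sitemap.gif'))
# -> http://www2.warnerbros.com/spacejam/movie/img/p-sitemap.gif

# ...while an already-absolute src comes back unchanged.
print(urljoin('http://www2.warnerbros.com/spacejam/movie/jam.htm',
              'http://www.kli.org/logo.gif'))
# -> http://www.kli.org/logo.gif
```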
4. Implement a class ImageCrawler that inherits from the Crawler developed in class and will both crawl links and collect images. This is very easy by inheriting from and extending the Crawler class. You will need to collect images in a set. Hint: what does it mean to extend? Implementation details:
a. You must inherit from Crawler. Make sure that the module web.py is in your working folder and make sure that you import Crawler from the web module.
b. __init__ – extends Crawler's __init__ by adding a set attribute that will be used to store images
c. crawl – extends Crawler's crawl by creating an ImageCollector, opening the URL, and then collecting any images from the URL into the set of images being stored. I recommend that you collect the images before you call Crawler's crawl method.
d. getImages – returns the set of images collected
>>> c = ImageCrawler()
>>> c.crawl('http://www2.warnerbros.com/spacejam/movie/jam.htm',1,True)
>>> c.getImages()
{'http://www2.warnerbros.com/spacejam/movie/img/p-lunartunes.gif',
…
'http://www2.warnerbros.com/spacejam/movie/cmp/pressbox/img/r-blue.gif'}
>>> c = ImageCrawler()
>>> c.crawl('http://www.pmichaud.com/toast/',1,True)
>>> c.getImages()
{'http://www.pmichaud.com/toast/toast-6a.gif',
'http://www.pmichaud.com/toast/toast-2c.gif',
'http://www.pmichaud.com/toast/toast-4c.gif',
'http://www.pmichaud.com/toast/toast-6c.gif',
'http://www.pmichaud.com/toast/ptart-1c.gif',
'http://www.pmichaud.com/toast/toast-7b.gif',
'http://www.pmichaud.com/toast/krnbo24.gif',
'http://www.pmichaud.com/toast/toast-1b.gif',
'http://www.pmichaud.com/toast/toast-3c.gif',
'http://www.pmichaud.com/toast/toast-5c.gif',
'http://www.pmichaud.com/toast/toast-8a.gif'}
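On the hint about extending: a subclass extends a base class when its methods do the base class's work plus something new, typically by calling the base-class method explicitly. Since the real Crawler lives in web.py and is not shown here, the sketch below uses a hypothetical stub in its place purely to illustrate the pattern:

```python
# Hypothetical stand-in for web.py's Crawler, just to show the pattern.
class Crawler:
    def __init__(self):
        self.visited = set()
    def crawl(self, url, depth, relativeOnly):
        self.visited.add(url)   # the real Crawler would follow links here

class ImageCrawler(Crawler):
    def __init__(self):
        Crawler.__init__(self)      # do everything Crawler's __init__ does...
        self.imageURLs = set()      # ...plus add new state for images
    def crawl(self, url, depth, relativeOnly):
        # new work first (here a fake image, standing in for ImageCollector)
        self.imageURLs.add(url + '/fake.gif')
        # then delegate to the base class so crawling still happens
        Crawler.crawl(self, url, depth, relativeOnly)

c = ImageCrawler()
c.crawl('http://example.com', 1, True)
print(c.visited)    # inherited behavior still ran
print(c.imageURLs)  # extended behavior ran too
```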
5. Implement a function scrapeImages: given a URL, a filename, a depth, and a Boolean (relativeOnly), this function starts at url, crawls to depth, collects images, and then writes an HTML document containing the images to filename. This is not hard; use the ImageCrawler from the prior step. For example:
>>> scrapeImages('http://www2.warnerbros.com/spacejam/movie/jam.htm','jam.html',1,True)
>>> open('jam.html').read().count('img')
62
>>> scrapeImages('http://www.pmichaud.com/toast/','toast.html',1,True)
>>> open('toast.html').read().count('img')
11
link to web.py https://www.dropbox.com/s/obiyi7lnwc3rw0d/web.py?dl=0
Solution: See the code below:
-----------------------------------------------
# imports
from web import LinkCollector
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.parse import urljoin
from urllib.error import URLError

## ImageCollector class
class ImageCollector(HTMLParser):
    '''Given the url of a web page, ImageCollector collects the
    absolute urls of all images found in that page's html.
    Images are collected in a set, retrievable via getImages().'''
    def __init__(self, url):
        HTMLParser.__init__(self)
        self.url = url
        self.imageURLs = set()
    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for attr, val in attrs:
                if attr == 'src':               # collect
                    if val[:4] == 'http':       # already absolute
                        self.imageURLs.add(val)
                    else:                       # relative: make absolute
                        self.imageURLs.add(urljoin(self.url, val))
    def getImages(self):
        return self.imageURLs
## ImageCrawler class
from web import Crawler

class ImageCrawler(Crawler):
    '''Crawls web pages and collects images.'''
    def __init__(self):
        Crawler.__init__(self)       # extend Crawler's __init__ ...
        self.imageURLs = set()       # ... with a set to store images
    def crawl(self, url, depth, relativeOnly):
        # collect this page's images before crawling onward
        ic = ImageCollector(url)
        ic.feed(urlopen(url).read().decode())
        self.imageURLs.update(ic.getImages())
        # then call Crawler's crawl (check web.py for its exact signature)
        Crawler.crawl(self, url, depth, relativeOnly)
    def getImages(self):
        return self.imageURLs
## scrapeImages function
def scrapeImages(url, filename, depth, relativeOnly):
    '''Create a local html file named filename containing
    all images found by crawling from url to the given depth.'''
    # crawl for images
    ic = ImageCrawler()
    ic.crawl(url, depth, relativeOnly)
    # write them to a file
    file = open(filename, 'w')
    file.write('<html><body>\n')
    for image in ic.getImages():
        file.write('<img src="{}"><br>\n'.format(image))
    file.write('</body></html>')
    file.close()
-----------------------------------------
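The ImageCollector's parsing step can be exercised offline by feeding HTMLParser a string directly instead of a live page. The sketch below uses a throwaway class (SrcPrinter is not part of the assignment) just to show the (attr, value) pairs that handle_starttag receives:

```python
from html.parser import HTMLParser

class SrcPrinter(HTMLParser):
    # handle_starttag receives the tag name and a list of (attr, value) pairs
    def __init__(self):
        HTMLParser.__init__(self)
        self.srcs = []
    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for attr, val in attrs:
                if attr == 'src':
                    self.srcs.append(val)

p = SrcPrinter()
p.feed('<html><body><img src="img/a.gif">'
       '<img src="http://x.org/b.gif"></body></html>')
print(p.srcs)  # ['img/a.gif', 'http://x.org/b.gif']
```

Note that the first src is relative, which is exactly why ImageCollector needs the page url and urljoin to produce absolute URLs.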
Note: the Crawler class code in web.py is not shown here, so its output is not displayed. Modify and use the ImageCrawler class accordingly.
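The file-writing half of scrapeImages can be checked without any network access by isolating it in a helper. writeImagePage below is a hypothetical name for illustration only, not part of the assignment:

```python
# Hypothetical helper isolating just the file-writing step of scrapeImages,
# so it can be verified offline with a hand-made set of URLs.
def writeImagePage(images, filename):
    with open(filename, 'w') as f:
        f.write('<html><body>\n')
        for image in sorted(images):   # sorted for a reproducible file
            f.write('<img src="{}"><br>\n'.format(image))
        f.write('</body></html>')

writeImagePage({'http://example.com/a.gif', 'http://example.com/b.gif'},
               'demo.html')
# each image tag contributes one 'img' substring
print(open('demo.html').read().count('img'))  # 2
```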