Please answer in Python. The prompt was:
To locate a link on a webpage, we can search for the following three things in order:
- First, look for the three-character string '<a '.
- Next, look for the '>' that closes this first part. These two markers enclose the URL.
- Finally, look for the three characters that close the element, '</a', which mark the end of the link text.
The anchor has two parts we are interested in: the URL, and the link text.
Write a function that takes a URL to a webpage and finds the **first link** on the page. Your function should return a tuple holding two strings: the URL and the link text.
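The three steps above map directly onto `str.find`. Here is a minimal sketch, assuming the page text is already held in one string and the `href` value sits inside the opening tag. The name `first_link_in_text` is hypothetical and not part of the assignment:

```python
def first_link_in_text(html):
    """Return (url, link_text) for the first anchor in html, or None."""
    # Step 1: find the opening of the anchor tag.
    start = html.find('<a ')
    if start == -1:
        return None
    # Step 2: find the '>' that closes the opening tag.
    tag_end = html.find('>', start)
    if tag_end == -1:
        return None
    # Step 3: find the start of the closing tag.
    close = html.find('</a', tag_end)
    if close == -1:
        return None

    tag = html[start:tag_end]        # e.g. <a href="page.html"
    text = html[tag_end + 1:close]   # everything up to '</a'

    # Pull the URL out of the href attribute, if there is one.
    url = ''
    href = tag.find('href=')
    if href != -1:
        value = tag[href + len('href='):]
        quote = value[:1]
        if quote in ('"', "'"):
            # Quoted value: read up to the matching quote.
            end = value.find(quote, 1)
            url = value[1:end] if end != -1 else value[1:]
        else:
            # Unquoted value: stop at the first whitespace.
            parts = value.split()
            url = parts[0] if parts else ''
    return (url, text)


sample = '<p>See <a href="https://example.com/docs">the docs</a> and <a href="b.html">more</a>.</p>'
print(first_link_in_text(sample))  # ('https://example.com/docs', 'the docs')
```

If the page is held as a list of lines, joining them first with `'\n'.join(lines)` lets the same search find a link that is split across several lines.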
In a prior assignment, we already wrote a program that fetches the contents of a website and stores them in a list of lines:
import urllib.request
import sys

def fetch_contents(website):
    "Return the contents of this webpage as a list of lines"
    try:
        res = []
        with urllib.request.urlopen(website) as f:
            text = f.read().decode('utf-8')
            # Break the page into lines
            text = text.split('\n')
            for line in text:
                res.append(line)
        return res
    except urllib.error.URLError as e:
        print(e.reason)
        return []

if (len(sys.argv) != 2):
    print(f"Usage: python read_url.py <website>")
else:
    lst = fetch_contents(sys.argv[1])
    # Now display the contents
    for line in lst:
        print(line)
The hard part is done. Once I have all of the text, I now need to parse through it and look for what was prompted above. We are not allowed to use Beautiful Soup or any of the other libraries, which I wasn't aware of when I started working on it. How can I manually walk through the list, find and extract the first URL, and return it as a tuple (url, link text)?
# -*- coding: utf-8 -*-
"""
Created on Sun Oct 25 10:51:55 2020
@author: Unknown
"""
import urllib.request
import sys
import re
# Find uses the built-in re module to search text for the pattern.
# It returns the captured groups of the first match as a tuple,
# or None when the pattern does not match.
def Find(pat, text):
    match = re.search(pat, text)
    if match:
        foundtext = match.groups()
        print('The matching line is ', text)
    else:
        # No match found
        foundtext = None
    return foundtext
def fetch_contents(website):
    "Return the contents of this webpage as a list of lines"
    try:
        res = []
        with urllib.request.urlopen(website) as f:
            text = f.read().decode('utf-8')
            # Break the page into lines
            text = text.split('\n')
            for line in text:
                res.append(line)
        return res
    except urllib.error.URLError as e:
        print(e.reason)
        return []
# firstlink_tuple calls Find to search the line for the pattern
def firstlink_tuple(pat, line):
    matchobj = Find(pat, line)
    return matchobj
if len(sys.argv) != 2:
    print("Usage: python read_url.py <website>")
else:
    lst = fetch_contents(sys.argv[1])
    # Regex pattern for an anchor. Each '(' ... ')' pair marks a group
    # whose text is returned: group 1 is the URL, group 2 the link text.
    # The lazy '.*?' stops at the first '</a' instead of the last one.
    pat = r'''<a\s[^>]*?href=["']?([^"'\s>]+)["']?[^>]*>(.*?)</a'''
    # Search line by line and stop at the first link found
    for line in lst:
        matchobj = firstlink_tuple(pat, line)
        if matchobj:
            print('The matching link and text in tuple is : ', matchobj)
            break
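Whether the groups in a pattern of this shape are greedy (`.*`) or lazy (`.*?`) matters as soon as a line holds more than one link. Here is a small sketch, using a made-up sample line, that compares the two:

```python
import re

# Made-up sample line with two links on it.
line = '<a href="a.html">first</a> and <a href="b.html">second</a>'

# Greedy groups grab as much as possible, so the match runs into the second link.
greedy = re.search(r'<a\shref=(.*)>(.*)</a', line)
# Lazy groups stop at the first '>' and the first '</a'.
lazy = re.search(r'<a\shref=(.*?)>(.*?)</a', line)

print(greedy.groups())  # ('"a.html">first</a> and <a href="b.html"', 'second')
print(lazy.groups())    # ('"a.html"', 'first')
```

Note that the lazy version still keeps the quotation marks around the URL. Stripping them, or matching them outside the group, is a separate step.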