Deliverables
There is one deliverable for this assignment
Make sure the script obeys all the rules in the Script Requirements page.
What is Special About This Assignment
This homework assignment is going to be different from the other assignments.
You will have to do very little coding for this assignment.
Instead, I will supply you with a function that tests the regular expressions stored in the following variables:
regex_1
regex_2
regex_3
regex_4
regex_5
regex_6
regex_7
regex_8
regex_9
regex_10
Code for This Assignment
You must copy the code below into your script.

import re

def test_regular_expression(regex, test_string) :
    pattern = re.compile(r'' + regex )
    match = pattern.search(test_string)
    if match :
        try :
            return match.group(1)
        except :
            print('Match found but no substring returned')
            return ''
    else:
        print(regex, 'does not match', test_string)
        return ''

line_1 = 'Mar xxxxx16xxxxxxx 11:58:13 xxxxxxxxxxxxxxx 65.96.149.57 port 60695 Wed'
line_2 = ' 205.236.184.32 09 Feb 2014:00:03:21 +0000 12_class_notes_it117.html HTTP/1.1" 200 56810323'

regex_1 =
regex_2 =
regex_3 =
regex_4 =
regex_5 =
regex_6 =
regex_7 =
regex_8 =
regex_9 =
regex_10 =

print('regex_1', regex_1, '\t returned ', test_regular_expression(regex_1, line_1))
print('regex_2', regex_2, '\t returned ', test_regular_expression(regex_2, line_1))
print('regex_3', regex_3, '\t returned ', test_regular_expression(regex_3, line_1))
print('regex_4', regex_4, '\t returned ', test_regular_expression(regex_4, line_1))
print('regex_5', regex_5, '\t returned ', test_regular_expression(regex_5, line_1))
print('regex_6', regex_6, '\t returned ', test_regular_expression(regex_6, line_1))
print('regex_7', regex_7, '\t returned ', test_regular_expression(regex_7, line_2))
print('regex_8', regex_8, '\t returned ', test_regular_expression(regex_8, line_2))
print('regex_9', regex_9, '\t returned ', test_regular_expression(regex_9, line_2))
print('regex_10', regex_10,'\t returned ', test_regular_expression(regex_10,line_2))
Specification
Define the variables below.
Each variable should hold a regular expression that returns the value given below when run on the string listed.
Variable | Value Returned | String |
---|---|---|
regex_1 | Month name | line_1 |
regex_2 | Day number | line_1 |
regex_3 | Hours, minutes, seconds | line_1 |
regex_4 | IP address | line_1 |
regex_5 | Port number | line_1 |
regex_6 | Day of the week | line_1 |
regex_7 | Two digit day number | line_2 |
regex_8 | Month | line_2 |
regex_9 | Year | line_2 |
regex_10 | The filename with extension | line_2 |
You MAY NOT use ordinary characters in your regular expression values.
For example you cannot use "html" when matching the filename.
You MAY use the period, . , as an ordinary character.
Don't forget that if you want to use a meta-character as a literal, like . , you must escape it with a \.
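To see why the escape matters, here is a minimal sketch comparing an unescaped and an escaped period. The variable names and the sample string (borrowed from the filename in line_2) are only for illustration.

```python
import re

sample = '12_class_notes_it117.html'

# Unescaped: . is a meta-character and matches any character,
# so findall returns one match per character in the sample.
any_char = re.findall(r'.', sample)

# Escaped: \. matches only a literal period.
literal_dot = re.findall(r'\.', sample)

print(len(any_char), 'matches for .')
print(literal_dot, 'matches for \\.')
```

The first pattern matches every character in the string, while the second finds only the single period before the extension.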
Script for this assignment
Open a text editor and create the file hw7.py.
You can use the editor built into IDLE or a program like Sublime.
Suggestions
Write this program in a step-by-step fashion using the technique of incremental development.
In other words, write a bit of code, test it, make whatever changes you need to get it working, and go on to the next step.
Put a # before each print statement in the test code at the bottom of the file, except for the one that prints regex_1.
For each regular expression, don't write the entire expression all at once.
Instead build it up little by little testing as you go.
When you get this regular expression working, remove the # from the next statement and repeat the procedure.
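As an illustration of building an expression little by little, the sketch below grows a pattern for the IP address in line_1 one piece at a time and prints what each partial pattern matches. It calls re.search directly and prints the whole match, which is just a convenient way to inspect progress and is not part of the supplied test function; the names steps and partial_matches are made up for this example.

```python
import re

line_1 = 'Mar xxxxx16xxxxxxx 11:58:13 xxxxxxxxxxxxxxx 65.96.149.57 port 60695 Wed'

# Each entry extends the previous one by a small piece.
steps = [
    r'\d+',                  # any run of digits
    r'\d+\.',                # digits followed by a literal period
    r'\d+\.\d+',             # two dotted groups
    r'\d+\.\d+\.\d+\.\d+',   # four dotted groups
]

partial_matches = []
for pattern in steps:
    match = re.search(pattern, line_1)
    found = match.group(0) if match else None
    partial_matches.append(found)
    print(pattern, '->', found)
```

Watching the match change after each addition makes it easy to spot the step at which a pattern stops doing what you expect.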
To write a regex, we need to understand some basic notation.
\w - stands for any alphanumeric character, i.e. a lowercase or uppercase letter or a digit (in Python it also matches the underscore).
\d - stands for any single digit.
. - means any character except a newline.
\ - used to escape a character, which is useful when we want to match a character that otherwise has a special meaning. For example, a bare . matches any character, so to match a full stop we have to write \. i.e. \ followed by .
[] - we can list multiple characters inside square brackets; if any one of them matches, the match is successful, but only one character from the input is matched.
- - can be used inside [] to indicate a range of consecutive characters, i.e. a-z means any character from a to z.
+ - any regular expression followed by + means one or more occurrences of that expression, i.e. \d+ means one or more digits in a row.
{n} - a number inside curly braces asks for exactly that many repetitions of the preceding expression, i.e. \d{3} looks for 3 digits.
Python's re package provides many more, but for our case these are sufficient.
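The notation above can be tried out on a short string. This is a small sketch; the string text and the dictionary first_match are made up for the demonstration, and each entry stores the first match that re.search finds.

```python
import re

text = 'Feb 09 2014 port 60695'

patterns = [r'\w', r'\d', r'[aeiou]', r'[0-9]', r'[A-Za-z]+', r'\d+', r'\d{3}']

# Map each pattern to the first piece of text it matches.
first_match = {p: re.search(p, text).group(0) for p in patterns}

for pattern, found in first_match.items():
    print(pattern, '->', found)
```

Note how \d returns a single digit, \d+ returns the whole first run of digits, and \d{3} skips the two-digit run because it needs three digits in a row.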
With this, let's look at the program. Each regular expression is described in more detail in the comments next to it in the code.
=============
import re

def test_regular_expression(regex, test_string) :
    pattern = re.compile(r'' + regex )
    match = pattern.search(test_string)
    if match :
        try :
            # group(0) is the whole match; the assignment's original code uses group(1)
            return match.group(0)
        except :
            print('Match found but no substring returned')
            return ''
    else:
        print(regex, 'does not match', test_string)
        return ''

line_1 = 'Mar xxxxx16xxxxxxx 11:58:13 xxxxxxxxxxxxxxx 65.96.149.57 port 60695 Wed'
line_2 = ' 205.236.184.32 09 Feb 2014:00:03:21 +0000 12_class_notes_it117.html HTTP/1.1" 200 56810323'

regex_1 = "^[A-Za-z]{3}"
# We need three letters at the beginning of the string. ^ anchors the
# pattern that follows to the start of the string, so the same pattern
# occurring in the middle of the string will not match.
# [A-Za-z] matches any single letter from A-Z or a-z, and {3} asks for
# three of them in a row. Altogether the regular expression looks for
# three letters at the beginning of the string, which here returns Mar.

regex_2 = r"\d\d"
# The day of the month is the first pair of digits in the string. Each \d
# stands for one digit, so \d\d matches two consecutive digits. Because
# pattern.search() returns only the first match, and match.group(0) is
# the whole matched text, we get the day of the month.

regex_3 = r"\d\d:\d\d:\d\d"
# The time is in hh:mm:ss format, so we specify two digits, a colon, two
# more digits, a colon, and two more digits. This returns the time.

regex_4 = r"\d+\.\d+\.\d+\.\d+"
# Each part of an IP address can be 1, 2 or 3 digits, so we use \d+,
# where + means one or more occurrences of \d. The parts are separated
# by dots. As mentioned earlier, . has a special meaning in a regular
# expression, so to match a literal dot we escape it with \.
# We repeat this for all four parts of the IP address.

regex_5 = r"(?<=port )\d+"
# The port number is one or more digits, so we use \d+. The catch is that
# \d+ on its own would match the first number in the string. We know the
# port number is the number that comes after the word port, so we first
# look for "port " and then match the number that follows. We do this
# with (?<=xxxxx), where xxxxx is the prefix to look for. Putting
# (?<=port ) before \d+ ensures the match happens only for a number
# preceded by "port ". The prefix is used only to locate the number and
# is not part of the final match, so the returned value is just the
# numeric port number.

regex_6 = r"\w+$"
# \w, as mentioned earlier, matches any alphanumeric character. $ means
# the match must be followed by the end of the line, so this returns the
# run of one or more alphanumeric characters at the end of the line.
# There are many alphanumeric runs in the line, but because of the $ only
# the last one matches, and that is the day of the week.

regex_7 = r"\d\d(?= \w\w\w)"
# In line_2 the day of the month is two digits, but not just any two
# digits. To pick the right pair we look for two digits followed by a
# space and three alphanumeric characters (the month name), so this time
# we put (?=xxx) after our pattern. This returns only a match that is
# followed by xxx, here a space and \w\w\w, without including it in the
# result. That gives the day of the month even though other pairs of
# digits occur before it.

regex_8 = "[A-Za-z]{3}"
# This pattern looks for three letters in a row. The first such run in
# line_2 is the month name.

regex_9 = r"\d+(?=:)"
# For the year we look for a number followed by a colon, so again we use
# a lookahead, (?=:), to require that a : follows the match. The first
# number followed by a colon in line_2 is the year.

regex_10 = r"\w+\.[A-Za-z]+"
# For the filename we match an alphanumeric string, followed by a . (dot
# character), followed by a string of letters for the extension. \w+
# matches one or more alphanumeric characters (the file name; \w also
# matches the underscores in it), \. matches the literal dot, and
# [A-Za-z]+ matches one or more letters, thus returning the filename.

print('regex_1', regex_1, '\t\t returned ', test_regular_expression(regex_1, line_1))
print('regex_2', regex_2, '\t\t\t returned ', test_regular_expression(regex_2, line_1))
print('regex_3', regex_3, '\t\t returned ', test_regular_expression(regex_3, line_1))
print('regex_4', regex_4, '\t returned ', test_regular_expression(regex_4, line_1))
print('regex_5', regex_5, '\t\t returned ', test_regular_expression(regex_5, line_1))
print('regex_6', regex_6, '\t\t\t returned ', test_regular_expression(regex_6, line_1))
print('regex_7', regex_7, '\t returned ', test_regular_expression(regex_7, line_2))
print('regex_8', regex_8, '\t\t returned ', test_regular_expression(regex_8, line_2))
print('regex_9', regex_9, '\t\t returned ', test_regular_expression(regex_9, line_2))
print('regex_10', regex_10,'\t returned ', test_regular_expression(regex_10,line_2))
======File ends ======
Output
=======
regex_1 ^[A-Za-z]{3} returned Mar
regex_2 \d\d returned 16
regex_3 \d\d:\d\d:\d\d returned 11:58:13
regex_4 \d+\.\d+\.\d+\.\d+ returned 65.96.149.57
regex_5 (?<=port )\d+ returned 60695
regex_6 \w+$ returned Wed
regex_7 \d\d(?= \w\w\w) returned 09
regex_8 [A-Za-z]{3} returned Feb
regex_9 \d+(?=:) returned 2014
regex_10 \w+\.[A-Za-z]+ returned 12_class_notes_it117.html