Q1. 1. The phone numbers collected from a questionnaire are a mess. Design a regular expression to filter out those numbers that are stored in the standard format "+00-0-0000-0000" from the file called "Q1.txt".
Q1.txt
+61392144979
+61 39214 4778
+61-3-9214-4980
+66(2)51574430
+61-3-9285-7706
Note: Only +61-3-9214-4980 and +61-3-9285-7706 are the valid results.
A regular expression is a text string that combines special characters, called metacharacters, with literals, and is used to match, search and replace text that follows a certain pattern. It is often abbreviated as "regex" or "regexp".
The regular expression that matches phone numbers stored in the standard format "+00-0-0000-0000" is given below.
\+\d{2}\-\d\-\d{4}\-\d{4}
Analysing each part of the above regular expression:
\+ - matches the "+" symbol.
Since "+" is a metacharacter in regular expressions, a backslash (\) is placed in front of it so that it is treated as a literal character.
\d - represents any digit from 0 to 9, i.e. it is equivalent to [0-9].
{2} - a quantifier indicating exactly 2 occurrences of the preceding token, so \d{2} matches exactly two digits after the '+' symbol.
\- - matches a '-' character.
\d - matches a single digit from 0 to 9.
\- - matches a '-' character again.
\d{4} - matches exactly 4 occurrences of digits from 0 to 9.
\- - matches a '-' character again.
\d{4} - matches exactly 4 occurrences of digits from 0 to 9.
So the regex \+\d{2}\-\d\-\d{4}\-\d{4} matches only two of the above phone numbers, namely +61-3-9214-4980 and +61-3-9285-7706.
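As a minimal sketch of how this pattern could be applied, the Python snippet below checks each sample line against the regex. Two things here are assumptions rather than part of the answer above: the helper name filter_valid is hypothetical, and re.fullmatch is used so that a line must contain the number and nothing else (the bare pattern, used with a plain search, would also accept a longer line that merely contains a valid number). The sample lines are hard-coded instead of being read from Q1.txt.

```python
import re

# Pattern from the answer above. The hyphen does not strictly need escaping
# outside a character class, but the escaped form is kept to match the text.
PHONE_RE = re.compile(r"\+\d{2}\-\d\-\d{4}\-\d{4}")


def filter_valid(lines):
    """Return the lines that are entirely in the +00-0-0000-0000 format."""
    stripped = (line.strip() for line in lines)
    return [s for s in stripped if PHONE_RE.fullmatch(s)]


# Contents of Q1.txt as listed in the question.
sample = [
    "+61392144979",
    "+61 39214 4778",
    "+61-3-9214-4980",
    "+66(2)51574430",
    "+61-3-9285-7706",
]

valid = filter_valid(sample)
print(valid)  # ['+61-3-9214-4980', '+61-3-9285-7706']
```

To run it against the actual file, the list could be replaced by an open file object, for example with open("Q1.txt") as f: valid = filter_valid(f), since the helper strips the trailing newline from each line.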
The test results from an online regex testing tool are shown in the image attached below.
The following explains some other metacharacters that are frequently used in regular expressions:
\ - a backslash placed before a special character indicates that the special character is to be treated as a literal.
[...] - when a set of characters is specified within square brackets, any one of those characters can match. For example, [0-9] indicates any digit from 0 to 9.
( ) - parentheses group part of a pattern, which controls the order of evaluation, and capture the matched text for use in replacements.
^ - matches the beginning of a line or string.
[^...] - here the caret is used to negate the set, so any character except those listed will match.
| - the alternation character, or bar, indicates an "or" condition. Either of the patterns separated by | can be used for matching.
* - the asterisk matches zero or more occurrences of the token to its left.
? - the question mark matches zero or one occurrence of the token to its left.
. - the dot matches any single character (by default, any character except a newline).
{} - braces limit repetition by specifying the minimum and maximum number of repetitions as {min,max}.
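The metacharacters listed above can be tried out with a few one-line checks. The following is an illustrative sketch using Python's re module; the example patterns and strings (such as colou?r) are made up for demonstration and are not part of the original question.

```python
import re

# Each entry pairs one metacharacter with a tiny check that should hold.
checks = {
    "backslash makes '+' literal": re.fullmatch(r"\+", "+") is not None,
    "[0-9] matches a digit": re.fullmatch(r"[0-9]", "7") is not None,
    "[^0-9] matches a non-digit": re.fullmatch(r"[^0-9]", "a") is not None,
    "[^0-9] rejects a digit": re.fullmatch(r"[^0-9]", "7") is None,
    "( ) groups, * repeats the group": re.fullmatch(r"(ab)*", "ababab") is not None,
    "* allows zero occurrences": re.fullmatch(r"(ab)*", "") is not None,
    "? allows zero occurrences": re.fullmatch(r"colou?r", "color") is not None,
    "? allows one occurrence": re.fullmatch(r"colou?r", "colour") is not None,
    "? rejects two occurrences": re.fullmatch(r"colou?r", "colouur") is None,
    "| matches either alternative": re.fullmatch(r"cat|dog", "dog") is not None,
    ". matches any single character": re.fullmatch(r"c.t", "cut") is not None,
    "^ anchors at the start": re.search(r"^ab", "abc") is not None,
    "^ rejects a later position": re.search(r"^bc", "abc") is None,
    "{2,4} accepts 3 digits": re.fullmatch(r"\d{2,4}", "123") is not None,
    "{2,4} rejects 5 digits": re.fullmatch(r"\d{2,4}", "12345") is None,
}

for name, ok in checks.items():
    print(f"{name}: {ok}")
```

Every line should print True; changing a pattern or test string and re-running is a quick way to see how each metacharacter behaves.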