Understanding RegEx to identify patterns of data.
Create 10 regular expressions to filter a specific data set and explain what they each do.
RegEx notes
========
A regular expression is a text string that combines literal characters with special characters called metacharacters. It is used to match, search for, and replace text that follows a certain pattern. Regular expressions are abbreviated as “regex” or “regexp”. They are used in software tools as well as in programming languages.
The important metacharacters used in regex and their meanings are listed below.
\ - A backslash placed before a special character indicates that the special character is to be treated as a literal.
[...] - When a set of characters is specified within square brackets, any one of those characters can match. For example, [0-9] matches any single digit from 0 to 9.
( ) - Parentheses group part of a pattern, which controls the order of evaluation and captures the matched text for use in replacements.
^ - The caret usually matches the beginning of a line or string.
[^...] - Here the caret negates the set: any character not listed inside the brackets matches.
| - The alternation character, or bar, expresses an “or” condition. Either of the patterns separated by | can match.
* - The asterisk matches zero or more occurrences of the item to its left.
? - The question mark matches zero or one occurrence of the item to its left.
. - The dot matches any single character (in most engines, any character except a newline by default).
{} - Braces limit repetition by specifying the minimum and maximum number of repetitions as {min,max}.
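The metacharacters above can be tried out with a short script. This is a minimal sketch using Python's standard `re` module; the helper name `matches` and the sample patterns and strings are illustrative assumptions, not part of the original notes.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of `text` matches `pattern` (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Each entry: (pattern, sample text, what the pattern illustrates)
examples = [
    (r"a\.b", "a.b", "backslash makes the dot literal"),
    (r"a.b", "axb", "dot matches any single character"),
    (r"[0-9]", "7", "character class"),
    (r"[^0-9]", "x", "negated character class"),
    (r"cat|dog", "dog", "alternation"),
    (r"ab*", "abbb", "zero or more of the item on the left"),
    (r"colou?r", "color", "zero or one of the item on the left"),
    (r"a{2,3}", "aaa", "between 2 and 3 repetitions"),
    (r"(ab)*", "abab", "grouping, then repeating the group"),
]

for pattern, text, note in examples:
    print(f"{pattern!r:12} {text!r:8} {matches(pattern, text)}  # {note}")
```

Every sample string in the table is chosen to match its pattern. Changing a sample, for example testing `a\.b` against `axb`, shows the difference between an escaped dot and a bare one.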
Examples of regular expressions
-------------------------------------------
1. Matching a word, even if it is misspelled
For example, consider the word “separate”. The most common spelling mistakes swap the ‘a’ and ‘e’ on either side of the letter ‘r’. A regular expression that matches the correct spelling and these misspellings is
sep[ae]r[ae]te
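As a quick check of the pattern for “separate”, the sketch below scans a sentence for all four spellings. The word boundaries (`\b`) and the function name `find_spellings` are assumptions added here for illustration; the pattern itself is the one given above.

```python
import re

# The original pattern, wrapped in \b so it only matches whole words (an added assumption).
SEPARATE = re.compile(r"\bsep[ae]r[ae]te\b")

def find_spellings(text: str) -> list[str]:
    """Return every spelling variant of 'separate' found in `text`."""
    return SEPARATE.findall(text)

print(find_spellings("separate seperate separete seperete sepirate"))
```

The last word, `sepirate`, is skipped because `i` is not in the class `[ae]`.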
2. Checking for an identifier in a programming language
An identifier is a name that contains letters, digits, and underscores but always starts with a letter or an underscore. The regex to match this pattern is given below.
[A-Za-z_][A-Za-z_0-9]*
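To validate a whole name rather than find an identifier somewhere inside a longer string, the pattern can be applied with a full match. A minimal sketch, assuming Python's `re` module; the function name `is_identifier` is hypothetical.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

def is_identifier(name: str) -> bool:
    """Return True if `name` consists entirely of one valid identifier."""
    return IDENTIFIER.fullmatch(name) is not None

for candidate in ["_tmp1", "count", "9lives", "my-var"]:
    print(candidate, is_identifier(candidate))
```

Using `fullmatch` matters here: a plain search would accept `my-var` because `my` alone is a valid identifier.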
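3. Matching HTML tags
A pair of opening and closing HTML tags usually looks like <TAG></TAG>. The regex below matches such a pair, without handling nested tags. The backreference \1 requires the closing tag name to repeat the opening tag name. As written, it matches uppercase tag names only, so enable case-insensitive matching for lowercase tags.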
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
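The sketch below applies the tag pattern with case-insensitive matching, so that lowercase tags are found as well. The flag choice and the function name `extract_tag_pairs` are assumptions for illustration. A regex like this is only suitable for simple, non-nested markup.

```python
import re

# re.IGNORECASE lets [A-Z] match lowercase tag names too (an added assumption).
TAG_PAIR = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>", re.IGNORECASE)

def extract_tag_pairs(html: str) -> list[tuple[str, str]]:
    """Return (tag name, inner text) for each matching open/close pair."""
    return [(m.group(1), m.group(2)) for m in TAG_PAIR.finditer(html)]

print(extract_tag_pairs('<b>bold</b> and <a href="x">link</a>'))
print(extract_tag_pairs("<b>text</i>"))  # mismatched tags: no pair is found
```

Group 1 holds the tag name and group 2 the text between the tags. The lazy quantifier `.*?` stops at the first closing tag with the same name.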
4. IP address matching
The regex below restricts each of the four numbers in the IP address to the range 0 to 255 and disallows leading zeros. It is split across four lines here and should be joined into a single pattern.
\b(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.
(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.
(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.
(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\b
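Because the same octet sub-pattern repeats four times, the full expression can be assembled from one piece. A minimal sketch in Python; the names `OCTET`, `IPV4`, and `find_ipv4` are hypothetical, and the resulting pattern is intended to be equivalent to the four lines above joined together.

```python
import re

# One octet: 250-255, 200-249, 100-199, or 0-99 without a leading zero.
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"

# Four octets separated by literal dots, bounded by word boundaries.
IPV4 = re.compile(r"\b" + r"\.".join([OCTET] * 4) + r"\b")

def find_ipv4(text: str) -> list[str]:
    """Return every dotted-quad address found in `text`."""
    return [m.group(0) for m in IPV4.finditer(text)]

print(find_ipv4("ping 192.168.0.1 and 10.0.0.255"))
print(find_ipv4("256.1.1.1"))  # 256 is out of range, so nothing matches
```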
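5. Matching floating-point numbers
The regex below matches an optional sign, optional integer digits, an optional decimal point, and at least one trailing digit.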
[-+]?[0-9]*\.?[0-9]+
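A few sample inputs show which strings the floating-point pattern accepts when it is applied to a whole string. This is a sketch using Python's `re`; the function name `is_float` is an assumption.

```python
import re

FLOAT = re.compile(r"[-+]?[0-9]*\.?[0-9]+")

def is_float(text: str) -> bool:
    """Return True if the whole of `text` matches the floating-point pattern."""
    return FLOAT.fullmatch(text) is not None

for sample in ["3.14", "-.5", "+42", "5.", "abc"]:
    print(sample, is_float(sample))
```

Note that plain integers such as `+42` are accepted because the decimal point is optional, while `5.` is rejected because at least one digit must follow the point.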
6. Validating an email address
^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$
7. Matching a valid date
To match a date in mm/dd/yyyy format, the following regex can be used.
(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d
8. Mastercard numbers
Mastercard numbers begin with either 51-55 or 2221-2720 and contain 16 digits. This can be matched using the following regex.
^(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}$
9. Matching an entire line containing a word
Here the whole line containing the word “example” will be matched.
^.*example.*$
10. C-style hexadecimal numbers
The regex matching a C-style hexadecimal number is
\b0[xX][0-9a-fA-F]+\b
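The email, date, Mastercard, whole-line, and hexadecimal patterns above can be exercised together. This is a minimal sketch in Python; the constant names and all sample inputs are illustrative assumptions, and the card numbers are made-up digit strings rather than real accounts. The whole-line pattern is compiled with `re.MULTILINE` so that `^` and `$` anchor to each line.

```python
import re

EMAIL = re.compile(r"^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$")
DATE = re.compile(r"(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d")
MASTERCARD = re.compile(
    r"^(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}$"
)
# MULTILINE makes ^ and $ match at the start and end of every line.
LINE_WITH_WORD = re.compile(r"^.*example.*$", re.MULTILINE)
HEX = re.compile(r"\b0[xX][0-9a-fA-F]+\b")

print(bool(EMAIL.match("jane.doe+tag@example.com")))       # well-formed address
print(bool(DATE.fullmatch("12/31/1999")))                  # month 12, day 31
print(bool(DATE.fullmatch("13/01/2020")))                  # month 13 is rejected
print(bool(MASTERCARD.match("5555555555554444")))          # starts with 55
print(bool(MASTERCARD.match("4111111111111111")))          # starts with 4: rejected
print(LINE_WITH_WORD.findall("first line\nan example here\nlast"))
print(HEX.findall("mask = 0xFF & 0x1a; bad = 0xZZ"))
```

The date pattern checks only the shape of the date. It does not know how many days each month has, so a string such as 02/31/2020 would still pass.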