This Python tutorial concentrates on regular expressions (regex). A regular expression is a pattern you specify, using special characters to represent combinations of characters, digits, and words.
The first part of this tutorial covers the basics of regular expressions. If you are already familiar with the basics, you can jump directly to the Python regex section of this tutorial.
Introduction to regular expressions
A regular expression can be as simple as a series of characters that match a given word. For example, the following pattern matches the word “hat”; no surprise there.
hat
But what if you wanted to match a larger set of words? For example, let’s say you wanted to match the following combination of letters:
- Match a “h” character.
- Match any number of “a” characters, but at least one.
- Match a “t” character.
Here’s the regular expression that implements these criteria:
ha+t
Here,
- Literal characters, such as “h” and “t” in this example, must be matched exactly.
- The plus sign (+) is a special character. It does not cause the regular-expression processor to look for a plus sign. Instead, it forms a subexpression, together with “a”, that says, “Match one or more ‘a’ characters.”
The pattern ha+t therefore matches any of the following:
hat haat haaat haaaat ..
This was just an overview of regular expressions. Next we will look at the different meta characters that we can use with Python regex.
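The behaviour of ha+t can be checked with Python's re module, which is covered later in this tutorial. This is a minimal sketch; the word list is made up for illustration, and re.fullmatch is used so that the whole word has to match the pattern.

```python
import re

# Candidate words are illustrative only.
words = ["hat", "haat", "haaaat", "ht", "hot"]

# re.fullmatch succeeds only if the entire string matches the pattern.
matched = [w for w in words if re.fullmatch(r"ha+t", w)]
print(matched)  # ['hat', 'haat', 'haaaat']
```

"ht" is rejected because + needs at least one "a", and "hot" is rejected because "o" is not "a".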
Different meta characters
These are tools for specifying either a specific character or one of a number of characters, such as “any digit” or “any alphanumeric character.” Each of these characters matches one character at a time.
Character | Name | Description |
---|---|---|
. | Dot (Period) | Matches any one character except a newline. If the DOTALL flag is enabled, it matches any character at all. |
^ | Caret | Matches the beginning of the string. If the MULTILINE flag is enabled, it also matches at the beginning of each line (immediately after a newline). |
$ | Dollar | Matches the end of the string. If the MULTILINE flag is enabled, it also matches at the end of each line (just before a newline). |
[] | Square brackets | A set of characters you wish to match. |
\ | Backslash | Used to escape various characters. One of the functions of escape sequences is to turn a special character back into a literal character. |
expr* | Wild character (Star) | Modifies the meaning of expression expr so that it matches zero or more occurrences rather than one. For example, a* matches “a”, “aa”, and “aaa”, as well as an empty string. |
expr+ | Plus | Modifies the meaning of expression expr so that it matches one or more occurrences rather than only one. For example, a+ matches “a”, “aa”, and “aaa”. |
expr{n} | Curly braces | Modifies the expression so that it matches exactly n occurrences of expr. For example, a{3} matches “aaa”. |
expr{m,n} | Curly braces | Matches a minimum of m occurrences of expr and a maximum of n. For example, x{2,4}y matches “xxy”, “xxxy”, and “xxxxy”. |
expr{m,} | Curly braces | Matches a minimum of m occurrences of expr with no upper limit to how many can be matched. For example, x{3,} finds a match if it can match the pattern “xxx” anywhere, but it will match more than three if it can. Therefore zx{3,}y matches “zxxxxxy”. |
expr{,n} | Curly braces | Matches a minimum of zero, and a maximum of n, instances of the expression expr. For example, ca{,2}t matches “ct”, “cat”, and “caat” but not “caaat”. |
expr1 \| expr2 | Alternation | Matches a single occurrence of expr1, or a single occurrence of expr2, but not both. For example, a\|b matches “a” or “b”. Note that the precedence of this operator is very low, so cat\|dog matches “cat” or “dog”. |
() | Parentheses | Used to capture and group sub-patterns. |
expr? | Question mark | Modifies the meaning of expression expr so that it matches zero or one occurrence of expr. For example, a? matches “a” or an empty string. |
. - dot (period)
A dot . matches any one character except a newline character.
Pattern | What does pattern mean? | String | Match? | Description |
---|---|---|---|---|
c..f | c matches the character c literally (case sensitive); each . matches any character (except for line terminators); f matches the character f literally (case sensitive) | cdef | YES | Exactly two characters between c and f |
c..f | | c12f | YES | Exactly two characters between c and f |
c..f | | abcdefgh | YES | The starting and ending characters of the string don't matter. There are exactly two characters between c and f |
c..f | | conef | NO | More than 2 characters between c and f |
c..f | | c1f | NO | Less than 2 characters between c and f |
c..f | | Cdef | NO | C is uppercase while the pattern contains lowercase c |
^ - caret
A caret sign ^ matches the beginning of the string.
Pattern | What does pattern mean? | String | Match? | Description |
---|---|---|---|---|
^a | ^ asserts position at the start of the string; a matches the character a literally (case sensitive) | apple | YES | The first character a of apple matches the pattern |
^a | | abcd | YES | The first character a of abcd matches the pattern |
^a | | Abcd | NO | The first character of Abcd is uppercase A while the pattern has lowercase a |
^a | | Cat | NO | The first character of Cat is C, which does not match the pattern |
^en | ^ asserts position at the start of the string; en matches the characters en literally (case sensitive) | end | YES | The first two characters of end match the pattern en |
^en | | east | NO | The first two characters of east do not match the pattern en |
^en | | bend | NO | The first two characters of bend do not match the pattern en |
^en | | End | NO | The pattern expects the first two characters to be en in lowercase while the string has uppercase E |
$ - dollar
The dollar symbol $ is used to check if a string ends with the provided expression.
Pattern | What does pattern mean? | String | Match? | Description |
---|---|---|---|---|
e$ | e matches the character e literally (case sensitive); $ asserts position at the end of the string | apple | YES | The last character of apple matches the pattern |
e$ | | ABCDE | NO | The last character of ABCDE is uppercase while the pattern expects lowercase e at the end |
e$ | | e | YES | The string contains the single character e, which is both the first and the last character, so it is a match |
ee$ | ee matches the characters ee literally (case sensitive); $ asserts position at the end of the string | tree | YES | The last two characters of tree match the pattern ee |
ee$ | | eye | NO | The pattern expects ee at the end of the string; eye ends with a single e, so the match fails |
[] - square brackets
You can add a set of characters inside square brackets which you wish to match.
Pattern | What does pattern mean? | String | Match? | Description |
---|---|---|---|---|
BL460C_G[789]_DISK | BL460C_G matches the characters BL460C_G literally (case sensitive); [789] matches a single character present in the list; _DISK matches the characters _DISK literally (case sensitive) | BL460C_G9_DISK | YES | 9 is a matching character in the list [789] while the other characters match literally |
BL460C_G[789]_DISK | | BL460C_G8_DISK | YES | 8 is a matching character in the list [789] while the other characters match literally |
BL460C_G[789]_DISK | | BL460C_G5_DISK | NO | 5 is not part of the list [789], so even though the rest of the characters match, this is not a match |
BL460C_G[789]_DISK | | BL460C_G78_DISK | NO | 7 and 8 are both in the list [789], but the set matches only a single character and here there are two, so it is not a match |
\ - backslash
The backslash can be used to “escape” special characters, making them into literal characters. The backslash can also add special meaning to certain ordinary characters, for example causing \d to mean “any digit” rather than a “d”. We will learn about these special sequences later in this tutorial.
Pattern | What does pattern mean? | String | Match? | Description |
---|---|---|---|---|
p@$$w0rd | p@ matches the characters p@ literally (case sensitive); each $ asserts position at the end of a line; w0rd matches the characters w0rd literally (case sensitive) | p@$$w0rd | NO | In the pattern, $ means end of line, so it won't match p@$$w0rd from the string |
p@\$\$w0rd | p@ matches the characters p@ literally (case sensitive); each \$ matches the character $ literally; w0rd matches the characters w0rd literally (case sensitive) | p@$$w0rd | YES | Because the backslash escapes each $, it is now treated as a literal character instead of a meta character |
* - Wild character (Star)
The asterisk (*) modifies the meaning of the expression immediately preceding it, so the a, together with the *, matches zero or more “a” characters.
Pattern | What does pattern mean? | String | Match? | Description |
---|---|---|---|---|
ca*t | c matches the character c literally (case sensitive); a* matches the character a between zero and unlimited times, as many times as possible, giving back as needed; t matches the character t literally (case sensitive) | cat | YES | a is present once (zero or more times is allowed) and is followed by t |
ca*t | | ct | YES | a can be present zero or more times in our string, so this is also a match |
ca*t | | caat | YES | a can be present zero or more times, and here a is present twice followed by t, so this is a match |
ca*t | | caaats | YES | a can be present zero or more times, and here a is present thrice followed by t, so this is a match |
ca*t | | castle | NO | a is present but it is not followed by t, so not a match |
ca*t | | cart | NO | a is present but it is not followed by t, so not a match |
+ - plus
The plus (+) sign matches one or more occurrences of the preceding expression.
Pattern | What does pattern mean? | String | Match? | Description |
---|---|---|---|---|
ca+t | c matches the character c literally (case sensitive); a+ matches the character a between one and unlimited times, as many times as possible, giving back as needed; t matches the character t literally (case sensitive) | cat | YES | a is present once (one or more times is required) and is followed by t |
ca+t | | ct | NO | a must be present one or more times in our string, so this is not a match |
ca+t | | caat | YES | a can be present one or more times, and here a is present twice followed by t, so this is a match |
ca+t | | caaats | YES | a can be present one or more times, and here a is present thrice followed by t, so this is a match |
ca+t | | castle | NO | a is present one or more times but it is not followed by t, so not a match |
ca+t | | cart | NO | a is present one or more times but it is not followed by t, so not a match |
? - question
This means that the preceding expression can be present zero or one times only. This is helpful when a certain character in a string may or may not be there.
Pattern | What does pattern mean? | String | Match? | Description |
---|---|---|---|---|
cas?t | ca matches the characters ca literally (case sensitive); s? matches the character s between zero and one times; t matches the character t literally (case sensitive) | cat | YES | s is present zero times so it is a match |
cas?t | | cast | YES | s is present one time so it is a match |
cas?t | | casst | NO | s is present more than one time so it is not a match |
{} - curly braces
You can use curly braces to match a specified number of occurrences in a string. {n} matches exactly n occurrences, while {n,m} matches between n (the minimum) and m (the maximum) occurrences.
Pattern | What does pattern mean? | String | Match? | Description |
---|---|---|---|---|
al{2} | a matches the character a literally (case sensitive); l{2} matches the character l exactly 2 times | call | YES | The a character is followed by l two times as expected by the pattern |
al{2} | | tale | NO | The a character is followed by l, but l is present only a single time while {2} expects l to be present exactly 2 times |
al{2} | | falls | YES | The a character is followed by l and l is present 2 times, so it is a match |
al{2} | | troll | NO | l is present two times as expected by {2}, but the a character is missing before l, so it is not a match |
| - alternation
The alternation operator matches a single occurrence of expr1, or a single occurrence of expr2, but not both.
Pattern | What does pattern mean? | String | Match? | Description |
---|---|---|---|---|
cat\|dog | 1st alternative: cat matches the characters cat literally (case sensitive); 2nd alternative: dog matches the characters dog literally (case sensitive) | cattle | YES | The pattern will match cat in the string cattle |
cat\|dog | | boggy | NO | As there is no cat or dog in the string boggy, there is no match |
cat\|dog | | doggy | YES | The pattern will match dog in the string doggy |
cat\|dog | | battle | NO | As there is no cat or dog in the string battle, there is no match |
() - parenthesis (group)
(expr) causes the regular-expression evaluator to look at all of expr as a single group. There are two major purposes for doing so. First, a quantifier applies to the expression immediately preceding it; but if that expression is a group, the entire group is referred to. For example, (ab)+ matches “ab”, “abab”, “ababab”, and so on. Second, the group is captured (tagged), so the text it matched can be referred to later, for example in a replacement string.
Pattern | What does pattern mean? | String | Match? | Description |
---|---|---|---|---|
(ac)t | ac matches the characters ac literally (case sensitive); t matches the character t literally (case sensitive) | fact | YES | The pattern ac is present followed by t, hence this is a match |
(ac)t | | cat | NO | The characters a and c must appear in that order because ac is a group, so this is not a match |
(a\|c)t | 1st alternative: a matches the character a literally (case sensitive); 2nd alternative: c matches the character c literally (case sensitive); t matches the character t literally (case sensitive) | fact | YES | As we are using alternation within the group, either a or c followed by t should be present, so it is a match |
(a\|c)t | | cat | YES | Here also the pattern expects either a or c followed by t in the string, so this is also a match |
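A handful of rows from the tables above can be spot-checked with re.search, which is covered later in this tutorial. This is only a sketch: the helper name is_match and the selection of rows are choices made for illustration.

```python
import re

def is_match(pattern, text):
    """Return True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# (pattern, string, expected result) taken from the tables above
checks = [
    (r"c..f", "abcdefgh", True),
    (r"c..f", "conef", False),
    (r"^en", "bend", False),
    (r"ee$", "tree", True),
    (r"BL460C_G[789]_DISK", "BL460C_G78_DISK", False),
    (r"p@\$\$w0rd", "p@$$w0rd", True),
    (r"ca*t", "ct", True),
    (r"ca+t", "ct", False),
    (r"cas?t", "casst", False),
    (r"al{2}", "tale", False),
    (r"cat|dog", "doggy", True),
    (r"(a|c)t", "cat", True),
]

for pattern, text, expected in checks:
    result = is_match(pattern, text)
    print(f"{pattern!r} on {text!r}: {result} (expected {expected})")
```

Each printed result should agree with the YES/NO column of the corresponding table row.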
Special Sequences
These are predefined special sequences which can be used to capture different types of patterns in a string.
Write these sequences in raw strings (r'...') so that Python does not first process the backslash itself, for example by reading \n as a newline.
Special character | Description |
---|---|
\A | Matches the beginning of a string |
\b | Word boundary, which returns a match where the specified characters are at the beginning or at the end of a word. For example, r'at\b' matches 'cat' and 'at' but not 'cats'. |
\B | Nonword boundary, which returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word. For example, r'at\B' matches bats and atlanta but not cat |
\d | Any digit character. This includes the digit characters 0 through 9. |
\D | Any character that is not a digit |
\s | Any whitespace character; may be a blank space or any of the following: \t, \n, \r, \f, or \v |
\S | Any character that is not a whitespace character, as defined just above. |
\w | Matches any alphanumeric character (letter or digit) or an underscore (_) |
\W | Matches any character that is not alphanumeric and not an underscore |
\Z | Matches the end of a string |
Examples
Here I have consolidated all these special sequences with different examples to give you an overview of each one:
Regex | Pattern | Test String | Match? | Explanation |
---|---|---|---|---|
\A | r'\Ais' | is this good? | YES | Since is is at the start of the string, it is a match |
\A | r'\Ais' | I hope this is good | NO | Since is is not at the start of the string, there is no match |
\b | r'\bPython' | Python is easy | YES | Python is at the beginning of a word so it is a match |
\b | r'\bPython' | How easy is Python | YES | Python is again at the beginning of a word so it is a match. A word boundary applies to a word and not a sentence, so it doesn't matter that Python is not at the start of the sentence |
\b | r'\bPython' | Python2 is easy | YES | Python is at the beginning of the word Python2 so it is a match. It doesn't matter whether Python2 is at the beginning of the sentence |
\b | r'\bPython' | Download iPython | NO | Python is not at the beginning of the word iPython so no match |
\b | r'Python\b' | Python is easy | YES | Python is itself a whole word so it is a match |
\b | r'Python\b' | How easy is Python | YES | Again Python itself is a word in the sentence; its position in the sentence doesn't matter, so this is a match |
\b | r'Python\b' | Python2 is easy | NO | Python is followed by 2, so it is not at the end of a word and this is not a match |
\b | r'Python\b' | Download iPython | YES | The word iPython ends with Python so this is a match |
\B | r'\BPython' | Python is easy | NO | \B asserts a position where \b does not match |
\B | r'\BPython' | Python2 is easy | NO | |
\B | r'\BPython' | Download iPython | YES | |
\B | r'Python\B' | Python is easy | NO | \B asserts a position where \b does not match |
\B | r'Python\B' | Python2 is easy | YES | |
\B | r'Python\B' | Download iPython | NO | |
\d | r'\d' | Passw0rd | YES | The string contains a numerical digit between 0-9 |
\d | r'\d' | Password | NO | There is no numerical digit in the provided string |
\D | r'\D' | Passw0rd | YES | \D matches any character that's not a digit (equal to [^0-9]) |
\D | r'\D' | 12345 | NO | |
\s | r'\s' | Hello World | YES | \s matches any whitespace character (equal to [\r\n\t\f\v ]) |
\s | r'\s' | HelloWorld | NO | No whitespace character |
\S | r'\S' | Hello World | YES | \S is the opposite of \s |
\S | r'\S' | | NO | |
\w | r'\w' | Passw0rd_123 | YES | \w matches any word character (equal to [a-zA-Z0-9_]) |
\w | r'\w' | [{(<>)}] | NO | |
\W | r'\W' | Passw0rd_123 | NO | \W is the opposite of \w |
\W | r'\W' | [{(<>)}] | YES | |
\Z | r'Python\Z' | I like Python | YES | \Z asserts position at the end of the string |
\Z | r'Python\Z' | Python is easy | NO | |
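The word-boundary rows above are the easiest ones to get wrong, so here is a small sketch that evaluates \b on both sides of Python for three of the test strings. The dictionary name results is an assumption made for illustration.

```python
import re

tests = ["Python is easy", "Python2 is easy", "Download iPython"]

# For each string: (does r'\bPython' match?, does r'Python\b' match?)
results = {
    text: (
        bool(re.search(r"\bPython", text)),
        bool(re.search(r"Python\b", text)),
    )
    for text in tests
}

for text, (starts_word, ends_word) in results.items():
    print(f"{text!r}: starts a word={starts_word}, ends a word={ends_word}")
```

Python2 starts a word but does not end one at the n, while iPython ends a word but does not start one at the P.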
Python regex
- The re module supplies the attributes listed in the earlier section.
- It also provides a function that corresponds to each method of a regular expression object (findall, match, search, split, sub, and subn), each with an additional first argument, a pattern string that the function implicitly compiles into a regular expression object.
- It’s generally preferable to compile pattern strings into regular expression objects explicitly and call the regular expression object’s methods, but sometimes, for a one-off use of a regular expression pattern, calling functions of module re can be slightly handier.
Iterative searching with re.findall
One of the most common search tasks is to find all substrings matching a particular pattern. The syntax to use findall would be:
matches = re.findall(pattern, target_string, flags=0)
In this syntax, pattern is a regular-expression string or precompiled object, target_string is the string to be searched, and flags is optional. The return value of re.findall is a list of strings, each string containing one of the substrings found. These are returned in the order found.
Example-1: Find all the digits in a string
In this example we will search for all the digits in the provided string.
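A minimal sketch of such a script, assuming a sample string chosen here for illustration:

```python
import re

# Hypothetical sample text; any string containing digits would do.
text = "ab12cd123ef78gh456"

# \d+ matches each run of one or more consecutive digits.
digits = re.findall(r"\d+", text)
print(digits)  # ['12', '123', '78', '456']
```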
Here we are using the \d sequence with re.findall to find all the digits. '\d+' means one or more consecutive digits, so each run of digits in the provided string is returned as one item. Output from this script:
~]# python3 regex-eg-1.py
['12', '123', '78', '456']
Example-2: Find words with 6 or more characters
In this example we will write sample code to find all the words with 6 or more characters in the provided string using re.findall.
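A minimal sketch of such a script, assuming a sample sentence made up for illustration:

```python
import re

# Hypothetical sample text mixing short and long words.
text = "testing45 is test37 for testing1456 abc"

# \w{6,} matches runs of at least six word characters.
long_words = re.findall(r"\w{6,}", text)
print(long_words)  # ['testing45', 'test37', 'testing1456']
```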
Here we are using re.findall with \w to match alphanumeric characters, in combination with {6,} to list words with at least 6 characters. Output from this script:
~]# python3 regex-eg-2.py
['testing45', 'test37', 'testing1456']
Example-3: Split all characters in the string
We have a string which contains numbers and mathematical operators, and we want each number and each operator to be returned as a separate string in a list.
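A minimal sketch of such a script, assuming a sample expression string and a pattern chosen for illustration:

```python
import re

# Hypothetical sample text with numbers and operators.
text = "12 15 + 3 100 - *"

# Either a run of digits, or a single operator character.
# The minus sign sits at the end of the set so it is taken literally.
tokens = re.findall(r"\d+|[+*/-]", text)
print(tokens)  # ['12', '15', '+', '3', '100', '-', '*']
```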
Here we are using re.findall to get a list of strings, one for each number or operator in the provided text. Output from this script:
~]# python3 regex-eg-3.py
['12', '15', '+', '3', '100', '-', '*']
The minus sign (-) has a special meaning within square brackets unless it appears at the very beginning or end of the set, which is why we have placed it accordingly in our sample code.
Example-4: Find all the vowels from the string
In this example we will identify all the vowels from the provided string:
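A minimal sketch, assuming a sample string made up for illustration. The character set [aeiou] lists only lowercase vowels, so uppercase vowels are skipped:

```python
import re

# Hypothetical sample text containing lowercase and uppercase vowels.
text = "This is some TXT: ABCDE text"

# [aeiou] matches one lowercase vowel at a time.
vowels = re.findall(r"[aeiou]", text)
print(vowels)  # ['i', 'i', 'o', 'e', 'e']
```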
Example-5: Find vowels case-insensitive
In the last example we listed the vowels from a string, but the match was case sensitive: any UPPERCASE vowels in the text would not be matched. To perform a case-insensitive match we need to add the IGNORECASE flag using flags=re.I or flags=re.IGNORECASE.
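A minimal sketch, assuming a sample string made up for illustration:

```python
import re

# Hypothetical sample text containing lowercase and uppercase vowels.
text = "This is some TXT: ABCDE text"

# flags=re.I makes [aeiou] match uppercase vowels as well.
vowels = re.findall(r"[aeiou]", text, flags=re.I)
print(vowels)  # ['i', 'i', 'o', 'e', 'A', 'E', 'e']
```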
Output from this script:
~]# python3 regex-eg-5.py
['i', 'i', 'o', 'e', 'A', 'E', 'e']
The re.split function
Another way to invoke regular expressions to help analyze text into tokens is to use the re.split function. The general syntax to use re.split would be:
tokens = re.split(pattern, string, maxsplit=0, flags=0)
In this syntax,
- pattern is a regular-expression pattern supporting all the grammar shown until now; however, it doesn’t specify a pattern to find but to skip over. All the text in between is considered a token. So the pattern is really representative of token separators, and not the tokens themselves.
- The string, as usual, is the target string to split into tokens.
- The maxsplit argument specifies the maximum number of splits to perform; the rest of the string is returned as the final element of the list. If this argument is set to 0, the default, then there is no maximum number.
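The effect of maxsplit is easiest to see side by side. A short sketch, with a comma-separated sample string made up for illustration:

```python
import re

# Hypothetical comma-separated text.
text = "a, b, c, d"

# No limit: split at every separator.
all_tokens = re.split(r",\s*", text)
print(all_tokens)  # ['a', 'b', 'c', 'd']

# maxsplit=2: only the first two separators are used,
# and the unsplit remainder becomes the last element.
limited = re.split(r",\s*", text, maxsplit=2)
print(limited)  # ['a', 'b', 'c, d']
```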
Example-1: Split using whitespace
In this example we have a string where we will split the line using whitespace
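A minimal sketch of such a script, assuming an input line made up for illustration:

```python
import re

# Hypothetical input line of key="value" pairs separated by spaces.
line = 'NAME="/dev/sda" PARTLABEL="" TYPE="disk"'

# Split wherever one or more whitespace characters occur.
fields = re.split(r"\s+", line)
print(fields)  # ['NAME="/dev/sda"', 'PARTLABEL=""', 'TYPE="disk"']
```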
Output from this script:
~]# python3 regex-eg-1.py
['NAME="/dev/sda"', 'PARTLABEL=""', 'TYPE="disk"']
Example-2: Split using whitespace from a file
In this example we will create a list of elements using whitespace as the splitting pattern. We will save the output of the who command into a file who.txt
~]# who > who.txt
Now using our Python script we will split each line into a list.
    #!/usr/bin/env python3
    import re

    # Split each line on two or more whitespace characters, or on a tab
    with open('who.txt', 'r') as f:
        for eachline in f:
            print(re.split(r'\s\s+|\t', eachline))
We have defined \s\s+ which means at least two whitespace characters, with an alternation pipe and \t to match a tab. Output from this script:
~]# python3 regex-eg-2.py
['root', 'pts/0', '2020-11-02 12:07 (10.0.2.2)\n']
['root', 'pts/1', '2020-11-02 19:14 (10.0.2.2)\n']
Now at the end of each line we are getting a newline character. To strip it we can use rstrip('\n'), so the updated code would be:
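A sketch of the updated logic. To keep it self-contained, the lines of who.txt are replaced by a hypothetical list of sample lines, and the helper name split_who_line is an assumption made for illustration:

```python
import re

# Hypothetical lines standing in for the contents of who.txt.
sample_lines = [
    "root     pts/0        2020-11-02 12:07 (10.0.2.2)\n",
    "root     pts/1        2020-11-02 19:14 (10.0.2.2)\n",
]

def split_who_line(line):
    """Drop the trailing newline, then split on 2+ whitespace chars or a tab."""
    return re.split(r"\s\s+|\t", line.rstrip("\n"))

for eachline in sample_lines:
    print(split_who_line(eachline))
```

With a real file, the same function can be applied to each line while iterating over the open file object.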
Output from this script:
~]# python3 regex-eg-3.py
['root', 'pts/0', '2020-11-02 12:07 (10.0.2.2)']
['root', 'pts/1', '2020-11-02 19:14 (10.0.2.2)']
Replace text using re.sub()
Another tool is the ability to replace text, that is, text substitution. We might want to replace all occurrences of a pattern with some other pattern. This often involves group tagging (capturing part of the match with parentheses), described in the previous section.
The re.sub function performs text substitution.
re.sub(find_pattern, repl, target_str, count=0, flags=0)
In this syntax, find_pattern is the pattern to look for, repl is the regular-expression replacement string, and target_str is the string to be searched. The last two arguments are both optional.
The return value is the new string, which consists of the target string after the requested replacements have been made.
Example-1: Replace multiple spaces with single space
In this example I have a string with multiple whitespace characters, where we will use re.sub() to replace each run of whitespace with a single space.
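A minimal sketch of such a script, assuming an input string made up for illustration:

```python
import re

# Hypothetical input with uneven runs of spaces between words.
text = "abc   def    ghi  ktm"

# Replace every run of one or more whitespace characters with one space.
cleaned = re.sub(r"\s+", " ", text)
print(cleaned)  # abc def ghi ktm
```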
Output from this script:
~]# python3 regex-eg-1.py
abc def ghi ktm
Example-2: Replace duplicates
In this example I have a string with multiple duplicate words which I wish to replace with single occurrence of each duplicate word.
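A minimal sketch of such a script. The input sentence is made up for illustration, and the \b word-boundary anchors are an addition here so that only whole repeated words are collapsed:

```python
import re

# Hypothetical input with repeated words.
text = "This this is is a a Python tutorial"

# (\w+) tags a word; \1 requires the same word again after a space.
# flags=re.I lets "This" and "this" count as the same word.
deduped = re.sub(r"\b(\w+) \1\b", r"\1", text, flags=re.I)
print(deduped)  # This is a Python tutorial
```

Without the \b anchors, a phrase such as "This is" could be treated as a repeat, because "is" also appears at the end of "This".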
Two points are worth noting. First, the replacement string contains only a reference to the first half of the pattern. That half is a tagged group, so this directs the regular-expression evaluator to note the tagged text and use it as the replacement string:
r'\1'
Second, the repeated-word test on “This this” will fail unless the flags argument is set to re.I (or re.IGNORECASE).
In an ordinary string, Python itself processes the backslash in \1; therefore, if you don’t specify the replacement text as a raw string, nothing works, unless you use the other way of specifying a literal backslash: '\\1'
Output from this script:
~]# python3 regex-eg-2.py
This is a Python tutorial
Searching a string for patterns using re.search
In this section we will learn how to find the first substring that matches a pattern. The re.search function performs this task using the following syntax:
match_obj = re.search(pattern, target_string, flags=0)
In this syntax, pattern is either a string containing a regular-expression pattern or a precompiled regular-expression object; target_string is the string to be searched. The flags argument is optional and has a default value of 0.
The function produces a match object if successful and None otherwise. This function is close to re.match in the way that it works, except it does not require the match to happen at the beginning of the string.
By default re.search scans the complete string but returns only the first match. Let us verify this: here I have a text which contains 'python' two times, so we will use re.search to find the word 'python' in the sentence.
    #!/usr/bin/env python3
    import re

    line = "This is python regex tutorial. We are using python3"
    pat = r'\bpython'  # 'python' at the start of a word
    print(re.search(pat, line))
Output from this script:
~]# python3 regex-eg-1.py
<_sre.SRE_Match object; span=(8, 14), match='python'>
As you can see, the re.search function stopped searching after the first match, i.e. 'python', even though 'python3' was also a match for our pattern.
Match Object
If you observe the output from re.search, we get a bunch of information along with the matched text. To extract just the information we need, we can use the methods of the match object returned by re.search:
    #!/usr/bin/env python3
    import re

    line = "This is python regex tutorial. We are using python3"
    pat = r'\bpython'
    match_ob = re.search(pat, line)

    print('matched from the pattern: ', match_ob.group())
    print('starting index: ', match_ob.start())
    # end() is one past the last matched character, so subtract 1
    print('ending index: ', match_ob.end()-1)
    print('Length: ', match_ob.end() - match_ob.start())
Here I am printing different information from the match object returned by re.search, using its index positions. Output from this script:
~]# python3 regex-eg-2.py
matched from the pattern: python
starting index: 8
ending index: 13
Length: 6
Example-1: Search for a pattern in a log file
Let us take a practical example where we will go through a log file and print all the lines with text CRITICAL
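A minimal sketch of such a script. To keep it self-contained, the log file is replaced by a hypothetical list of lines; the INFO and DEBUG lines are made up so that there is something to filter out:

```python
import re

# Hypothetical log lines standing in for a real log file; in practice
# these would come from iterating over an open file object.
log_lines = [
    "2019-11-13T13:03:01Z INFO Starting transaction",
    "2019-11-13T13:03:03Z CRITICAL Error: Unable to find a match",
    "2019-11-13T13:03:14Z CRITICAL Importing GPG key 0x8483C65D:",
    "2019-11-13T13:10:59Z DEBUG Loading repositories",
    "2019-11-13T13:11:06Z CRITICAL Error: No Matches found",
]

# Keep only the lines where the word CRITICAL appears anywhere.
critical = [line for line in log_lines if re.search(r"\bCRITICAL\b", line)]

for line in critical:
    print(line)
```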
Output from this script:
~]# python3 regex-eg-1.py
2019-11-13T13:03:03Z CRITICAL Error: Unable to find a match
2019-11-13T13:03:14Z CRITICAL Importing GPG key 0x8483C65D:
2019-11-13T13:11:06Z CRITICAL Error: No Matches found
Refining matches with re.match
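The re.match function returns either a match object, if it succeeds, or the special object None, if it fails. The syntax to use re.match would be:
re.match(pattern, target_string, flags=0)
It returns a match object when the pattern matches at the very beginning of target_string. Otherwise, match returns None. A compiled regular expression object r offers the same operation as r.match(s, pos, endpos), which returns a match object when a substring of s, starting at index pos and not reaching as far as index endpos, matches r.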
Let us reuse the example from the previous section, to which I have added an if/else block:
    #!/usr/bin/env python3
    import re

    line = "This is python regex tutorial. We are using python3"
    pat = r'\bpython'
    match_ob = re.match(pat, line)

    # re.match returns None when the pattern is not found at the start
    if match_ob:
        print(match_ob)
        print('matched from the pattern: ', match_ob.group())
        print('starting index: ', match_ob.start())
        print('ending index: ', match_ob.end()-1)
        print('Length: ', match_ob.end() - match_ob.start())
    else:
        print('No match found')
Here, instead of re.search, we use re.match to find our pattern in the provided string. Output from this script:
~]# python3 regex-eg-1.py
No match found
Even though we have python in our string, re.match returns "No match found". This is because re.match only tries to match at the beginning of the string (index 0). So to get a match we will rephrase the text in our script:
    #!/usr/bin/env python3
    import re

    line = "python regex tutorial. We are using python3"
    pat = r'\bpython'
    match_ob = re.match(pat, line)

    if match_ob:
        print(match_ob)
        print('matched from the pattern: ', match_ob.group())
        print('starting index: ', match_ob.start())
        print('ending index: ', match_ob.end()-1)
        print('Length: ', match_ob.end() - match_ob.start())
    else:
        print('No match found')
Output from this script:
~]# python3 regex-eg-1.py
<_sre.SRE_Match object; span=(0, 6), match='python'>
matched from the pattern: python
starting index: 0
ending index: 5
Length: 6
So now re.match was able to match the pattern, since the pattern was present at index position 0. The basic difference between re.search and re.match is that re.match looks for the pattern only at the beginning of the string, while re.search looks for the pattern anywhere in the string.
Example-1: Match for a telephone number
In this example we will collect a telephone number from the user and use re.match to confirm whether the syntax of the input number is correct. Normally in the US, the telephone number syntax is:
xxx-xxx-xxxx
Sample script:
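A minimal sketch of such a check, using the pattern described in this section. The helper name check_phone is an assumption made for illustration, and fixed sample values are used in place of interactive input so the sketch runs unattended:

```python
import re

# Three digits, a hyphen, three digits, a hyphen, then three or four digits.
PHONE_PATTERN = r"\d{3}[-]\d{3}[-]\d{3,4}"

def check_phone(number):
    """Return 'Match' if number begins with the phone pattern, else 'No Match'."""
    return "Match" if re.match(PHONE_PATTERN, number) else "No Match"

for candidate in ["123-456-1111", "1234-123-111"]:
    print(candidate, "->", check_phone(candidate))
```

Note that re.match only anchors the start of the string, so trailing characters after a valid number are still accepted; re.fullmatch could be used instead if the whole input must fit the pattern.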
Here,
- \d{3} matches a digit (equal to [0-9]), where the {3} quantifier matches exactly 3 times
- [-] matches a single character present in the list, where - matches the character - literally (case sensitive)
- \d{3} matches a digit (equal to [0-9]), where the {3} quantifier matches exactly 3 times
- [-] matches a single character present in the list, where - matches the character - literally (case sensitive)
- \d{3,4} matches a digit (equal to [0-9]), where the {3,4} quantifier matches between 3 and 4 times, as many times as possible, giving back as needed (greedy)
Output from this script for different inputs:
~]# python3 regex-eg-1.py
Match
~]# python3 regex-eg-1.py
Enter telephone number: 123-456-1111
Match
~]# python3 regex-eg-1.py
Enter telephone number: 1234-123-111
No Match
Using re.compile
If you’re going to use the same regular-expression pattern multiple times, it’s a good idea to compile that pattern into a regular-expression object and then use that object repeatedly. The re module provides a function for this purpose called compile, with the following syntax:
regex_object_name = re.compile(pattern)
You will understand better with this example, which uses some of the Python regex functions we learned in this tutorial. In the first half we pass the same pattern string to each function. To avoid this, the second half creates a regex pattern object and then uses that object to perform each search.
    #!/usr/bin/env python3
    import re

    line = "This is python regex tutorial using python3"
    pat = r'\bpython\d'

    # Module-level functions: the pattern string is passed every time
    print('using re.search: ', re.search(pat, line))
    print('using re.findall: ', re.findall(pat, line))
    print('using re.match: ', re.match(pat, line))

    # Compile once, then call the methods of the pattern object
    pat_ob = re.compile(pat)
    print('Pattern Object: ', pat_ob)
    print('using re.compile with re.search: ', pat_ob.search(line))
    print('using re.compile with re.findall: ', pat_ob.findall(line))
    print('using re.compile with re.match: ', pat_ob.match(line))
As you can see, the first section without re.compile and the second section with re.compile produce the same output:
~]# python3 regex-eg-2.py
using re.search: <_sre.SRE_Match object; span=(36, 43), match='python3'>
using re.findall: ['python3']
using re.match: None
Pattern Object: re.compile('\\bpython\\d')
using re.compile with re.search: <_sre.SRE_Match object; span=(36, 43), match='python3'>
using re.compile with re.findall: ['python3']
using re.compile with re.match: None
Conclusion
Python regex is a vast topic, but I have tried to cover the areas that are used most often in code. re.search and re.match can be confusing, but remember that re.match looks for the pattern only at the beginning of the string, while re.search looks for the pattern in the entire string.
We mostly end up using re.compile, although you could perform these tasks without precompiling a regular-expression object. However, compiling can save execution time if you’re going to use the same pattern more than once. Otherwise, Python may have to rebuild a state machine multiple times when it could have been built only once.