Regular Expressions in Python
Regular expressions are special character sequences that let you check whether a string matches a given pattern.
In Python, the re module provides full regular expression support: a set of functions for pattern matching, searching, and replacement in strings.
This section introduces the most commonly used regular expression functions in Python. If you're unfamiliar with regular expressions, you can refer to our Regular Expression Tutorial.
re.match() Function
The re.match() function attempts to match a pattern at the start of the string. If the match succeeds there, it returns a match object; otherwise, it returns None.
Function Syntax:
re.match(pattern, string, flags=0)
Function Parameters:
Parameter | Description |
---|---|
pattern | The regular expression to match. |
string | The string to be matched. |
flags | Flag parameters to control the matching behavior (e.g., case sensitivity, multi-line matching, etc.). Refer to Regular Expression Modifiers - Optional Flags. |
You can use the group(num) or groups() methods of the returned match object to get the matched parts of the expression.
Match Object Methods:
Method | Description |
---|---|
group(num=0) | Returns the entire match when called with no argument or with 0, or the text of group num otherwise. Passing several group numbers returns a tuple containing the corresponding group values. |
groups() | Returns a tuple containing all the captured groups, numbered from 1 up to the number of groups in the pattern. |
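As a minimal sketch of the two methods in the table (the date pattern and the sample string are illustrative and not part of the original text), group() can take one or several group numbers, while groups() returns every captured group at once:

```python
import re

# Hypothetical sample: a date in YYYY-MM-DD form at the start of the string.
m = re.match(r'(\d{4})-(\d{2})-(\d{2})', '2024-05-17 release')

whole = m.group()      # same as m.group(0): the entire match
year = m.group(1)      # text of the first captured group
pair = m.group(1, 3)   # several group numbers return a tuple
parts = m.groups()     # all captured groups as a tuple

print(whole)   # 2024-05-17
print(year)    # 2024
print(pair)    # ('2024', '17')
print(parts)   # ('2024', '05', '17')
```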
Example
    #!/usr/bin/python
    import re

    print(re.match('www', 'www.pmeve.com').span())  # Matches at the start
    print(re.match('com', 'www.pmeve.com'))         # Does not match at the start
The output of the above example is:
    (0, 3)
    None
Another Example
    #!/usr/bin/python3
    import re

    line = "Cats are smarter than dogs"

    # . matches any character except a newline (\n), so .* matches any run of them
    # (.*?) is a non-greedy match that captures as few characters as possible
    matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)

    if matchObj:
        print("matchObj.group() :", matchObj.group())
        print("matchObj.group(1) :", matchObj.group(1))
        print("matchObj.group(2) :", matchObj.group(2))
    else:
        print("No match!!")
The output of the above example is:
    matchObj.group() : Cats are smarter than dogs
    matchObj.group(1) : Cats
    matchObj.group(2) : smarter
re.search() Function
The re.search() function scans the entire string and returns the first successful match.
Function Syntax:
re.search(pattern, string, flags=0)
Function Parameters:
Parameter | Description |
---|---|
pattern | The regular expression to match. |
string | The string to be matched. |
flags | Flag parameters to control the matching behavior (e.g., case sensitivity, multi-line matching, etc.). Refer to Regular Expression Modifiers - Optional Flags. |
If the match is successful, re.search() returns a match object; otherwise, it returns None.
You can use the group(num) or groups() methods of the match object to get the matched parts of the expression.
Example
    #!/usr/bin/python3
    import re

    print(re.search('www', 'www.pmeve.com').span())  # Matches at the start
    print(re.search('com', 'www.pmeve.com').span())  # Matches at a later position
The output of the above example is:
    (0, 3)
    (11, 14)
Another Example
    #!/usr/bin/python3
    import re

    line = "Cats are smarter than dogs"

    searchObj = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)

    if searchObj:
        print("searchObj.group() :", searchObj.group())
        print("searchObj.group(1) :", searchObj.group(1))
        print("searchObj.group(2) :", searchObj.group(2))
    else:
        print("Nothing found!!")
The output of the above example is:
    searchObj.group() : Cats are smarter than dogs
    searchObj.group(1) : Cats
    searchObj.group(2) : smarter
Differences Between re.match() and re.search()
The re.match() function only matches at the beginning of a string. If the string does not start with the pattern, the match fails and the function returns None. On the other hand, re.search() scans the entire string until it finds a match.
Example
    #!/usr/bin/python3
    import re

    line = "Cats are smarter than dogs"

    # Using re.match
    matchObj = re.match(r'dogs', line, re.M | re.I)
    if matchObj:
        print("match --> matchObj.group() :", matchObj.group())
    else:
        print("No match!!")

    # Using re.search
    matchObj = re.search(r'dogs', line, re.M | re.I)
    if matchObj:
        print("search --> matchObj.group() :", matchObj.group())
    else:
        print("No match!!")
Output:
    No match!!
    search --> matchObj.group() : dogs
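One further consequence of this difference can be sketched as follows (the two-line sample text is illustrative): even with re.MULTILINE, re.match() only tries the very start of the string, whereas re.search() with ^ can match at the start of any line.

```python
import re

text = "first line\nsecond line"

# re.match() only ever tries position 0, even with re.MULTILINE.
at_start = re.match(r'second', text, re.MULTILINE)

# re.search() with ^ and re.MULTILINE can match at the start of a later line.
any_line = re.search(r'^second', text, re.MULTILINE)

print(at_start)         # None
print(any_line.span())  # (11, 17)
```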
Search and Replace
Python's re module provides the re.sub() function for replacing matching patterns in a string.
Syntax:
re.sub(pattern, repl, string, count=0, flags=0)
Parameters:
pattern: The regular expression pattern to match.
repl: The string to replace the matched pattern, or a function.
string: The original string to search and replace within.
count: The maximum number of replacements (default is 0, meaning replace all matches).
flags: Optional flags that modify the matching behavior (e.g., case sensitivity).
Example
    #!/usr/bin/python3
    import re

    phone = "2004-959-559 # This is a phone number"

    # Remove comments
    num = re.sub(r'#.*$', "", phone)
    print("Phone number:", num)

    # Remove non-digit characters
    num = re.sub(r'\D', "", phone)
    print("Phone number:", num)
Output:
    Phone number: 2004-959-559
    Phone number: 2004959559
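The count and flags parameters are not exercised above. A small sketch of both, using an illustrative sample string that is not from the original text:

```python
import re

text = "red Red red"

# Default: every case-sensitive match is replaced.
default = re.sub(r'red', 'blue', text)

# count=1 stops after the first replacement.
first_only = re.sub(r'red', 'blue', text, count=1)

# flags=re.IGNORECASE also replaces the capitalized "Red".
all_cases = re.sub(r'red', 'blue', text, flags=re.IGNORECASE)

print(default)     # blue Red blue
print(first_only)  # blue Red red
print(all_cases)   # blue blue blue
```

Passing count and flags by keyword keeps the call readable and avoids mixing up their positions.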
Using a Function in the repl Parameter
In the following example, the matched digits are multiplied by 2:
Example
    #!/usr/bin/python
    import re

    # Function to multiply matched digits by 2
    def double(matched):
        value = int(matched.group('value'))
        return str(value * 2)

    s = 'A23G4HFD567'
    print(re.sub(r'(?P<value>\d+)', double, s))
Output:
    A46G8HFD1134
compile() Function
The compile() function compiles a regular expression pattern into a regular expression object. This object can then be reused through its match() and search() methods.
Syntax:
re.compile(pattern, flags=0)
Parameters:
pattern: The regular expression pattern as a string.
flags: Optional flags to modify the matching behavior, such as ignoring case or using multi-line mode.
Common Flags:
re.IGNORECASE or re.I: Ignore case during matching.
re.LOCALE or re.L: Make special sequences like \w, \W, \b, etc., dependent on the current locale (usable with bytes patterns only).
re.MULTILINE or re.M: Multi-line matching; changes the behavior of ^ and $ to match the start and end of each line.
re.DOTALL or re.S: Make the dot (.) match all characters, including newline.
re.ASCII or re.A: Make special sequences like \w, \W, \d, etc., match only ASCII characters.
re.VERBOSE or re.X: Ignore whitespace and comments in the pattern, making complex expressions easier to read.
You can use flags individually or combine them using the bitwise OR (|) operator. For example, re.IGNORECASE | re.MULTILINE enables both ignore-case and multi-line modes.
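A short sketch of combining flags, assuming an illustrative three-line sample string:

```python
import re

text = "apple\nAPPLE pie\nbanana"

# ^ matches at the start of every line (MULTILINE),
# and "apple" matches regardless of case (IGNORECASE).
pattern = re.compile(r'^apple', re.IGNORECASE | re.MULTILINE)
both = pattern.findall(text)

# With only IGNORECASE, ^ matches at the start of the string alone.
one = re.findall(r'^apple', text, re.IGNORECASE)

print(both)  # ['apple', 'APPLE']
print(one)   # ['apple']
```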
Example
    import re

    pattern = re.compile(r'\d+')  # Compile a pattern to match at least one digit

    m = pattern.match('one12twothree34four')  # No match at the start
    print(m)           # Output: None

    m = pattern.match('one12twothree34four', 3, 10)  # Start matching from '1'
    print(m)           # Prints the match object
    print(m.group(0))  # Output: 12
    print(m.start(0))  # Output: 3
    print(m.end(0))    # Output: 5
    print(m.span(0))   # Output: (3, 5)
When a match is successful, a Match object is returned. The following methods can be used on the match object:
group([group1, ...]): Returns the matched string(s). You can specify group numbers to get specific matched groups.
start([group]): Returns the start index of the substring matched by the group (group defaults to 0, the whole match).
end([group]): Returns the end index of the substring matched by the group (group defaults to 0, the whole match).
span([group]): Returns a tuple containing the start and end positions of the substring matched by the group.
Let's Take a Look at Another Example
Example
    >>> import re
    >>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)  # re.I stands for case-insensitive matching
    >>> m = pattern.match('Hello World Wide Web')
    >>> print(m)      # Successfully matched, returns a Match object
    <re.Match object; span=(0, 11), match='Hello World'>
    >>> m.group(0)    # Returns the entire matched substring
    'Hello World'
    >>> m.span(0)     # Returns the indices of the entire matched substring
    (0, 11)
    >>> m.group(1)    # Returns the first group matched substring
    'Hello'
    >>> m.span(1)     # Returns the indices of the first group matched substring
    (0, 5)
    >>> m.group(2)    # Returns the second group matched substring
    'World'
    >>> m.span(2)     # Returns the indices of the second group matched substring
    (6, 11)
    >>> m.groups()    # Equivalent to (m.group(1), m.group(2), ...)
    ('Hello', 'World')
    >>> m.group(3)    # No third group exists
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IndexError: no such group
findall() Function
The findall() function finds all substrings in the string that match the regular expression pattern and returns them as a list. If the pattern contains one capturing group, the list holds that group's text for each match; if it contains several capturing groups, it returns a list of tuples. If no match is found, it returns an empty list.
Note: match() and search() find a single match, while findall() finds all matches.
Syntax
re.findall(pattern, string, flags=0)
Or
pattern.findall(string[, pos[, endpos]])
Parameters
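pattern: The regex pattern to match.
string: The string to search.
flags: (Optional, re.findall() only) Flags that modify the matching behavior.
pos: (Optional, pattern.findall() only) The starting position in the string. Default is 0.
endpos: (Optional, pattern.findall() only) The ending position in the string. Default is the length of the string.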
Example: Finding All Numbers in a String
    import re

    result1 = re.findall(r'\d+', 'runoob 123 google 456')

    pattern = re.compile(r'\d+')  # Compile pattern to find numbers
    result2 = pattern.findall('runoob 123 google 456')
    result3 = pattern.findall('run88oob123google456', 0, 10)

    print(result1)
    print(result2)
    print(result3)
Output:
    ['123', '456']
    ['123', '456']
    ['88', '12']
Example: Multiple Capturing Groups, Returning a List of Tuples
    import re

    result = re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
    print(result)
Output:
    [('width', '20'), ('height', '10')]
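The shape of the list returned by findall() depends on how many capturing groups the pattern has. A brief sketch that reuses the sample string above with three pattern variants:

```python
import re

text = 'set width=20 and height=10'

no_group = re.findall(r'\w+=\d+', text)        # no groups: whole matches
one_group = re.findall(r'\w+=(\d+)', text)     # one group: only that group's text
two_groups = re.findall(r'(\w+)=(\d+)', text)  # several groups: one tuple per match

print(no_group)    # ['width=20', 'height=10']
print(one_group)   # ['20', '10']
print(two_groups)  # [('width', '20'), ('height', '10')]
```

When parentheses are needed only for grouping, a non-capturing group (?:...) keeps findall() returning whole matches.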
re.finditer
Similar to findall(), the finditer() function finds all substrings in the string that match the regex pattern but returns them as an iterator of match objects.
Syntax
re.finditer(pattern, string, flags=0)
Parameters
pattern: The regex pattern to match.
string: The string to search.
flags: Optional flags to modify matching behavior, such as case-insensitivity.
Example
    import re

    it = re.finditer(r"\d+", "12a32bc43jf3")
    for match in it:
        print(match.group())
Output:
    12
    32
    43
    3
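Because finditer() yields match objects rather than strings, the position of each match is available as well. A minimal sketch using the same sample string:

```python
import re

text = "12a32bc43jf3"

# Collect each matched value together with its (start, end) span.
found = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]

for value, span in found:
    print(value, span)
# 12 (0, 2)
# 32 (3, 5)
# 43 (7, 9)
# 3 (11, 12)
```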
re.split
The split() function splits the string wherever the regex pattern matches, returning a list of substrings.
Syntax
re.split(pattern, string, maxsplit=0, flags=0)
Parameters
pattern: The regex pattern to split by.
string: The string to split.
maxsplit: (Optional) The maximum number of splits to make. maxsplit=1 will split once. Default is 0 (no limit).
flags: Optional flags to modify matching behavior.
Example: Splitting on Non-Word Characters
    >>> import re
    >>> re.split(r'\W+', 'runoob, runoob, runoob.')
    ['runoob', 'runoob', 'runoob', '']
    >>> re.split(r'(\W+)', ' runoob, runoob, runoob.')  # A capturing group keeps the separators in the result
    ['', ' ', 'runoob', ', ', 'runoob', ', ', 'runoob', '.', '']
    >>> re.split(r'\W+', ' runoob, runoob, runoob.', maxsplit=1)
    ['', 'runoob, runoob, runoob.']
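Example: Splitting a String That Does Not Match the Pattern
    >>> re.split('a+', 'hello world')  # The pattern never matches, so no split occurs
    ['hello world']
Note that a pattern that can match an empty string, such as 'a*', behaves differently: since Python 3.7, split() also splits at empty matches, so re.split('a*', 'hello world') splits the string between every character.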
Regular Expression Objects
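re.Pattern
The re.compile() function returns a Pattern object (called RegexObject in the Python 2 documentation).
re.Match
A successful match() or search() call returns a Match object (called MatchObject in the Python 2 documentation), which provides the following methods:
group(): Returns the matched substring.
start(): Returns the starting position of the match.
end(): Returns the ending position of the match.
span(): Returns a tuple containing the (start, end) positions of the match.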
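A small sketch of inspecting both kinds of object, assuming a recent Python 3 where the classes are named Pattern and Match; the pattern and sample string are illustrative:

```python
import re

pattern = re.compile(r'(\d+)-(\d+)')

print(type(pattern).__name__)  # Pattern
print(pattern.pattern)         # (\d+)-(\d+)
print(pattern.groups)          # 2 (number of capturing groups)

m = pattern.search('range 10-25 units')
print(type(m).__name__)        # Match
print(m.group(), m.span())     # 10-25 (6, 11)
```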
Regular Expression Modifiers - Optional Flags
Regular expressions can include optional flags to control the matching behavior. These flags can either be used individually or combined using the bitwise OR (|). For example, re.IGNORECASE | re.MULTILINE enables both case-insensitive matching and multi-line mode.
Modifiers and Examples
Modifier | Description | Example |
---|---|---|
re.IGNORECASE or re.I | Makes the matching case-insensitive. | re.match(r'apple', 'Apple', re.IGNORECASE).group() returns 'Apple'. |
re.MULTILINE or re.M | Multi-line matching: ^ and $ match at the start and end of each line, not only of the whole string. | re.findall(r'^\d+', '123\n456\n789', re.MULTILINE) returns ['123', '456', '789']. |
re.DOTALL or re.S | Makes the . match any character, including newlines. | re.match(r'a.b', 'a\nb', re.DOTALL).group() returns 'a\nb'. |
re.ASCII or re.A | Makes \w , \W , \b , \B , \d , \D , \s , \S only match ASCII characters. | re.match(r'\w+', 'Hello123', re.ASCII).group() returns 'Hello123'. |
re.VERBOSE or re.X | Ignores whitespace and comments in the pattern, allowing complex regular expressions to be more readable. | re.match(r'\d+ [a-z]+ # digits, then lowercase letters', '123abc', re.VERBOSE).group() returns '123abc'. |
Regular Expression Patterns
Pattern strings use special syntax to represent a regular expression.
Letters and digits: Represent themselves. A letter or digit in a pattern matches that same character in the string.
Escape Sequences: Letters and digits take on a different meaning when preceded by a backslash.
Punctuation: Many punctuation characters have a special meaning; they match themselves only when escaped with a backslash.
Backslashes: A literal backslash must itself be escaped with another backslash.
Since regular expressions often contain backslashes, it's best to use raw strings to represent them. For example, the pattern r'\t' is equivalent to '\\t'; both match the tab character.
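The effect of raw strings can be checked directly. A minimal sketch (the sample strings are illustrative):

```python
import re

# The raw string r'\t' and the escaped string '\\t' are the same two characters.
same = (r'\t' == '\\t')

# Both reach the regex engine as backslash + t, which matches a tab.
tab_match = re.search(r'\t', 'a\tb')

# To match a literal backslash, the pattern needs two backslashes;
# a raw string keeps that readable.
backslash_match = re.search(r'\\', 'C:\\temp')

print(same)                    # True
print(tab_match.span())        # (1, 2)
print(backslash_match.span())  # (2, 3)
```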
The table below lists special elements in regular expression syntax. Some elements' meanings may change if optional flag parameters are provided.
Pattern | Description | Example |
---|---|---|
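^ | Matches the beginning of a string (and of each line when re.MULTILINE is specified). | ^abc matches "abcdef" but not "xyzabc". |
$ | Matches the end of a string or just before a trailing newline (and the end of each line when re.MULTILINE is specified). | abc$ matches "xyzabc" but not "abcdef". |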
. | Matches any character except a newline. When re.DOTALL is specified, it matches any character including newlines. | a.b matches "aab", "a_b", "a\nb" (if re.DOTALL is used). |
[...] | Matches any one of the characters contained within the brackets. For example, [amk] matches 'a', 'm', or 'k'. | [abc] matches "a", "b", or "c". |
[^...] | Matches any character not in the brackets. For example, [^abc] matches any character except 'a', 'b', or 'c'. | [^abc] matches "d", "e", etc., but not "a", "b", or "c". |
re* | Matches 0 or more occurrences of the preceding expression. | a* matches "", "a", "aa", "aaa", etc. |
re+ | Matches 1 or more occurrences of the preceding expression. | a+ matches "a", "aa", "aaa", etc., but not "". |
re? | Matches 0 or 1 occurrence of the preceding expression, greedy. Appending ? to another quantifier (*?, +?, ??) makes that quantifier non-greedy. | a? matches "" or "a". |
re{n} | Matches exactly n occurrences of the preceding expression. | a{2} matches "aa" but not "a" or "aaa". |
re{n,} | Matches at least n occurrences of the preceding expression. | a{2,} matches "aa", "aaa", "aaaa", etc. |
re{n,m} | Matches between n and m occurrences of the preceding expression, greedy. | a{2,4} matches "aa", "aaa", "aaaa" but not "a" or "aaaaa". |
`a\|b` | Matches either a or b. | `cat\|dog` matches "cat" or "dog". |
(re) | Matches the expression within the parentheses and also creates a group. | (abc) matches "abc" and creates a group for "abc". |
(?imx) | Applies the optional flags i, m, or x to the entire pattern; it should appear at the start of the pattern. | (?i)abc matches "ABC", "AbC", etc., case-insensitive. |
(?-imx) | Disables the optional flags i, m, or x in some other regex engines. Python's re module does not accept this global form; use the scoped form (?-imx:re) instead. | (?-i:abc) matches "abc" but not "ABC", even when re.I is set. |
(?:re) | Similar to (re) , but does not create a group. | (?:abc) matches "abc" but does not create a capturing group. |
(?imx:re) | Applies optional flags i, m, or x to the pattern within the parentheses. | (?i:abc) matches "ABC", "AbC", etc., case-insensitive. |
(?-imx:re) | Disables optional flags i, m, or x within the parentheses. | (?-i:abc) matches "abc" but not "ABC", case-sensitive. |
(?#...) | Adds a comment. | (?#comment) is ignored by the regex engine. |
(?=re) | Positive lookahead. Matches if the contained expression can be matched from the current position. | a(?=b) matches "a" only if it is followed by "b". |
(?!re) | Negative lookahead. Matches if the contained expression cannot be matched from the current position. | a(?!b) matches "a" only if it is not followed by "b". |
(?>re) | Atomic group: matches the contained expression independently, without backtracking into it. Supported in Python 3.11 and later. | (?>abc) matches "abc" as an independent unit. |
\w | Matches any alphanumeric character or underscore. | \w matches "a", "1", "_". |
\W | Matches any non-alphanumeric character. | \W matches "!", "@", "#". |
\s | Matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v] . | \s matches " ", "\t", "\n". |
\S | Matches any non-whitespace character. | \S matches "a", "1", "!", but not " ". |
\d | Matches any decimal digit; equivalent to [0-9] for bytes patterns or when re.ASCII is set, and otherwise also matches other Unicode digits. | \d matches "0", "1", "9". |
\D | Matches any non-digit character. | \D matches "a", "!", " ". |
\A | Matches the start of the string. | \Aabc matches "abc" at the beginning of the string. |
\Z | Matches only at the end of the string. | abc\Z matches "abc" at the very end of the string. |
\z | Matches the end of the string in some other regex engines. Older versions of Python's re module do not recognize it; use \Z instead. | In Python, write abc\Z . |
\G | Matches the position where the previous match ended in some other regex engines. Python's re module does not support it. | Not available in Python's re module. |
\b | Matches a word boundary, i.e., the position between a word character (\w) and a non-word character (\W) or the start or end of the string. | \bword\b matches "word" in "word here" but not in "sword". |
\B | Matches a non-word boundary. | \Bword\B matches "word" in "swords" but not in "word here". |
\n, \t, etc. | Matches a newline, tab, etc. | \n matches a newline, \t matches a tab. |
\1...\9 | Matches the content of the nth capturing group. | (a)(b)\1 matches "aba" but not "abb". |
\10 | Matches the content of the nth capturing group if it exists; otherwise, it refers to an octal character code. | \10 matches the content of the 10th capturing group or the octal character code 10. |
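A few of the less obvious table entries can be tried out in a short sketch. The sample strings are illustrative, and the examples cover only non-greedy quantifiers, lookahead, and backreferences:

```python
import re

# Greedy vs. non-greedy: adding ? after a quantifier makes it match as little as possible.
html = '<b>bold</b>'
greedy = re.search(r'<.+>', html).group()
lazy = re.search(r'<.+?>', html).group()

# Lookahead: match "a" only when it is (or is not) followed by "b"; the "b" is not consumed.
followed = [m.start() for m in re.finditer(r'a(?=b)', 'ab ac ab')]
not_followed = [m.start() for m in re.finditer(r'a(?!b)', 'ab ac ab')]

# Backreference: \1 must repeat whatever group 1 captured.
doubled = re.findall(r'\b(\w+) \1\b', 'it was was a very very long day')

print(greedy)        # <b>bold</b>
print(lazy)          # <b>
print(followed)      # [0, 6]
print(not_followed)  # [3]
print(doubled)       # ['was', 'very']
```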
Regular Expression Examples
Character Matching
Example | Description |
---|---|
python | Matches "python". |
Character Classes
Example | Description |
---|---|
[Pp]ython | Matches "Python" or "python". |
rub[ye] | Matches "ruby" or "rube". |
[aeiou] | Matches any single vowel. |
[0-9] | Matches any digit, equivalent to [0123456789] . |
[a-z] | Matches any lowercase letter. |
[A-Z] | Matches any uppercase letter. |
[a-zA-Z0-9] | Matches any letter or digit. |
[^aeiou] | Matches any character except the vowels. |
[^0-9] | Matches any character except digits. |
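The character classes above can be exercised with findall() and sub(). A minimal sketch with illustrative sample strings:

```python
import re

text = 'Python 3 and python 2, ruby and rube'

names = re.findall(r'[Pp]ython', text)   # either capitalization
rubies = re.findall(r'rub[ye]', text)    # "ruby" or "rube"
digits = re.findall(r'[0-9]', text)      # any single digit

# A negated class removes everything that is not a digit.
digits_only = re.sub(r'[^0-9]', '', 'tel: 555-0199')

print(names)        # ['Python', 'python']
print(rubies)       # ['ruby', 'rube']
print(digits)       # ['3', '2']
print(digits_only)  # 5550199
```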
Special Character Classes
Example | Description |
---|---|
. | Matches any single character except \n . To match any character including \n , use the re.DOTALL flag or a pattern like [\s\S] . |
\d | Matches a digit character, equivalent to [0-9] . |
\D | Matches a non-digit character, equivalent to [^0-9] . |
\s | Matches any whitespace character, including spaces, tabs, newlines, etc., equivalent to [ \f\n\r\t\v] . |
\S | Matches any non-whitespace character, equivalent to [^ \f\n\r\t\v] . |
\w | Matches any word character, including underscores, equivalent to [A-Za-z0-9_] . |
\W | Matches any non-word character, equivalent to [^A-Za-z0-9_] . |
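The bracket equivalents in this table hold for ASCII text. For ordinary str patterns in Python 3, \w and \d also match non-ASCII letters and digits unless re.ASCII is set. A sketch of the difference, using an illustrative string written with escapes so the exact characters are unambiguous:

```python
import re

# "naive" and "cafe" with accented letters, plus the Arabic-Indic digit three.
text = 'na\u00efve caf\u00e9 \u0663'

unicode_words = re.findall(r'\w+', text)          # accented letters count as word characters
ascii_words = re.findall(r'\w+', text, re.ASCII)  # only [A-Za-z0-9_]
unicode_digits = re.findall(r'\d', text)          # any Unicode decimal digit
ascii_digits = re.findall(r'\d', text, re.ASCII)  # only [0-9]

print(len(unicode_words))  # 3
print(ascii_words)         # ['na', 've', 'caf']
print(len(unicode_digits)) # 1
print(ascii_digits)        # []
```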