Python Tutorial (34) - Regular Expressions

Regular Expressions in Python

Regular expressions are special character sequences that help you easily check whether a string matches a certain pattern.

In Python, the re module is used to work with regular expressions.

The re module provides a set of functions that allow you to perform pattern matching, searching, and replacement operations in strings.

The re module gives Python full regular expression capabilities.

This section mainly introduces commonly used regular expression handling functions in Python. If you're unfamiliar with regular expressions, you can refer to our Regular Expression Tutorial.


re.match() Function

The re.match() function attempts to match a pattern from the start of the string. If the match is successful at the start, it returns a match object; otherwise, it returns None.

Function Syntax:

re.match(pattern, string, flags=0)

Function Parameters:

  • pattern: The regular expression to match.

  • string: The string to be matched.

  • flags: Flag parameters to control the matching behavior (e.g., case sensitivity, multi-line matching, etc.). Refer to Regular Expression Modifiers - Optional Flags.

If the match is successful, the re.match() method returns a match object; otherwise, it returns None.

You can use the group(num) or groups() methods of the match object to get the matched parts of the expression.

Match Object Methods:

  • group(num=0): Returns the entire match by default (group 0), or the group with the given number. If you pass several group numbers, it returns a tuple containing the corresponding group values.

  • groups(): Returns a tuple containing all the captured groups, numbered from 1 up to the number of groups in the pattern.
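
As a minimal sketch of the two methods above (the e-mail-like string and the pattern are made up for illustration), group() can return the whole match, a single group, or several groups at once, while groups() returns every captured group:

```python
import re

# Hypothetical sample data, not taken from the examples in this tutorial
m = re.match(r'(\w+)@(\w+)\.com', 'alice@example.com')

whole = m.group()        # entire match (same as m.group(0))
user = m.group(1)        # first captured group
both = m.group(1, 2)     # several group numbers -> tuple
all_groups = m.groups()  # every captured group as a tuple

print(whole)
print(user)
print(both)
print(all_groups)
```

Because re.match() returns None when nothing matches, real code would normally check the result before calling group().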

Example

#!/usr/bin/python

import re
print(re.match('www', 'www.pmeve.com').span())  # Matches at the start
print(re.match('com', 'www.pmeve.com'))         # Does not match at the start

The output of the above example is:

(0, 3)
None

Another Example

#!/usr/bin/python3
import re

line = "Cats are smarter than dogs"

# . matches any character except a newline (\n), so .* matches any run of such characters
# (.*?) is non-greedy: it captures as few characters as possible (here, a single word)
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)

if matchObj:
    print("matchObj.group() :", matchObj.group())
    print("matchObj.group(1) :", matchObj.group(1))
    print("matchObj.group(2) :", matchObj.group(2))
else:
    print("No match!!")

The output of the above example is:

matchObj.group() : Cats are smarter than dogs
matchObj.group(1) : Cats
matchObj.group(2) : smarter

re.search() Function

The re.search() function scans the entire string and returns the first successful match.

Function Syntax:

re.search(pattern, string, flags=0)

Function Parameters:

  • pattern: The regular expression to match.

  • string: The string to be matched.

  • flags: Flag parameters to control the matching behavior (e.g., case sensitivity, multi-line matching, etc.). Refer to Regular Expression Modifiers - Optional Flags.

If the match is successful, the re.search() method returns a match object; otherwise, it returns None.

You can use the group(num) or groups() methods of the match object to get the matched parts of the expression.

Example

#!/usr/bin/python3

import re

print(re.search('www', 'www.pmeve.com').span())  # Matches at the start
print(re.search('com', 'www.pmeve.com').span())  # Matches at a later position

The output of the above example is:

(0, 3)
(10, 13)

Another Example

#!/usr/bin/python3

import re

line = "Cats are smarter than dogs"

searchObj = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)

if searchObj:
    print("searchObj.group() :", searchObj.group())
    print("searchObj.group(1) :", searchObj.group(1))
    print("searchObj.group(2) :", searchObj.group(2))
else:
    print("Nothing found!!")

The output of the above example is:

searchObj.group() : Cats are smarter than dogs
searchObj.group(1) : Cats
searchObj.group(2) : smarter

Differences Between re.match() and re.search()

The re.match() function only matches the beginning of a string. If the string does not start with the pattern, the match fails and the function returns None. On the other hand, re.search() scans the entire string until it finds a match.

Example

#!/usr/bin/python3

import re

line = "Cats are smarter than dogs"

# Using re.match
matchObj = re.match(r'dogs', line, re.M | re.I)
if matchObj:
    print("match --> matchObj.group() :", matchObj.group())
else:
    print("No match!!")

# Using re.search
matchObj = re.search(r'dogs', line, re.M | re.I)
if matchObj:
    print("search --> matchObj.group() :", matchObj.group())
else:
    print("No match!!")

Output:

No match!!
search --> matchObj.group() : dogs
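
One further detail, sketched here with a made-up three-line string: even with re.MULTILINE, re.match() still anchors only at the very start of the string, whereas re.search() with ^ can match at the start of any line.

```python
import re

text = "A\nB\nX"  # illustrative input, not from the example above

# match() only tries position 0, regardless of re.MULTILINE
at_start = re.match(r'X', text, re.MULTILINE)

# search() with ^ and re.MULTILINE can match at the start of any line
line_start = re.search(r'^X', text, re.MULTILINE)

print(at_start)
print(line_start.group(), line_start.span())
```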

Search and Replace

Python's re module provides the re.sub() function for replacing matching patterns in a string.

Syntax:

re.sub(pattern, repl, string, count=0, flags=0)

Parameters:

  • pattern: The regular expression pattern to match.

  • repl: The string to replace the matched pattern, or a function.

  • string: The original string to search and replace within.

  • count: The maximum number of replacements (default is 0, meaning replace all matches).

  • flags: Optional flags that modify the matching behavior (e.g., case sensitivity).

Example

#!/usr/bin/python3
import re

phone = "2004-959-559 # This is a phone number"

# Remove comments
num = re.sub(r'#.*$', "", phone)
print("Phone number:", num)

# Remove non-digit characters
num = re.sub(r'\D', "", phone)
print("Phone number:", num)

Output:

Phone number: 2004-959-559 
Phone number: 2004959559
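
The count parameter and group references inside a replacement string can be sketched as follows. The input strings are invented for illustration; \1, \2, and \3 in the replacement refer to the pattern's captured groups.

```python
import re

# count=1 replaces only the first match
first_only = re.sub(r'\d', '#', 'a1b2c3', count=1)

# \1, \2, \3 in the replacement refer to the captured groups
reordered = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', '2024-01-15')

print(first_only)
print(reordered)
```

Writing the replacement as a raw string keeps the backslashes in \1 and friends from being interpreted by Python itself.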

Using a Function in the repl Parameter

In the following example, the matched digits are multiplied by 2:

Example

#!/usr/bin/python

import re

# Function to multiply matched digits by 2
def double(matched):
    value = int(matched.group('value'))
    return str(value * 2)

s = 'A23G4HFD567'
print(re.sub(r'(?P<value>\d+)', double, s))

Output:

A46G8HFD1134

compile() Function

The compile() function compiles a regular expression pattern into a pattern object, whose match(), search(), and other methods can then be reused without recompiling the pattern.

Syntax:

re.compile(pattern[, flags])

Parameters:

  • pattern: The regular expression pattern as a string.

  • flags: Optional flags to modify the matching behavior, such as ignoring case or using multi-line mode.

Common Flags:

  • re.IGNORECASE or re.I: Ignore case during matching.

  • re.LOCALE or re.L: Make \w, \W, \b, \B, and case-insensitive matching dependent on the current locale. In Python 3 this flag can be used only with bytes patterns.

  • re.MULTILINE or re.M: Multi-line matching; changes the behavior of ^ and $ to match the start and end of each line.

  • re.DOTALL or re.S: Make the dot (.) match all characters, including newline.

  • re.ASCII: Make special sequences like \w, \W, \d, etc., match only ASCII characters.

  • re.VERBOSE or re.X: Ignore whitespace and comments in the pattern, making complex expressions easier to read.

You can use flags individually or combine them using the bitwise OR (|) operator. For example, re.IGNORECASE | re.MULTILINE enables both ignore-case and multi-line modes.
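
A small sketch of combining flags, using a made-up log-like string: with both flags, ^ matches at the start of every line and case is ignored; with re.IGNORECASE alone, ^ matches only at the start of the string.

```python
import re

text = "Error: disk full\nok\nERROR: timeout"  # illustrative input

both_flags = re.findall(r'^error', text, re.IGNORECASE | re.MULTILINE)
ignorecase_only = re.findall(r'^error', text, re.IGNORECASE)

print(both_flags)
print(ignorecase_only)
```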

Example

import re

pattern = re.compile(r'\d+')  # Compile a pattern to match at least one digit

m = pattern.match('one12twothree34four')  # No match at the start
print(m)  # Output: None

m = pattern.match('one12twothree34four', 3, 10)  # Start matching from '1'
print(m)  # Returns a match object
print(m.group(0))  # Output: '12'
print(m.start(0))  # Output: 3
print(m.end(0))    # Output: 5
print(m.span(0))   # Output: (3, 5)

When a match is successful, a Match object is returned. The following methods can be used on the match object:

  • group([group1, ...]): Returns the matched string(s). You can specify group numbers to get specific matched groups.

  • start([group]): Returns the start index of the substring matched by the group (group defaults to 0, the entire match).

  • end([group]): Returns the end index of the substring matched by the group (group defaults to 0, the entire match).

  • span([group]): Returns a tuple containing the start and end positions of the matched substring.


Let's Take a Look at Another Example

Example

>>> import re
>>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)  # re.I stands for case-insensitive matching
>>> m = pattern.match('Hello World Wide Web')
>>> print(m)  # Successfully matched, returns a Match object
<re.Match object; span=(0, 11), match='Hello World'>
>>> m.group(0)  # Returns the entire matched substring
'Hello World'
>>> m.span(0)  # Returns the indices of the entire matched substring
(0, 11)
>>> m.group(1)  # Returns the first group matched substring
'Hello'
>>> m.span(1)  # Returns the indices of the first group matched substring
(0, 5)
>>> m.group(2)  # Returns the second group matched substring
'World'
>>> m.span(2)  # Returns the indices of the second group matched substring
(6, 11)
>>> m.groups()  # Equivalent to (m.group(1), m.group(2), ...)
('Hello', 'World')
>>> m.group(3)  # No third group exists
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

findall Function

The findall() function finds all non-overlapping substrings in the string that match the regular expression pattern and returns them as a list. If the pattern contains more than one capturing group, it returns a list of tuples. If no match is found, it returns an empty list.

Note: match() and search() find a single match, while findall() finds all matches.

Syntax

re.findall(pattern, string, flags=0)

Or

pattern.findall(string[, pos[, endpos]])

Parameters

  • pattern: The regex pattern to match.

  • string: The string to search.

  • pos: (Optional) The starting position in the string. Default is 0.

  • endpos: (Optional) The ending position in the string. Default is the length of the string.

Example: Finding All Numbers in a String

import re

result1 = re.findall(r'\d+', 'runoob 123 google 456')

pattern = re.compile(r'\d+')  # Compile pattern to find numbers
result2 = pattern.findall('runoob 123 google 456')
result3 = pattern.findall('run88oob123google456', 0, 10)

print(result1)
print(result2)
print(result3)

Output:

['123', '456']
['123', '456']
['88', '12']

Example: Multiple Capturing Groups, Returning a List of Tuples

import re

result = re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
print(result)

Output:

[('width', '20'), ('height', '10')]
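
How capturing groups change what findall() returns can be sketched with the same input string. With no group it returns whole matches, with exactly one group it returns only that group's text, and a non-capturing group (?:...) leaves the whole match in place:

```python
import re

text = 'set width=20 and height=10'

no_group = re.findall(r'\w+=\d+', text)           # whole matches
one_group = re.findall(r'(\w+)=\d+', text)        # only the single group
non_capturing = re.findall(r'(?:\w+)=\d+', text)  # whole matches again

print(no_group)
print(one_group)
print(non_capturing)
```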

re.finditer

Similar to findall(), the finditer() function finds all substrings in the string that match the regex pattern but returns them as an iterator of match objects.

Syntax

re.finditer(pattern, string, flags=0)

Parameters

  • pattern: The regex pattern to match.

  • string: The string to search.

  • flags: Optional flags to modify matching behavior, such as case-insensitivity.

Example

import re

it = re.finditer(r"\d+", "12a32bc43jf3")
for match in it:
    print(match.group())

Output:

12
32
43
3
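
Because finditer() yields match objects rather than strings, each result also carries its position. A short sketch using the same input:

```python
import re

# Collect each matched number together with its (start, end) span
spans = [(m.group(), m.span()) for m in re.finditer(r"\d+", "12a32bc43jf3")]

for text, span in spans:
    print(text, span)
```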

re.split

The split() function splits the string wherever the regex pattern matches, returning a list of substrings.

Syntax

re.split(pattern, string[, maxsplit=0, flags=0])

Parameters

  • pattern: The regex pattern to split by.

  • string: The string to split.

  • maxsplit: (Optional) The number of splits to make. maxsplit=1 will split once. Default is 0 (no limit).

  • flags: Optional flags to modify matching behavior.

Example: Splitting on Non-Word Characters

>>> import re
>>> re.split(r'\W+', 'runoob, runoob, runoob.')
['runoob', 'runoob', 'runoob', '']
>>> re.split(r'(\W+)', ' runoob, runoob, runoob.')
['', ' ', 'runoob', ', ', 'runoob', ', ', 'runoob', '.', '']
>>> re.split(r'\W+', ' runoob, runoob, runoob.', 1)
['', 'runoob, runoob, runoob.']

Example: Splitting a String That Does Not Match the Pattern

>>> re.split('a+', 'hello world')  # The pattern matches nowhere, so no split occurs
['hello world']
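
A common use of re.split() is splitting on several delimiters at once. The sketch below uses an invented input string and passes maxsplit as a keyword argument to cap the number of splits:

```python
import re

data = 'a, b;c,  d'  # illustrative input

# Split on a comma or semicolon followed by optional whitespace
parts = re.split(r'[,;]\s*', data)

# Stop after two splits; the remainder is returned unsplit
capped = re.split(r'[,;]\s*', data, maxsplit=2)

print(parts)
print(capped)
```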

Regular Expression Objects

re.Pattern

The re.compile() function returns a pattern object (re.Pattern in Python 3; older documentation calls it RegexObject).

re.Match

  • group(): Returns the matched substring.

  • start(): Returns the starting position of the match.

  • end(): Returns the ending position of the match.

  • span(): Returns a tuple containing the (start, end) positions of the match.
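
In Python 3 the compiled pattern and the match result are instances of re.Pattern and re.Match. The sketch below assumes Python 3.8 or later, where both names are available in the re module; the sample string is made up.

```python
import re

pattern = re.compile(r'(\d+)-(\d+)')
m = pattern.search('range 10-20')  # illustrative input

is_pattern = isinstance(pattern, re.Pattern)
is_match = isinstance(m, re.Match)

print(is_pattern, is_match)
print(pattern.pattern)   # the original pattern string
print(pattern.groups)    # number of capturing groups
print(m.group(), m.start(), m.end(), m.span())
```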


Regular Expression Modifiers - Optional Flags

Regular expressions can include optional flags to control the matching behavior. These flags can either be used individually or combined using the bitwise OR (|). For example, re.IGNORECASE | re.MULTILINE enables both case-insensitive matching and multi-line mode.

Modifiers and Examples

  • re.IGNORECASE or re.I: Makes the matching case-insensitive.

import re
pattern = re.compile(r'apple', flags=re.IGNORECASE)
result = pattern.match('Apple')
print(result.group())  # Output: Apple

  • re.MULTILINE or re.M: Multi-line matching, making ^ and $ match at the start and end of each line.

import re
pattern = re.compile(r'^\d+', flags=re.MULTILINE)
text = '123\n456\n789'
result = pattern.findall(text)
print(result)  # Output: ['123', '456', '789']

  • re.DOTALL or re.S: Makes the . match any character, including newlines.

import re
pattern = re.compile(r'a.b', flags=re.DOTALL)
result = pattern.match('a\nb')
print(repr(result.group()))  # Output: 'a\nb'

  • re.ASCII: Makes \w, \W, \b, \B, \d, \D, \s, \S only match ASCII characters.

import re
pattern = re.compile(r'\w+', flags=re.ASCII)
result = pattern.match('Hello123')
print(result.group())  # Output: Hello123

  • re.VERBOSE or re.X: Ignores whitespace and comments in the pattern, allowing complex regular expressions to be more readable.

import re
pattern = re.compile(r'''
    \d+     # Match digits
    [a-z]+  # Match lowercase letters
''', flags=re.VERBOSE)
result = pattern.match('123abc')
print(result.group())  # Output: 123abc

Regular Expression Patterns

Pattern strings use special syntax to represent a regular expression.

  • Letters and digits: Represent themselves. A letter or digit in a pattern matches that same character.

  • Escape Sequences: Letters and digits have different meanings when preceded by a backslash.

  • Punctuation: Many punctuation characters have a special meaning and match themselves only when escaped.

  • Backslashes: To match a literal backslash, escape it with another backslash.

Since regular expressions often contain backslashes, it's best to use raw strings to represent them. For example, the pattern r'\t' matches the tab character and is equivalent to the regular string '\\t'.
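
The raw-string point can be checked directly. In this sketch (the Windows-style path is only an illustration), r'\t' and '\\t' are the same two-character string, and r'\\' is the pattern for one literal backslash:

```python
import re

# A raw string keeps the backslash; both sides are backslash + 't'
same_string = (r'\t' == '\\t')
lengths = (len(r'\t'), len('\t'))  # raw: 2 characters, non-raw: 1 tab character

# The regex engine interprets \t as a tab
tab_found = re.search(r'\t', 'a\tb') is not None

# r'\\' matches a single literal backslash
backslashes = re.findall(r'\\', r'C:\temp\new')

print(same_string, lengths, tab_found, len(backslashes))
```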

The table below lists special elements in regular expression syntax. Some elements' meanings may change if optional flag parameters are provided.

Each entry lists the pattern, its description, and an example.

  • ^: Matches the beginning of a string (with re.MULTILINE, also the beginning of each line). Example: ^abc matches "abcdef" but not "xyzabc".

  • $: Matches the end of a string or just before a trailing newline (with re.MULTILINE, also the end of each line). Example: abc$ matches "xyzabc" but not "abcdef".

  • .: Matches any character except a newline. When re.DOTALL is specified, it matches any character including newlines. Example: a.b matches "aab" and "a_b", and also "a\nb" if re.DOTALL is used.

  • [...]: Matches any one of the characters contained within the brackets. Example: [amk] matches "a", "m", or "k".

  • [^...]: Matches any character not in the brackets. Example: [^abc] matches "d", "e", etc., but not "a", "b", or "c".

  • re*: Matches 0 or more occurrences of the preceding expression (greedy). Example: a* matches "", "a", "aa", "aaa", etc.

  • re+: Matches 1 or more occurrences of the preceding expression (greedy). Example: a+ matches "a", "aa", "aaa", etc., but not "".

  • re?: Matches 0 or 1 occurrence of the preceding expression (greedy). Appended to another quantifier, as in *?, +?, or {n,m}?, it makes that quantifier non-greedy. Example: a? matches "" or "a".

  • re{n}: Matches exactly n occurrences of the preceding expression. Example: a{2} matches "aa" but not "a" or "aaa".

  • re{n,}: Matches at least n occurrences of the preceding expression. Example: a{2,} matches "aa", "aaa", "aaaa", etc.

  • re{n,m}: Matches between n and m occurrences of the preceding expression (greedy). Example: a{2,4} matches "aa", "aaa", "aaaa" but not "a" or "aaaaa".

  • a|b: Matches either a or b. Example: cat|dog matches "cat" or "dog".

  • (re): Matches the expression within the parentheses and also creates a group. Example: (abc) matches "abc" and creates a group for "abc".

  • (?imx): Applies the optional flags i, m, or x to the entire pattern; it should appear at the start of the pattern. Example: (?i)abc matches "ABC", "AbC", etc., case-insensitive.

  • (?-imx): Disables the optional flags i, m, or x in some other regex engines. Python's re module rejects this standalone form; flags can be turned off only with the scoped form (?-imx:re) listed below. Example: write (?-i:abc) rather than (?-i)abc.

  • (?:re): Similar to (re), but does not create a group. Example: (?:abc) matches "abc" but does not create a capturing group.

  • (?imx:re): Applies optional flags i, m, or x to the pattern within the parentheses. Example: (?i:abc) matches "ABC", "AbC", etc., case-insensitive.

  • (?-imx:re): Disables optional flags i, m, or x within the parentheses. Example: (?-i:abc) matches "abc" but not "ABC", case-sensitive.

  • (?#...): Adds a comment. Example: (?#comment) is ignored by the regex engine.

  • (?=re): Positive lookahead. Matches if the contained expression can be matched from the current position, without consuming any characters. Example: a(?=b) matches "a" only if it is followed by "b".

  • (?!re): Negative lookahead. Matches if the contained expression cannot be matched from the current position. Example: a(?!b) matches "a" only if it is not followed by "b".

  • (?>re): Atomic group. Matches the contained expression independently, without backtracking into it. Supported by Python's re module from Python 3.11. Example: (?>abc) matches "abc" as an independent unit.

  • \w: Matches any word character: a letter, a digit, or an underscore. Example: \w matches "a", "1", "_".

  • \W: Matches any non-word character. Example: \W matches "!", "@", "#".

  • \s: Matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v]. Example: \s matches " ", "\t", "\n".

  • \S: Matches any non-whitespace character. Example: \S matches "a", "1", "!", but not " ".

  • \d: Matches any decimal digit; equivalent to [0-9] for ASCII text (str patterns also match other Unicode digits unless re.ASCII is used). Example: \d matches "0", "1", "9".

  • \D: Matches any non-digit character. Example: \D matches "a", "!", " ".

  • \A: Matches only at the start of the string. Example: \Aabc matches "abc" at the beginning of the string.

  • \Z: Matches only at the end of the string. Unlike in some other regex engines, Python's \Z does not match before a trailing newline. Example: abc\Z matches "abc" at the very end of the string.

  • \z: Matches the end of the string in other regex engines. Python's re module has historically not supported it (it was added as a synonym for \Z in Python 3.14), so prefer \Z. Example: use abc\Z rather than abc\z.

  • \G: Matches the position where the previous match ended in other regex engines. It is not supported by Python's re module.

  • \b: Matches a word boundary, i.e., the position between a word character (\w) and a non-word character (\W) or the start or end of the string. Example: \bword\b matches "word" in "word here" but not in "sword".

  • \B: Matches a position that is not a word boundary. Example: \Bword\B matches "word" in "swords" but not in "word here".

  • \n, \t, etc.: Matches a newline, tab, etc. Example: \n matches a newline, \t matches a tab.

  • \1...\9: Matches the content of the nth capturing group. Example: (a)(b)\1 matches "aba" but not "abb".

  • \10: Matches the content of the 10th capturing group. In Python, a number that starts with 0 or is three octal digits long (such as \010 or \101) is read as an octal character code instead.
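
A few of the entries above are easier to see in action. The following is a minimal sketch with made-up input strings, covering greedy versus non-greedy quantifiers, a positive lookahead, and a backreference:

```python
import re

# Greedy vs. non-greedy: .* takes as much as it can, .*? as little as it can
greedy = re.findall(r'<.*>', '<a><b>')
lazy = re.findall(r'<.*?>', '<a><b>')

# Positive lookahead: names followed by "(" (the "(" itself is not consumed)
calls = re.findall(r'\w+(?=\()', 'foo(1) bar baz(2)')

# Backreference: \1 must repeat whatever group 1 captured
doubled = re.search(r'\b(\w+) \1\b', 'this is is a test')

print(greedy)
print(lazy)
print(calls)
print(doubled.group())
```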

Regular Expression Examples

Character Matching

  • python: Matches "python".

Character Classes

  • [Pp]ython: Matches "Python" or "python".

  • rub[ye]: Matches "ruby" or "rube".

  • [aeiou]: Matches any single lowercase vowel.

  • [0-9]: Matches any digit, equivalent to [0123456789].

  • [a-z]: Matches any lowercase letter.

  • [A-Z]: Matches any uppercase letter.

  • [a-zA-Z0-9]: Matches any letter or digit.

  • [^aeiou]: Matches any character except the lowercase vowels.

  • [^0-9]: Matches any character except digits.
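
The character classes above can be tried out with re.findall(). The input strings in this sketch are invented for illustration:

```python
import re

either_case = re.findall(r'[Pp]ython', 'Python and python')
ruby_or_rube = re.findall(r'rub[ye]', 'ruby rube rubx')
vowels = re.findall(r'[aeiou]', 'regular')
non_digits = re.findall(r'[^0-9]+', 'a1b22c')

print(either_case)
print(ruby_or_rube)
print(vowels)
print(non_digits)
```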

Special Character Classes

The bracket equivalents below hold for ASCII text; for str patterns, \d, \s, and \w also match non-ASCII Unicode characters unless re.ASCII is used.

  • .: Matches any single character except \n. To match any character including \n, use the re.DOTALL flag or a pattern like [\s\S].

  • \d: Matches a digit character, equivalent to [0-9].

  • \D: Matches a non-digit character, equivalent to [^0-9].

  • \s: Matches any whitespace character, including spaces, tabs, newlines, etc., equivalent to [ \f\n\r\t\v].

  • \S: Matches any non-whitespace character, equivalent to [^ \f\n\r\t\v].

  • \w: Matches any word character, including underscores, equivalent to [A-Za-z0-9_].

  • \W: Matches any non-word character, equivalent to [^A-Za-z0-9_].
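
One caveat about these equivalences: in Python 3, str patterns are Unicode-aware by default, so \w matches accented letters unless re.ASCII is given. The sketch below uses a made-up string written with escape sequences, and also shows [\s\S] as a way to match any character including a newline:

```python
import re

text = 'na\u00efve caf\u00e9'  # "naive cafe" with accented letters, illustrative input

unicode_words = re.findall(r'\w+', text)          # accented letters count as word characters
ascii_words = re.findall(r'\w+', text, re.ASCII)  # only [A-Za-z0-9_] count

# [\s\S] matches any character, newline included, without re.DOTALL
any_char = re.findall(r'[\s\S]', 'a\nb')

print(unicode_words)
print(ascii_words)
print(any_char)
```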