Python Tutorial (34) - Regular Expressions

Time： 2024-09-12 Column：Python views：303

Regular Expressions in Python

Regular expressions are special character sequences that help you easily check whether a string matches a certain pattern.

In Python, the re module is used to work with regular expressions.

The re module provides a set of functions that allow you to perform pattern matching, searching, and replacement operations in strings.

The re module gives Python full regular expression capabilities.

This section mainly introduces commonly used regular expression handling functions in Python. If you're unfamiliar with regular expressions, you can refer to our Regular Expression Tutorial.

`re.match()` Function

The re.match() function attempts to match a pattern from the start of the string. If the match is successful at the start, it returns a match object; otherwise, it returns None.

Function Syntax:

re.match(pattern, string, flags=0)

Function Parameters:

Parameter	Description
`pattern`	The regular expression to match.
`string`	The string to be matched.
`flags`	Flag parameters to control the matching behavior (e.g., case sensitivity, multi-line matching, etc.). Refer to Regular Expression Modifiers - Optional Flags.

If the match is successful, the re.match() method returns a match object; otherwise, it returns None.

You can use the group(num) or groups() methods of the match object to get the matched parts of the expression.

Match Object Methods:

Method	Description
`group(num=0)`	Returns the entire match when called with no argument or with 0; otherwise returns the text captured by group num. If you pass several group numbers, it returns a tuple containing the corresponding group values.
`groups()`	Returns a tuple containing all the captured groups. Groups are numbered from 1 to the number of groups in the pattern.
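The two methods above can be compared side by side in a minimal sketch. The sample string and variable names here are illustrative assumptions, not part of the original examples:

```python
import re

# Two capturing groups: a key and a numeric value (sample input is illustrative).
m = re.match(r'(\w+)=(\d+)', 'width=20;height=10')

whole = m.group()        # same as m.group(0): the entire match
key = m.group(1)         # text of the first captured group
pair = m.group(1, 2)     # several group numbers -> a tuple
all_groups = m.groups()  # every captured group, starting from group 1

print(whole)       # width=20
print(key)         # width
print(pair)        # ('width', '20')
print(all_groups)  # ('width', '20')
```

Note that re.match() stops after the first key=value pair because the pattern does not cover the rest of the string.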

Example

#!/usr/bin/python

import re
print(re.match('www', 'www.pmeve.com').span())  # Matches at the start
print(re.match('com', 'www.pmeve.com'))         # Does not match at the start

The output of the above example is:

(0, 3)
None

Another Example

#!/usr/bin/python3
import re

line = "Cats are smarter than dogs"

# (.*) is greedy: . matches any character except a newline (\n)
# (.*?) is non-greedy: it captures as few characters as possible
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)

if matchObj:
    print("matchObj.group() :", matchObj.group())
    print("matchObj.group(1) :", matchObj.group(1))
    print("matchObj.group(2) :", matchObj.group(2))
else:
    print("No match!!")

The output of the above example is:

matchObj.group() : Cats are smarter than dogs
matchObj.group(1) : Cats
matchObj.group(2) : smarter

`re.search()` Function

The re.search() function scans the entire string and returns the first successful match.

Function Syntax:

re.search(pattern, string, flags=0)

Function Parameters:

Parameter	Description
`pattern`	The regular expression to match.
`string`	The string to be matched.
`flags`	Flag parameters to control the matching behavior (e.g., case sensitivity, multi-line matching, etc.). Refer to Regular Expression Modifiers - Optional Flags.

If the match is successful, the re.search() method returns a match object; otherwise, it returns None.

You can use the group(num) or groups() methods of the match object to get the matched parts of the expression.

Example

#!/usr/bin/python3

import re

print(re.search('www', 'www.pmeve.com').span())  # Matches at the start
print(re.search('com', 'www.pmeve.com').span())  # Matches at a later position

The output of the above example is:

(0, 3)
(10, 13)

Another Example

#!/usr/bin/python3

import re

line = "Cats are smarter than dogs"

searchObj = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)

if searchObj:
    print("searchObj.group() :", searchObj.group())
    print("searchObj.group(1) :", searchObj.group(1))
    print("searchObj.group(2) :", searchObj.group(2))
else:
    print("Nothing found!!")

The output of the above example is:

searchObj.group() : Cats are smarter than dogs
searchObj.group(1) : Cats
searchObj.group(2) : smarter

Differences Between `re.match()` and `re.search()`

The re.match() function only matches the beginning of a string. If the string does not start with the pattern, the match fails and the function returns None. On the other hand, re.search() scans the entire string until it finds a match.

Example

#!/usr/bin/python3

import re

line = "Cats are smarter than dogs"

# Using re.match
matchObj = re.match(r'dogs', line, re.M | re.I)
if matchObj:
    print("match --> matchObj.group() :", matchObj.group())
else:
    print("No match!!")

# Using re.search
matchObj = re.search(r'dogs', line, re.M | re.I)
if matchObj:
    print("search --> matchObj.group() :", matchObj.group())
else:
    print("No match!!")

Output:

No match!!
search --> matchObj.group() : dogs

Search and Replace

Python's re module provides the re.sub() function for replacing matching patterns in a string.

Syntax:

re.sub(pattern, repl, string, count=0, flags=0)

Parameters:

pattern: The regular expression pattern to match.
repl: The string to replace the matched pattern, or a function.
string: The original string to search and replace within.
count: The maximum number of replacements (default is 0, meaning replace all matches).
flags: Optional flags that modify the matching behavior (e.g., case sensitivity).
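As a rough sketch of the count parameter and of group references inside repl, consider the following. The date strings are made-up sample data:

```python
import re

text = "2024-09-12 and 2023-01-05"

# count=1 limits the call to the first match only.
first_only = re.sub(r'\d{4}', 'YYYY', text, count=1)

# \1, \2, \3 in repl refer to the groups captured by the pattern.
reordered = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', text)

print(first_only)  # YYYY-09-12 and 2023-01-05
print(reordered)   # 12/09/2024 and 05/01/2023
```

Writing repl as a raw string keeps the backslashes in \1, \2, and \3 from being interpreted by Python before re.sub() sees them.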

Example

#!/usr/bin/python3
import re

phone = "2004-959-559 # This is a phone number"

# Remove comments
num = re.sub(r'#.*$', "", phone)
print("Phone number:", num)

# Remove non-digit characters
num = re.sub(r'\D', "", phone)
print("Phone number:", num)

Output:

Phone number: 2004-959-559
Phone number: 2004959559

Using a Function in the `repl` Parameter

In the following example, the matched digits are multiplied by 2:

Example

#!/usr/bin/python

import re

# Function to multiply matched digits by 2
def double(matched):
    value = int(matched.group('value'))
    return str(value * 2)

s = 'A23G4HFD567'
print(re.sub(r'(?P<value>\d+)', double, s))

Output:

A46G8HFD1134

`compile()` Function

The compile() function compiles a regular expression pattern into a pattern object. This object can be reused through its own methods, such as match() and search().

Syntax:

re.compile(pattern[, flags])

Parameters:

pattern: The regular expression pattern as a string.
flags: Optional flags to modify the matching behavior, such as ignoring case or using multi-line mode.

Common Flags:

re.IGNORECASE or re.I: Ignore case during matching.
re.LOCALE or re.L: Make special sequences like \w, \W, \b, etc., dependent on the current locale. In Python 3 it can be used only with bytes patterns.
re.MULTILINE or re.M: Multi-line matching; changes the behavior of ^ and $ to match the start and end of each line.
re.DOTALL or re.S: Make the dot (.) match all characters, including newline.
re.ASCII or re.A: Make special sequences like \w, \W, \d, etc., match only ASCII characters.
re.VERBOSE or re.X: Ignore whitespace and comments in the pattern, making complex expressions easier to read.

You can use flags individually or combine them using the bitwise OR (|) operator. For example, re.IGNORECASE | re.MULTILINE enables both ignore-case and multi-line modes.
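To see what combining flags changes in practice, here is a small sketch. The log-like text is an invented sample:

```python
import re

text = "Error: disk full\nerror: retry\nWarning: low memory"

# Without flags, ^ anchors only at the start of the whole string and case matters.
plain = re.findall(r'^error', text)

# re.IGNORECASE ignores case; re.MULTILINE lets ^ match at the start of every line.
combined = re.findall(r'^error', text, re.IGNORECASE | re.MULTILINE)

print(plain)     # []
print(combined)  # ['Error', 'error']
```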

Example

import re

pattern = re.compile(r'\d+')  # Compile a pattern to match at least one digit

m = pattern.match('one12twothree34four')  # No match at the start
print(m)  # Output: None

m = pattern.match('one12twothree34four', 3, 10)  # Start matching from '1'
print(m)  # Returns a match object
print(m.group(0))  # Output: '12'
print(m.start(0))  # Output: 3
print(m.end(0))    # Output: 5
print(m.span(0))   # Output: (3, 5)

When a match is successful, a Match object is returned. The following methods can be used on the match object:

group([group1, ...]): Returns the matched string(s). You can specify group numbers to get specific matched groups.
start([group]): Returns the start index of the substring matched by the group (group defaults to 0, the whole match).
end([group]): Returns the end index of the substring matched by the group (group defaults to 0, the whole match).
span([group]): Returns a tuple containing the start and end positions of the matched substring.

Let's Take a Look at Another Example

Example

>>> import re
>>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)  # re.I stands for case-insensitive matching
>>> m = pattern.match('Hello World Wide Web')
>>> print(m)  # Successfully matched, returns a Match object
<re.Match object; span=(0, 11), match='Hello World'>
>>> m.group(0)  # Returns the entire matched substring
'Hello World'
>>> m.span(0)  # Returns the indices of the entire matched substring
(0, 11)
>>> m.group(1)  # Returns the first group matched substring
'Hello'
>>> m.span(1)  # Returns the indices of the first group matched substring
(0, 5)
>>> m.group(2)  # Returns the second group matched substring
'World'
>>> m.span(2)  # Returns the indices of the second group matched substring
(6, 11)
>>> m.groups()  # Equivalent to (m.group(1), m.group(2), ...)
('Hello', 'World')
>>> m.group(3)  # No third group exists
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

`findall` Function

The findall() function finds all non-overlapping substrings in the string that match the regular expression pattern and returns them as a list. If the pattern contains one capturing group, the list holds the text of that group; if it contains several capturing groups, it returns a list of tuples. If no match is found, it returns an empty list.

Note: match() and search() find a single match, while findall() finds all matches.

Syntax

re.findall(pattern, string, flags=0)

pattern.findall(string[, pos[, endpos]])

Parameters

pattern: The regex pattern to match.
string: The string to search.
pos: (Optional) The starting position in the string. Default is 0.
endpos: (Optional) The ending position in the string. Default is the length of the string.

Example: Finding All Numbers in a String

import re

result1 = re.findall(r'\d+', 'runoob 123 google 456')

pattern = re.compile(r'\d+')  # Compile pattern to find numbers
result2 = pattern.findall('runoob 123 google 456')
result3 = pattern.findall('run88oob123google456', 0, 10)

print(result1)
print(result2)
print(result3)

Output:

['123', '456']
['123', '456']
['88', '12']

Example: Multiple Capturing Groups, Returning a List of Tuples

import re

result = re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
print(result)

Output:

[('width', '20'), ('height', '10')]
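The shape of the list returned by findall() depends on how many capturing groups the pattern has. The following sketch reuses the string from the example above with three variants of the pattern:

```python
import re

text = 'set width=20 and height=10'

no_group = re.findall(r'\w+=\d+', text)        # no groups: whole matches
one_group = re.findall(r'\w+=(\d+)', text)     # one group: only that group's text
two_groups = re.findall(r'(\w+)=(\d+)', text)  # several groups: tuples

print(no_group)    # ['width=20', 'height=10']
print(one_group)   # ['20', '10']
print(two_groups)  # [('width', '20'), ('height', '10')]
```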

`re.finditer`

Similar to findall(), the finditer() function finds all substrings in the string that match the regex pattern but returns them as an iterator of match objects.

Syntax

re.finditer(pattern, string, flags=0)

Parameters

pattern: The regex pattern to match.
string: The string to search.
flags: Optional flags to modify matching behavior, such as case-insensitivity.
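Because finditer() yields match objects rather than strings, each result also carries its position. A minimal sketch, with an illustrative variable name found:

```python
import re

text = "12a32bc43jf3"

# Each item yielded by finditer() is a match object, so span() is available.
found = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]

print(found)
# [('12', (0, 2)), ('32', (3, 5)), ('43', (7, 9)), ('3', (11, 12))]
```

This is the main practical reason to prefer finditer() over findall() when you need to know where each match occurred.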

Example

import re

it = re.finditer(r"\d+", "12a32bc43jf3")
for match in it:
    print(match.group())

Output:

12
32
43
3

`re.split`

The split() function splits the string wherever the regex pattern matches, returning a list of substrings.

Syntax

re.split(pattern, string[, maxsplit=0, flags=0])

Parameters

pattern: The regex pattern to split by.
string: The string to split.
maxsplit: (Optional) The number of splits to make. maxsplit=1 will split once. Default is 0 (no limit).
flags: Optional flags to modify matching behavior.
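The effect of maxsplit can be sketched as follows. The key = value line is an invented sample, and passing maxsplit by keyword is a stylistic choice made here for readability:

```python
import re

line = "key = value = with = equals"

# Split on '=' surrounded by optional whitespace.
all_parts = re.split(r'\s*=\s*', line)

# maxsplit=1 stops after the first split; the rest of the string stays intact.
key, rest = re.split(r'\s*=\s*', line, maxsplit=1)

print(all_parts)  # ['key', 'value', 'with', 'equals']
print(key)        # key
print(rest)       # value = with = equals
```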

Example: Splitting on Non-Word Characters

>>> import re
>>> re.split(r'\W+', 'runoob, runoob, runoob.')
['runoob', 'runoob', 'runoob', '']
>>> re.split(r'(\W+)', ' runoob, runoob, runoob.')
['', ' ', 'runoob', ', ', 'runoob', ', ', 'runoob', '.', '']
>>> re.split(r'\W+', ' runoob, runoob, runoob.', maxsplit=1)
['', 'runoob, runoob, runoob.']

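Example: Splitting a String That Does Not Match the Pattern

>>> re.split('a+', 'hello world')  # The pattern never matches, so nothing is split
['hello world']

Note: a pattern that can match the empty string, such as 'a*', behaves differently. Since Python 3.7, re.split() also splits on empty matches, so re.split('a*', 'hello world') splits between every character.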

Regular Expression Objects

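`re.Pattern`

The re.compile() function returns a pattern object of type re.Pattern (called RegexObject in Python 2).

`re.Match`

A successful match returns a match object of type re.Match (called MatchObject in Python 2), which provides these methods: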

group(): Returns the matched substring.
start(): Returns the starting position of the match.
end(): Returns the ending position of the match.
span(): Returns a tuple containing the (start, end) positions of the match.
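The positions reported by these methods index directly into the original string, as this small sketch suggests. The sample text is invented for illustration:

```python
import re

text = "order id: 4521, qty: 3"
m = re.search(r'id: (\d+)', text)

# Without arguments, the methods describe the whole match (group 0).
whole, whole_span = m.group(), m.span()

# With a group number, they describe that captured group.
number, number_span = m.group(1), m.span(1)

# The positions can be used to slice the original string.
sliced = text[m.start(1):m.end(1)]

print(whole, whole_span)    # id: 4521 (6, 14)
print(number, number_span)  # 4521 (10, 14)
print(sliced)               # 4521
```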

Regular Expression Modifiers - Optional Flags

Regular expressions can include optional flags to control the matching behavior. These flags can either be used individually or combined using the bitwise OR (|). For example, re.IGNORECASE | re.MULTILINE enables both case-insensitive matching and multi-line mode.

Modifiers and Examples

Modifier	Description	Example
`re.IGNORECASE` or `re.I`	Makes the matching case-insensitive.	`import re` `pattern = re.compile(r'apple', flags=re.IGNORECASE)` `result = pattern.match('Apple')` `print(result.group()) # Output: 'Apple'`
`re.MULTILINE` or `re.M`	Multi-line matching, affecting `^` and `$` to match the start and end of each line.	`import re` `pattern = re.compile(r'^\d+', flags=re.MULTILINE)` `text = '123\n456\n789'` `result = pattern.findall(text)` `print(result) # Output: ['123', '456', '789']`
`re.DOTALL` or `re.S`	Makes the `.` match any character, including newlines.	`import re` `pattern = re.compile(r'a.b', flags=re.DOTALL)` `result = pattern.match('a\nb')` `print(result.group()) # Output: 'a\nb'`
`re.ASCII` or `re.A`	Makes `\w`, `\W`, `\b`, `\B`, `\d`, `\D`, `\s`, `\S` match only ASCII characters.	`import re` `pattern = re.compile(r'\w+', flags=re.ASCII)` `result = pattern.match('Hello123')` `print(result.group()) # Output: 'Hello123'`
`re.VERBOSE` or `re.X`	Ignores whitespace and comments, allowing complex regular expressions to be more readable.	`import re` `pattern = re.compile(r'''` `\d+ # Match digits` `[a-z]+ # Match lowercase letters` `''', flags=re.VERBOSE)` `result = pattern.match('123abc')` `print(result.group()) # Output: '123abc'`
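Since the re.VERBOSE example is hard to read inside a table cell, here is a runnable sketch of the same idea. Combining it with re.IGNORECASE and the input '123ABC' are additions made for illustration:

```python
import re

# re.VERBOSE lets the pattern span several lines and carry comments.
pattern = re.compile(r'''
    \d+      # one or more digits
    [a-z]+   # followed by one or more letters
''', flags=re.VERBOSE | re.IGNORECASE)

result = pattern.match('123ABC')
print(result.group())  # 123ABC
```

Whitespace inside the pattern is ignored in this mode, so a literal space has to be written as an escaped space or placed inside a character class.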

Regular Expression Patterns

Pattern strings use special syntax to represent a regular expression.

Letters and digits: Represent themselves. A letter or digit in a pattern matches the same string.
Escape Sequences: Letters and digits have different meanings when preceded by a backslash.
Punctuation: Many punctuation characters, such as . * + ? ( ) [ and |, have a special meaning and match themselves only when escaped with a backslash.
Backslashes: A literal backslash must be escaped with another backslash.

Since regular expressions often contain backslashes, it's best to use raw strings to represent them. For example, the pattern r'\t' is equivalent to '\\t'; both match the tab character.
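The difference a raw string makes is easiest to see with \b, which Python itself interprets as a backspace character in an ordinary string. A minimal sketch with illustrative sample strings:

```python
import re

# The raw string and the doubled-backslash string are the same two characters.
same = (r'\t' == '\\t')

# The regex engine reads the escape \t and matches a tab character.
tab_found = bool(re.search(r'\t', 'a\tb'))

# Without the r prefix, '\b' is a backspace character, not a word boundary.
without_raw = re.findall('\bcat\b', 'cat concat')
with_raw = re.findall(r'\bcat\b', 'cat concat')

print(same, tab_found)  # True True
print(without_raw)      # []
print(with_raw)         # ['cat']
```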

The table below lists special elements in regular expression syntax. Some elements' meanings may change if optional flag parameters are provided.

Pattern	Description	Example
`^`	Matches the beginning of a string.	`^abc` matches "abcdef" but not "xyzabc".
`$`	Matches the end of a string (or just before a newline at the end of the string).	`abc$` matches "xyzabc" but not "abcdef".
`.`	Matches any character except a newline. When `re.DOTALL` is specified, it matches any character including newlines.	`a.b` matches "aab", "a_b", "a\nb" (if `re.DOTALL` is used).
`[...]`	Matches any one of the characters contained within the brackets. For example, `[amk]` matches 'a', 'm', or 'k'.	`[abc]` matches "a", "b", or "c".
`[^...]`	Matches any character not in the brackets. For example, `[^abc]` matches any character except 'a', 'b', or 'c'.	`[^abc]` matches "d", "e", etc., but not "a", "b", or "c".
`re*`	Matches 0 or more occurrences of the preceding expression.	`a*` matches "", "a", "aa", "aaa", etc.
`re+`	Matches 1 or more occurrences of the preceding expression.	`a+` matches "a", "aa", "aaa", etc., but not "".
`re?`	Matches 0 or 1 occurrence of the preceding expression (greedy). Placed after another quantifier, as in `*?` or `+?`, it makes that quantifier non-greedy.	`a?` matches "" or "a".
`re{n}`	Matches exactly n occurrences of the preceding expression.	`a{2}` matches "aa" but not "a" or "aaa".
`re{n,}`	Matches at least n occurrences of the preceding expression.	`a{2,}` matches "aa", "aaa", "aaaa", etc.
`re{n,m}`	Matches between n and m occurrences of the preceding expression, greedy.	`a{2,4}` matches "aa", "aaa", "aaaa" but not "a" or "aaaaa".
`a|b`	Matches either a or b.	`cat|dog` matches "cat" or "dog".
`(re)`	Matches the expression within the parentheses and also creates a group.	`(abc)` matches "abc" and creates a group for "abc".
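`(?imx)`	Turns on the optional flags i, m, or x. In Python this inline form applies to the whole pattern and must appear at the very start of it.	`(?i)abc` matches "ABC", "AbC", etc., case-insensitive.
`(?-imx)`	Turns off the optional flags i, m, or x. Python's re module does not accept this standalone form; use the scoped form `(?-imx:re)` instead.	`(?-i:abc)` matches "abc" but not "ABC", case-sensitive.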
`(?:re)`	Similar to `(re)`, but does not create a group.	`(?:abc)` matches "abc" but does not create a capturing group.
`(?imx:re)`	Applies optional flags i, m, or x to the pattern within the parentheses.	`(?i:abc)` matches "ABC", "AbC", etc., case-insensitive.
`(?-imx:re)`	Disables optional flags i, m, or x within the parentheses.	`(?-i:abc)` matches "abc" but not "ABC", case-sensitive.
`(?#...)`	Adds a comment.	`(?#comment)` is ignored by the regex engine.
`(?=re)`	Positive lookahead. Matches if the contained expression can be matched from the current position.	`a(?=b)` matches "a" only if it is followed by "b".
`(?!re)`	Negative lookahead. Matches if the contained expression cannot be matched from the current position.	`a(?!b)` matches "a" only if it is not followed by "b".
`(?>re)`	Atomic group: matches the contained expression independently, without backtracking into it. Supported by Python's re module from version 3.11.	`(?>abc)` matches "abc" as an independent unit.
`\w`	Matches any alphanumeric character or underscore.	`\w` matches "a", "1", "_".
`\W`	Matches any character that is not a word character (neither alphanumeric nor underscore).	`\W` matches "!", "@", "#".
`\s`	Matches any whitespace character; in ASCII mode, equivalent to `[ \t\n\r\f\v]`.	`\s` matches " ", "\t", "\n".
`\S`	Matches any non-whitespace character.	`\S` matches "a", "1", "!", but not " ".
`\d`	Matches any digit, equivalent to `[0-9]`.	`\d` matches "0", "1", "9".
`\D`	Matches any non-digit character.	`\D` matches "a", "!", " ".
`\A`	Matches the start of the string.	`\Aabc` matches "abc" at the beginning of the string.
`\Z`	Matches only at the end of the string. Unlike in Perl-style engines, Python's `\Z` does not match before a trailing newline.	`abc\Z` matches "abc" only at the very end of the string.
`\z`	Matches the end of the string in many other engines. Python's re module accepts it only from version 3.14, as a synonym for `\Z`; use `\Z` in earlier versions.	`abc\z` matches "abc" at the end of the string.
`\G`	Matches the position where the previous match ended in some other engines. Python's re module does not support it.	Not available in Python.
`\b`	Matches a word boundary, i.e., the position between a word character and a non-word character, or the start or end of the string next to a word character.	`\bword\b` matches "word" in "word here" but not in "sword".
`\B`	Matches a non-word boundary.	`\Bword\B` matches "word" in "swordsman" but not in "word here".
`\n, \t, etc.`	Matches a newline, tab, etc.	`\n` matches a newline, `\t` matches a tab.
`\1...\9`	Matches the content of the nth capturing group.	`(a)(b)\1` matches "aba" but not "abb".
`\10`	Matches the content of the 10th capturing group. In Python, a number that starts with 0 or is three octal digits long (such as `\010`) is read as an octal character code instead.	`\10` matches whatever the 10th capturing group captured.
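A few of the elements in the table are easier to understand when run. The following is a sketch covering greedy versus non-greedy quantifiers, lookahead, and backreferences; all sample strings are invented for illustration:

```python
import re

html = '<b>bold</b> and <i>italic</i>'

greedy = re.findall(r'<.+>', html)  # .+ takes as much as possible
lazy = re.findall(r'<.+?>', html)   # .+? stops at the first '>'

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

# Lookahead checks what follows without consuming it.
widths = re.findall(r'\d+(?=px)', '10px 20em 30px')
print(widths)  # ['10', '30']

# \1 must repeat exactly what group 1 captured (here: a doubled word).
doubled = re.findall(r'\b(\w+) \1\b', 'it is is a a test')
print(doubled)  # ['is', 'a']
```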

Regular Expression Examples

Character Matching

Example	Description
`python`	Matches "python".

Character Classes

Example	Description
`[Pp]ython`	Matches "Python" or "python".
`rub[ye]`	Matches "ruby" or "rube".
`[aeiou]`	Matches any single vowel.
`[0-9]`	Matches any digit, equivalent to `[0123456789]`.
`[a-z]`	Matches any lowercase letter.
`[A-Z]`	Matches any uppercase letter.
`[a-zA-Z0-9]`	Matches any letter or digit.
`[^aeiou]`	Matches any character except the vowels.
`[^0-9]`	Matches any character except digits.
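The character classes above can be tried out with findall() and sub(). This is a short sketch using a made-up sample sentence:

```python
import re

text = "Python, python, and Ruby 3.12"

names = re.findall(r'[Pp]ython', text)       # either capitalization
numbers = re.findall(r'[0-9]+', text)        # runs of digits
capitals = re.findall(r'[A-Z]', text)        # single uppercase letters
stripped = re.sub(r'[^a-zA-Z0-9]', '', text) # drop everything else

print(names)     # ['Python', 'python']
print(numbers)   # ['3', '12']
print(capitals)  # ['P', 'R']
print(stripped)  # PythonpythonandRuby312
```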

Special Character Classes

Example	Description
`.`	Matches any single character except `\n`. To match any character including `\n`, use the `re.DOTALL` flag or a pattern like `[\s\S]`.
`\d`	Matches a digit character, equivalent to `[0-9]`.
`\D`	Matches a non-digit character, equivalent to `[^0-9]`.
`\s`	Matches any whitespace character, including spaces, tabs, newlines, etc., equivalent to `[ \f\n\r\t\v]`.
`\S`	Matches any non-whitespace character, equivalent to `[^ \f\n\r\t\v]`.
`\w`	Matches any word character, including underscores, equivalent to `[A-Za-z0-9_]`.
`\W`	Matches any non-word character, equivalent to `[^A-Za-z0-9_]`.
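One caveat worth checking: for str patterns in Python 3, the bracket equivalents listed above hold exactly only with the re.ASCII flag. By default \d and \w also match Unicode digits and letters. A minimal sketch, where the sample text is an assumption chosen to contain non-ASCII characters:

```python
import re

# 'café', then ARABIC-INDIC DIGIT THREE (U+0663), then '7'
text = "caf\u00e9 \u0663 7"

unicode_words = re.findall(r'\w+', text)          # Unicode-aware by default
ascii_words = re.findall(r'\w+', text, re.ASCII)  # same as [A-Za-z0-9_]+

unicode_digits = re.findall(r'\d', text)          # any Unicode decimal digit
ascii_digits = re.findall(r'\d', text, re.ASCII)  # same as [0-9]

print(len(unicode_words), ascii_words)    # 3 ['caf', '7']
print(len(unicode_digits), ascii_digits)  # 2 ['7']
```

If input must be limited to ASCII digits or letters, either pass re.ASCII or write the bracket form explicitly.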
