Python Tutorial (33) - Example: Using regular expressions to extract URLs from a string

Extracting URLs from a Given String Using Regular Expressions

In this section, we will use a regular expression to extract URLs from a string.

Example:

import re

def find_urls(text):
    # findall() returns every non-overlapping substring that matches the pattern.
    # A raw string (r'...') lets us write \w and \d without doubling the backslashes.
    return re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', text)

# Given string containing URLs
text = 'The URL of PMeve is: https://www.pmeve.com, and the URL of Google is: https://www.google.com'
print("URLs: ", find_urls(text))

Explanation:

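  • https?://: Matches the scheme 'http://' or 'https://'. The ? makes the preceding 's' optional.

  • (?:[-\w.]|(?:%[\da-fA-F]{2}))+: Matches one or more URL-safe characters: word characters (letters, digits, underscore), hyphens, periods, or percent-encoded bytes such as %20. In a non-raw Python string literal the backslashes are doubled (\\w, \\d), but the regex engine sees \w and \d.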
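
As a rough check of the two parts described above, the sketch below compiles each one separately and tests it against a few sample strings. The names scheme and unit, and the sample inputs, are made up for this illustration.

```python
import re

# The two building blocks of the URL pattern, compiled separately.
# `scheme` and `unit` are illustrative names, not part of the original example.
scheme = re.compile(r'https?://')
unit = re.compile(r'(?:[-\w.]|(?:%[\da-fA-F]{2}))')

# fullmatch() succeeds only if the whole string matches the pattern.
for s in ['http://', 'https://', 'ftp://']:
    print(repr(s), bool(scheme.fullmatch(s)))

for s in ['a', '7', '_', '-', '.', '%20', '%2', '/']:
    print(repr(s), bool(unit.fullmatch(s)))
```

Of these samples, only 'ftp://', '%2' (a percent sign needs two hex digits after it), and '/' fail to match.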

Output of the URL extraction example:

URLs:  ['https://www.pmeve.com', 'https://www.google.com']

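This code extracts both URLs from the string. Note that the pattern does not allow '/', '?', or '=', so for a URL with a path or query string the match stops at the end of the host name.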
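
To see where the pattern stops on longer URLs, here is a minimal sketch that compares it with a broader alternative. The broader pattern (https?:// followed by any run of non-whitespace), the helper name find_full_urls, and the example.com / example.org addresses are assumptions made for illustration, not part of the original example, and this is not a full URL validator.

```python
import re

# Pattern from the tutorial: scheme plus host-like characters only.
HOST_ONLY = re.compile(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+')

# Hypothetical broader pattern: take everything up to the next whitespace.
GREEDY = re.compile(r'https?://\S+')

def find_full_urls(text):
    # Strip punctuation that commonly trails a URL inside a sentence.
    return [m.rstrip('.,;:!?') for m in GREEDY.findall(text)]

text = 'Docs: https://example.com/docs/page?id=1, mirror: http://example.org/a%20b.'
print(HOST_ONLY.findall(text))  # ['https://example.com', 'http://example.org']
print(find_full_urls(text))     # ['https://example.com/docs/page?id=1', 'http://example.org/a%20b']
```

The trade-off is that the broader pattern accepts almost anything after the scheme, so it may need further cleanup (for example, closing brackets or quotes) depending on the input text.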


Explanation of Non-capturing Group (?:x)

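(?:x) is a non-capturing group. It matches the expression x but does not store the match for later use. This is useful for grouping parts of a regular expression without affecting back-references or capturing sub-patterns. It also matters for re.findall(): when a pattern contains capturing groups, findall() returns the contents of the groups rather than the whole match, which is why the URL pattern above uses (?:...) throughout.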
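
The effect on re.findall() can be seen in a small sketch. The sample string 'ababab' and the variable names are chosen only for illustration.

```python
import re

sample = 'ababab'

# Capturing group: findall() returns what the group captured
# (its last repetition), not the whole match.
capturing = re.findall(r'(ab)+', sample)

# Non-capturing group: findall() returns the whole match.
non_capturing = re.findall(r'(?:ab)+', sample)

print(capturing)      # ['ab']
print(non_capturing)  # ['ababab']

# The compiled pattern reports how many capturing groups it has.
print(re.compile(r'(ab)+').groups)    # 1
print(re.compile(r'(?:ab)+').groups)  # 0
```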

For example, in the pattern foo{1,2}, the {1,2} applies only to the last character o, so the pattern matches "foo" or "fooo". In (?:foo){1,2}, the {1,2} applies to the entire group, so the pattern matches "foo" or "foofoo".
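
The difference in quantifier scope can be checked with a short sketch using re.fullmatch(). The helper name matches is hypothetical and exists only to keep the example compact.

```python
import re

def matches(pattern, s):
    # True only if the entire string s matches the pattern.
    return re.fullmatch(pattern, s) is not None

# {1,2} binds to the final 'o' only.
print(matches(r'foo{1,2}', 'foo'))     # True
print(matches(r'foo{1,2}', 'fooo'))    # True
print(matches(r'foo{1,2}', 'foofoo'))  # False

# {1,2} binds to the whole group 'foo'.
print(matches(r'(?:foo){1,2}', 'foo'))     # True
print(matches(r'(?:foo){1,2}', 'foofoo'))  # True
print(matches(r'(?:foo){1,2}', 'fooo'))    # False
```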