Regular Expressions


 

Regular Expressions (re module)

Regular expressions, often shortened to "regex" or "regexp," are a tool for pattern matching and text manipulation. In Python, the re module in the standard library provides support for them. A regular expression is a sequence of characters that defines a search pattern, which you can use to find, replace, or extract specific parts of strings. Fluency with the re module is a fundamental skill for data scientists, web developers, and anyone else who processes text, because it handles string operations that would be tedious to write by hand.

 

Basic Patterns (characters, quantifiers)

Basic regular expression patterns consist of literal characters and special characters that act as metacharacters or quantifiers. Literal characters match themselves directly (e.g., 'a' matches 'a'). Metacharacters have special meanings, allowing you to define more flexible patterns. Quantifiers specify how many times a character or group of characters should appear. Understanding these building blocks is crucial for writing effective regex patterns in Python for tasks like data validation or log file parsing.

 

Example 1: Literal Characters

import re

text = "The quick brown fox jumps over the lazy dog."
# This regex will match the exact literal string "fox"
pattern = r"fox" 

# re.search() attempts to find the pattern anywhere in the string
match = re.search(pattern, text)

if match:
    print(f"Match found: '{match.group()}' at position {match.start()}")
else:
    print("No match found.")

Explanation: This example demonstrates the simplest form of a regular expression: matching a literal string. The pattern = r"fox" defines that we are looking for the exact sequence of characters "fox". The r prefix before the string denotes a "raw string," which is recommended for regular expressions in Python. In a raw string, Python does not process backslashes as string escape sequences, so sequences such as \d or \b reach the regex engine unchanged. re.search() is then used to scan the text for the first occurrence of this pattern. If a match is found, match.group() returns the matched string, and match.start() gives its starting index. This is a common technique for finding specific words in text using Python regex.
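
To make the effect of the raw-string prefix concrete, the sketch below writes the same pattern with and without it. The \b word-boundary sequence and the variable names are chosen only for illustration.

```python
import re

text = "The quick brown fox jumps over the lazy dog."

# Without the r prefix, Python turns "\b" into a backspace character (\x08)
# before the regex engine sees it, so the pattern looks for backspaces.
plain_pattern = "\bfox\b"
# With the r prefix, the backslash reaches the regex engine, where \b
# means "word boundary".
raw_pattern = r"\bfox\b"

print(len(plain_pattern))              # 5 characters
print(len(raw_pattern))                # 7 characters
print(re.search(plain_pattern, text))  # None
print(re.search(raw_pattern, text))    # a match object for 'fox'
```

The two strings differ in length because the raw string keeps each backslash as its own character.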

 

Example 2: Metacharacters - Dot (.)

import re

text = "cat, cot, cut, c@t, c1t"
# The dot '.' matches any single character (except newline)
pattern = r"c.t" 

# re.findall() finds all non-overlapping matches of the pattern in the string
matches = re.findall(pattern, text)

print(f"Matches found: {matches}")

Explanation: Here, we introduce a basic metacharacter: the dot (.). In regex, . matches any single character except for a newline. The pattern r"c.t" will therefore match "cat", "cot", "cut", "c@t", and "c1t" from the text string. This illustrates how metacharacters provide flexibility beyond literal matching, making them powerful for pattern recognition and data extraction in Python. re.findall() is useful when you need to retrieve all occurrences of a pattern, not just the first one.
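
The newline exception can be lifted with the re.DOTALL flag. The following minimal sketch, using a made-up sample string, shows the difference.

```python
import re

text = "c\nt and cat"

# By default '.' does not match a newline, so only "cat" is found.
default_matches = re.findall(r"c.t", text)
# With re.DOTALL, '.' also matches "\n", so "c\nt" is found as well.
dotall_matches = re.findall(r"c.t", text, flags=re.DOTALL)

print(default_matches)  # ['cat']
print(dotall_matches)   # ['c\nt', 'cat']
```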

 

Example 3: Quantifiers - Asterisk (*)

import re

text = "abb, abbb, ab, a"
# The asterisk '*' matches the preceding character zero or more times
pattern = r"ab*"

matches = re.findall(pattern, text)

print(f"Matches found: {matches}")

Explanation: This example showcases the quantifier *. The * quantifier means "zero or more occurrences" of the character or group immediately preceding it. So, r"ab*" matches an 'a' followed by any number of 'b's: "a" (zero 'b's), "ab" (one 'b'), "abb" (two 'b's), and so on. In the given text, it finds "abb", "abbb", "ab", and "a". This is useful for flexible string matching where the number of repetitions of a character or sequence can vary, a common scenario in parsing text data.

 

Example 4: Quantifiers - Plus (+) and Question Mark (?)

import re

text = "color, colour, coor"
# The plus '+' matches the preceding character one or more times
plus_pattern = r"co+r"
# The question mark '?' matches the preceding character zero or one time
optional_pattern = r"colou?r"

plus_matches = re.findall(plus_pattern, text)
optional_matches = re.findall(optional_pattern, text)

print(f"Matches for co+r: {plus_matches}")
print(f"Matches for colou?r: {optional_matches}")

Explanation: The + quantifier matches the preceding character one or more times, so r"co+r" matches "coor" (two 'o's before the 'r') but neither "color" nor "colour", where an 'l' follows the first 'o'. The ? quantifier means "zero or one occurrence" of the preceding character, so r"colou?r" matches both "color" and "colour". This is particularly useful when dealing with variations in spelling or optional characters in a string. For example, when normalizing text data or extracting information from documents with inconsistent formatting, + and ? help keep patterns tolerant of small variations.

 

Example 5: Character Sets ([])

import re

text = "The price is $10.99, and the discount is 25%."
# A character set '[]' matches any one of the characters inside it
pattern = r"[0-9]+" 

matches = re.findall(pattern, text)

print(f"Numbers found: {matches}")

Explanation: Character sets, denoted by square brackets [], allow you to match any single character within the set. [0-9] matches any digit from 0 to 9. Combined with the + quantifier, [0-9]+ will match one or more consecutive digits. This example extracts every run of digits from the text. Note that the price 10.99 comes back as two separate matches, "10" and "99", because the dot is not part of the set. Character sets are fundamental for defining specific ranges of characters to match, which makes them useful for extracting numerical data, validating input fields, and cleaning data.
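
To keep a decimal such as 10.99 in one piece rather than as separate runs of digits, one option is to allow an optional fractional part after the integer digits. This is a sketch of one possible pattern, not the only way to do it; the (?:...) syntax is a non-capturing group, used here so that re.findall() still returns the whole match.

```python
import re

text = "The price is $10.99, and the discount is 25%."

# [0-9]+            one or more digits
# (?:\.[0-9]+)?     optionally, a literal dot followed by more digits
decimal_pattern = r"[0-9]+(?:\.[0-9]+)?"

numbers = re.findall(decimal_pattern, text)
print(f"Numbers found: {numbers}")  # ['10.99', '25']
```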

 

 

Matching and Searching (re.match(), re.search(), re.findall(), re.finditer())

The re module provides several functions for finding patterns within strings, each with a distinct behavior tailored for different use cases. Understanding the differences between re.match(), re.search(), re.findall(), and re.finditer() is critical for efficient string manipulation and data extraction in Python. These functions are at the core of regex applications in areas like web scraping, log analysis, and data parsing.

 

Example 1: re.match()

 

import re

text = "Hello, world!"
# re.match() only checks for a match at the beginning of the string
pattern = r"Hello" 

match = re.match(pattern, text)

if match:
    print(f"Match found at the beginning: '{match.group()}'")
else:
    print("No match found at the beginning.")

text_2 = "World, hello!"
match_2 = re.match(pattern, text_2)

if match_2:
    print(f"Match found at the beginning for text_2: '{match_2.group()}'")
else:
    print("No match found at the beginning for text_2.")

Explanation: re.match() attempts to match a pattern only at the beginning of the string. If the pattern is not found at the very first character, re.match() will return None. In this example, r"Hello" successfully matches the beginning of "Hello, world!", but it fails to match "World, hello!" because "Hello" is not at the start of the second string. This function is ideal for validating string prefixes or ensuring a string starts with a specific pattern.
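
Because re.match() anchors only the start, it accepts strings with trailing text after the match. When the whole string must fit the pattern, re.fullmatch() is the stricter option. The three-digit pattern and sample strings below are illustrative.

```python
import re

pattern = r"\d{3}"

prefix_ok = re.match(pattern, "123abc")         # matches '123' at the start
prefix_fail = re.match(pattern, "abc123")       # None: digits are not at the start
whole_fail = re.fullmatch(pattern, "123abc")    # None: trailing 'abc' is left over
whole_ok = re.fullmatch(pattern, "123")         # matches the entire string

print(prefix_ok, prefix_fail, whole_fail, whole_ok, sep="\n")
```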

 

Example 2: re.search()

import re

text = "The quick brown fox jumps over the lazy dog."
# re.search() scans the entire string for the first occurrence of the pattern
pattern = r"fox" 

match = re.search(pattern, text)

if match:
    print(f"First match found: '{match.group()}' at position {match.start()}")
else:
    print("No match found.")

text_2 = "A lazy dog and a quick brown fox."
match_2 = re.search(pattern, text_2)

if match_2:
    print(f"First match for text_2 found: '{match_2.group()}' at position {match_2.start()}")

Explanation: Unlike re.match(), re.search() scans the string from left to right and returns a match object for the first location where the pattern matches. In both text and text_2, the pattern r"fox" is found, even though it is not at the beginning of either string. This makes re.search() the general-purpose choice for finding a pattern anywhere in a string, a common task in data extraction and log file analysis.

 

Example 3: re.findall()

import re

text = "Emails: user1@example.com, user2@domain.org, another.user@mail.net"
# re.findall() returns a list of all non-overlapping matches
pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"

emails = re.findall(pattern, text)

print(f"Found emails: {emails}")

Explanation: re.findall() is designed to find all non-overlapping matches of a pattern in a string and return them as a list of strings. This example uses a more complex regex to extract email addresses. The pattern r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b" is a common, simplified email pattern; it does not cover every address that is technically valid. \b signifies a word boundary, ensuring we match whole email addresses. [A-Za-z0-9._%+-]+ matches the username part, @ matches the literal '@', [A-Za-z0-9.-]+ matches the domain, and \. matches a literal dot, followed by [A-Za-z]{2,} for the top-level domain (two or more letters). Note that the top-level-domain set is written [A-Za-z], not [A-Z|a-z]: inside square brackets, | is a literal pipe character rather than an "or" operator. This function is well suited to extracting multiple data points from a larger text, like parsing log files or collecting specific information from web pages.
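
One detail worth knowing: when the pattern contains capturing groups, re.findall() returns the group contents instead of the whole match. The sketch below uses a deliberately simplified address pattern (an assumption for brevity, not a recommended email validator) to show both behaviors.

```python
import re

text = "Emails: user1@example.com, user2@domain.org"

# No capturing groups: each list item is the whole match.
whole_matches = re.findall(r"\w+@\w+\.\w+", text)
# Two capturing groups: each list item is a (user, domain) tuple.
split_matches = re.findall(r"(\w+)@(\w+\.\w+)", text)

print(whole_matches)  # ['user1@example.com', 'user2@domain.org']
print(split_matches)  # [('user1', 'example.com'), ('user2', 'domain.org')]
```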

 

Example 4: re.finditer()

import re

text = "Dates: 2023-01-15, 2024-03-22, 2025-11-01"
# re.finditer() returns an iterator yielding match objects for all matches
pattern = r"\d{4}-\d{2}-\d{2}" 

date_matches = re.finditer(pattern, text)

print("Found dates with their positions:")
for match in date_matches:
    print(f"  '{match.group()}' found from index {match.start()} to {match.end()}")

Explanation: re.finditer() is similar to re.findall() but instead of returning a list of strings, it returns an iterator of match objects. Each match object provides more detailed information about the match, such as its exact starting and ending positions (match.start(), match.end()) in the original string, in addition to the matched string itself (match.group()). The pattern r"\d{4}-\d{2}-\d{2}" matches dates in the format YYYY-MM-DD. \d matches any digit, and {n} is a quantifier specifying exactly n occurrences. This is valuable when you need not only the matched text but also its precise location, often used in advanced text parsing and data annotation tasks.
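
As a small extension of the idea, match.span() returns the start and end positions as one tuple, and slicing the original string with those positions recovers the matched text. The list name below is illustrative.

```python
import re

text = "Dates: 2023-01-15, 2024-03-22, 2025-11-01"
pattern = r"\d{4}-\d{2}-\d{2}"

# Pair each matched date with its (start, end) span.
located_dates = [(m.group(), m.span()) for m in re.finditer(pattern, text)]

for date, (start, end) in located_dates:
    # The span indexes back into the original string.
    print(date, start, end, text[start:end] == date)
```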

 

Example 5: Combining re.search() with Groups

import re

text = "Name: John Doe, Age: 30, City: New York"
# Using parentheses to create capturing groups
pattern = r"Name: (\w+ \w+), Age: (\d+), City: (\w+ \w+)" 

match = re.search(pattern, text)

if match:
    # Accessing captured groups by index
    name = match.group(1)
    age = match.group(2)
    city = match.group(3)
    print(f"Extracted Information:")
    print(f"  Name: {name}")
    print(f"  Age: {age}")
    print(f"  City: {city}")
else:
    print("No matching information found.")

Explanation: This example demonstrates the power of capturing groups using parentheses () within a regex, combined with re.search(). Each set of parentheses defines a group that captures the text matched by the pattern inside them. \w+ matches one or more word characters, and \d+ matches one or more digits. After re.search() finds a match for the entire pattern, match.group(1), match.group(2), etc., are used to retrieve the text captured by each respective group. This technique is fundamental for extracting specific pieces of information from structured or semi-structured text, a common requirement in data extraction and parsing complex strings.
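
Numbered groups become hard to track as patterns grow. A possible alternative is named groups, written (?P<name>...), which can be read back by name or all at once with groupdict(). The group names here are chosen for this sketch.

```python
import re

text = "Name: John Doe, Age: 30, City: New York"
pattern = (
    r"Name: (?P<name>\w+ \w+), "
    r"Age: (?P<age>\d+), "
    r"City: (?P<city>\w+ \w+)"
)

match = re.search(pattern, text)
if match:
    print(match.group("name"))  # John Doe
    info = match.groupdict()    # all named groups as a dictionary
    print(info)
```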

 

 

Substitution (re.sub())

The re.sub() function in Python's re module is a powerful tool for string substitution using regular expressions. It allows you to replace occurrences of a pattern in a string with a specified replacement string. This function is incredibly versatile for data cleaning, text normalization, and refactoring strings based on complex patterns. Whether you're sanitizing user input, transforming data formats, or anonymizing sensitive information, re.sub() is an essential part of the Python regex toolkit.

 

Example 1: Simple String Replacement

import re

text = "Hello world, hello Python!"
# Replace all occurrences of "hello" (case-insensitive) with "hi"
pattern = r"hello"
replacement = "hi"

# re.IGNORECASE flag makes the match case-insensitive
new_text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)

print(f"Original text: '{text}'")
print(f"Modified text: '{new_text}'")

Explanation: This basic example demonstrates replacing all occurrences of a specific word. The re.sub(pattern, replacement, string, flags=re.IGNORECASE) function takes the regex pattern, the replacement string, the original text, and optional flags. The re.IGNORECASE flag ensures that "hello" and "Hello" are both matched. This is a common operation for text normalization or standardizing terminology within a dataset.
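
re.sub() also accepts a count argument that limits how many replacements are made, and the related re.subn() reports how many were made. A brief sketch with the same sample text:

```python
import re

text = "Hello world, hello Python!"

# count=1 replaces only the first match.
first_only = re.sub(r"hello", "hi", text, count=1, flags=re.IGNORECASE)
# re.subn() returns a (new_string, number_of_replacements) tuple.
all_replaced, replacements = re.subn(r"hello", "hi", text, flags=re.IGNORECASE)

print(first_only)                  # hi world, hello Python!
print(all_replaced, replacements)  # hi world, hi Python! 2
```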

 

Example 2: Replacing Digits with a Placeholder

import re

text = "User ID: 12345, Transaction: 987654"
# Replace any sequence of digits with "[REDACTED]"
pattern = r"\d+"
replacement = "[REDACTED]"

new_text = re.sub(pattern, replacement, text)

print(f"Original text: '{text}'")
print(f"Redacted text: '{new_text}'")

Explanation: This example shows how to use re.sub() for data anonymization or masking sensitive information. The pattern r"\d+" matches one or more digits. Every sequence of digits found in the text is replaced by the literal string "[REDACTED]". This is a practical application for securing data by hiding specific numerical identifiers in logs or reports.

 

Example 3: Using Backreferences in Replacement

import re

text = "Date: 2023-04-20, Time: 14:30"
# Reorder date parts from YYYY-MM-DD to DD/MM/YYYY
# Groups (1), (2), (3) capture year, month, day respectively
pattern = r"(\d{4})-(\d{2})-(\d{2})"
# \3 refers to the third captured group, \2 to the second, \1 to the first
replacement = r"\3/\2/\1"

new_text = re.sub(pattern, replacement, text)

print(f"Original text: '{text}'")
print(f"Formatted text: '{new_text}'")

Explanation: This advanced example demonstrates the power of backreferences in the replacement string. The pattern r"(\d{4})-(\d{2})-(\d{2})" uses three capturing groups to extract the year, month, and day respectively. In the replacement string, \3, \2, and \1 refer to the content captured by the third, second, and first groups. This allows you to reformat data by changing the order of matched elements, a common task in data transformation and standardization.
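
Backreferences can also be written as \g<name> for named groups or \g<1> for numbered ones. The second form helps when a digit follows the reference, since r"\10" would be read as a reference to group 10. The group names and the second sample string are assumptions made for this sketch.

```python
import re

text = "Date: 2023-04-20, Time: 14:30"

# Named groups make the replacement template self-describing.
pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
reordered = re.sub(pattern, r"\g<day>/\g<month>/\g<year>", text)
print(reordered)  # Date: 20/04/2023, Time: 14:30

# \g<1> followed by a literal 0: group 1, then the character "0".
times_ten = re.sub(r"(\d+)", r"\g<1>0", "7 apples")
print(times_ten)  # 70 apples
```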

 

Example 4: Using a Function for Replacement

import re

text = "Value1: 10, Value2: 25, Value3: 5"

# Define a function to process each match
def double_value(match):
    # group(1) is the "ValueX: " label, group(2) is the number after it
    label = match.group(1)
    doubled_value = int(match.group(2)) * 2
    # The return value replaces the whole match, so keep the label
    return f"{label}{doubled_value}"

# Pattern to capture the "ValueX: " label and the digits after it
pattern = r"(Value\d+: )(\d+)"

new_text = re.sub(pattern, double_value, text)

print(f"Original text: '{text}'")
print(f"Processed text: '{new_text}'")

Explanation: This flexible approach uses a function as the replacement argument in re.sub(). For each match found by the pattern, the double_value function is called, and its return value replaces the entire matched text. The match object passed to the function gives access to the captured groups: match.group(1) holds the "ValueX: " label and match.group(2) holds the number. Because the whole match is replaced, the function returns the label followed by the doubled number; returning only the number would drop the labels from the output. The result is "Value1: 20, Value2: 50, Value3: 10". This technique is useful for dynamic string replacement where the replacement logic depends on the matched content itself.
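
Another common use of a replacement function is a dictionary lookup keyed on the matched text. The following sketch assumes a small, made-up table of unit abbreviations.

```python
import re

# Hypothetical lookup table; the abbreviations and expansions are
# invented for this illustration.
UNIT_NAMES = {"km": "kilometers", "kg": "kilograms"}

def expand_unit(match):
    # match.group() is the abbreviation that was matched, e.g. "km"
    return UNIT_NAMES[match.group()]

text = "The route is 5 km long and the pack weighs 3 kg."
expanded = re.sub(r"\b(?:km|kg)\b", expand_unit, text)
print(expanded)
# The route is 5 kilometers long and the pack weighs 3 kilograms.
```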

 

Example 5: Removing HTML Tags

import re

html_text = "<p>This is a <b>bold</b> paragraph with <a href='#'>a link</a>.</p>"
# Pattern to match any HTML tag
pattern = r"<[^>]+>" 
replacement = "" # Replace with an empty string to remove

cleaned_text = re.sub(pattern, replacement, html_text)

print(f"Original HTML: '{html_text}'")
print(f"Cleaned text: '{cleaned_text}'")

Explanation: A practical use case for re.sub() is cleaning text by removing unwanted elements, such as HTML tags. The pattern r"<[^>]+>" matches an opening angle bracket <, followed by one or more characters that are not a closing angle bracket [^>]+, and finally a closing angle bracket >. Replacing these matches with an empty string strips the tags from simple markup like the sample above. The pattern is not a full HTML parser: a > inside an attribute value, for instance, would end the match early. This is a common quick step in web scraping and text preprocessing to extract plain content from HTML sources.

 

 

Compiling Regular Expressions (re.compile())

The re.compile() function in Python's re module allows you to pre-compile a regular expression pattern into a regex object. This is not necessary for simple, one-off searches, and because the re module caches recently compiled patterns, the module-level functions such as re.search() do not re-parse a pattern on every call. Compiling explicitly is still worthwhile when the same regex is used many times: it skips the per-call cache lookup, keeps the pattern from being evicted from the cache in programs that use many different patterns, and gives the pattern a name that can be reused throughout your code.

 

Example 1: Basic Compilation and Usage

import re

# Compile the regex pattern for "apple"
# This creates a regex object
apple_pattern = re.compile(r"apple") 

text1 = "I have an apple."
text2 = "Do you like apples?"
text3 = "No apples here."

# Use the compiled pattern's search method
match1 = apple_pattern.search(text1)
match2 = apple_pattern.search(text2)  # Matches: "apple" is a substring of "apples"
match3 = apple_pattern.search(text3)

if match1:
    print(f"Text 1: '{match1.group()}' found.")
else:
    print("Text 1: No match.")

if match2:
    print(f"Text 2: '{match2.group()}' found.")
else:
    print("Text 2: No match.")

if match3:
    print(f"Text 3: '{match3.group()}' found.")
else:
    print("Text 3: No match.")

Explanation: This example demonstrates the basic usage of re.compile(). We compile the pattern r"apple" once, creating a regex object called apple_pattern. Subsequent searches against different strings (text1, text2, text3) then use the methods of this compiled object (e.g., apple_pattern.search()). Because "apple" is a substring of "apples", all three searches succeed; a pattern such as r"\bapple\b" would be needed to match only the whole word. Compared with calling re.search(r"apple", text) repeatedly, the compiled object avoids looking the pattern up in the module's cache on each call and keeps the pattern definition in one place, which is why compiling is commonly recommended when the same pattern is reused.

 

Example 2: Compiling with Flags

import re

# Compile with IGNORECASE flag for case-insensitive matching
word_pattern = re.compile(r"python", re.IGNORECASE) 

text1 = "Python is great."
text2 = "I love python programming."
text3 = "PYTHON development."

match1 = word_pattern.search(text1)
match2 = word_pattern.search(text2)
match3 = word_pattern.search(text3)

print(f"Text 1 match: {match1.group() if match1 else 'None'}")
print(f"Text 2 match: {match2.group() if match2 else 'None'}")
print(f"Text 3 match: {match3.group() if match3 else 'None'}")

Explanation: Flags like re.IGNORECASE, re.MULTILINE, or re.DOTALL can also be passed to re.compile(). Here, re.IGNORECASE is included, making the compiled word_pattern match "python" regardless of its casing. The flag becomes part of the compiled object, so it does not have to be repeated on each search call. Compiling with flags is a standard technique when case sensitivity or other pattern-wide behaviors need to be set once for the regex.
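
Several flags can be combined with the | operator. The sketch below, using an invented three-line log string, combines re.IGNORECASE with re.MULTILINE, which makes ^ match at the start of every line rather than only at the start of the string.

```python
import re

# Hypothetical log text for illustration.
log_text = "error: disk full\nINFO: all good\nError: timeout"

# Only IGNORECASE: ^ matches at the start of the string only.
single_flag = re.compile(r"^error", re.IGNORECASE)
# IGNORECASE and MULTILINE: ^ matches at the start of each line.
combined_flags = re.compile(r"^error", re.IGNORECASE | re.MULTILINE)

first_line_only = single_flag.findall(log_text)
every_line = combined_flags.findall(log_text)

print(first_line_only)  # ['error']
print(every_line)       # ['error', 'Error']
```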

 

Example 3: Using Compiled Pattern for re.findall()

import re

text = "Colors: red, Green, BLUE, yellow, RED, blue"
# Compile a pattern to find colors, case-insensitive
color_pattern = re.compile(r"(red|green|blue)", re.IGNORECASE) 

found_colors = color_pattern.findall(text)

print(f"Original text: '{text}'")
print(f"Found colors: {found_colors}")

Explanation: This example shows how to use the findall() method of a compiled pattern. The color_pattern is compiled to find "red", "green", or "blue" (case-insensitive). Because the pattern contains a single capturing group, findall() returns the text captured by that group for each match; here the group spans the whole pattern, so the result is the matched words in their original casing: "red", "Green", "BLUE", "RED", and "blue". "yellow" is skipped because it is not one of the alternatives. This approach keeps your code cleaner by separating pattern definition from its usage, which is helpful in text mining and keyword extraction.

 

Example 4: Using Compiled Pattern for re.sub()

import re

text = "The quick brown fox. The quick red fox."
# Compile a pattern to replace "fox" with "dog", case-insensitive
replace_pattern = re.compile(r"fox", re.IGNORECASE) 
replacement_text = "dog"

new_text = replace_pattern.sub(replacement_text, text)

print(f"Original text: '{text}'")
print(f"Modified text: '{new_text}'")

Explanation: Just like search() and findall(), sub() is available as a method of a compiled pattern. Here, the replace_pattern is compiled once, and replace_pattern.sub() then performs the substitution. Because the pattern is compiled case-insensitively, it would also replace "Fox" or "FOX". Reusing one compiled object is convenient when you need to perform the same replacement across many different strings, as in large-scale text transformations and data sanitization.

 

Example 5: When to Use re.compile() (Performance Consideration)

import re
import time

long_text = "This is a long string with many words. " * 100000  # Create a very long string

# Scenario 1: Without compiling
# time.perf_counter() is the clock intended for measuring short durations
start_time_uncompiled = time.perf_counter()
for _ in range(100):
    re.search(r"many", long_text)
end_time_uncompiled = time.perf_counter()
print(f"Time taken without compiling (100 searches): {end_time_uncompiled - start_time_uncompiled:.4f} seconds")

# Scenario 2: With compiling
compiled_pattern = re.compile(r"many")
start_time_compiled = time.perf_counter()
for _ in range(100):
    compiled_pattern.search(long_text)
end_time_compiled = time.perf_counter()
print(f"Time taken with compiling (100 searches): {end_time_compiled - start_time_compiled:.4f} seconds")

Explanation: This example compares 100 re.search() calls that pass the pattern as a string with 100 calls on a pre-compiled pattern object. The difference is usually small, and the exact numbers vary between runs and machines: the re module caches recently compiled patterns, so the uncompiled loop parses the pattern only on its first iteration and afterwards pays just for a cache lookup on each call. Note also that "many" occurs near the start of long_text, so each search returns almost immediately regardless of the string's length. Pre-compiling is still a reasonable habit for patterns used inside tight loops or across many inputs, since it skips the lookup and gives the pattern a reusable name, but measure before assuming a large speedup.
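
For more repeatable measurements, the standard library's timeit module can run each variant many times. This is a minimal sketch; the repetition count and string size are arbitrary choices, and no particular result is implied.

```python
import re
import timeit

text = "This is a long string with many words. " * 1000
compiled_pattern = re.compile(r"many")

# Time 10,000 calls of each variant; timeit returns total seconds.
uncompiled_seconds = timeit.timeit(
    lambda: re.search(r"many", text), number=10_000
)
compiled_seconds = timeit.timeit(
    lambda: compiled_pattern.search(text), number=10_000
)

print(f"re.search with a string pattern: {uncompiled_seconds:.4f} s")
print(f"compiled_pattern.search:         {compiled_seconds:.4f} s")
```

Running the script several times and comparing the figures gives a better sense of the difference than a single run.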