Chapter 9: Regular Expressions in Programming Languages

Haiyue

Learning Objectives

  1. Master the use of the RegExp object in JavaScript
  2. Learn Python’s re module operations
  3. Understand Java’s Pattern and Matcher classes
  4. Master PHP’s PCRE functions
  5. Understand syntax differences between languages

9.1 Regular Expressions in JavaScript

JavaScript provides powerful regular expression support, primarily through the RegExp object and related string methods.

RegExp Object

// Two ways to create regular expressions (general form: /pattern/flags)
const regex1 = /pattern/gi;                 // Literal syntax
const regex2 = new RegExp('pattern', 'gi'); // Constructor syntax

// RegExp properties
const pattern = /hello/gi;
console.log(pattern.source);    // "hello"
console.log(pattern.flags);     // "gi"
console.log(pattern.global);    // true
console.log(pattern.ignoreCase); // true
console.log(pattern.multiline);  // false
console.log(pattern.lastIndex);  // 0

String Methods

const text = "Hello World, Hello Universe";

// match() - Returns match results
console.log(text.match(/hello/i));    // ["Hello"]
console.log(text.match(/hello/gi));   // ["Hello", "Hello"]

// search() - Returns match position
console.log(text.search(/world/i));   // 6

// replace() - Replaces matched content
console.log(text.replace(/hello/gi, 'Hi')); // "Hi World, Hi Universe"

// split() - Splits by pattern
console.log("a,b;c:d".split(/[,;:]/)); // ["a", "b", "c", "d"]

// matchAll() - ES2020 addition, returns iterator of all matches
const matches = [...text.matchAll(/(\w+)/g)];
matches.forEach(match => console.log(match[0]));

RegExp Methods

const pattern = /\d{2,4}/g;
const text = "12 345 6789";

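// test() - Tests if matches
console.log(pattern.test(text)); // true

// With the g flag, test() advances lastIndex (now 2). Reset it before
// reusing the same pattern; otherwise exec() below would skip "12".
pattern.lastIndex = 0;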

// exec() - Executes match
let match;
while ((match = pattern.exec(text)) !== null) {
    console.log(`Match: ${match[0]}, Position: ${match.index}`);
}
// Match: 12, Position: 0
// Match: 345, Position: 3
// Match: 6789, Position: 7

Practical Application Example

// Email Validator
class EmailValidator {
    constructor() {
        this.pattern = /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;
    }

    validate(email) {
        return this.pattern.test(email);
    }

    extractDomain(email) {
        const match = email.match(/@(.+)$/);
        return match ? match[1] : null;
    }
}

9.2 Python’s re Module

Python’s re module provides functions for compiling patterns and for matching, searching, replacing, and splitting text.

Basic Functions

import re

# Compile a regular expression once so it can be reused
pattern = re.compile(r'\d+')
print(pattern.findall('12 cats, 34 dogs'))  # ['12', '34']

# match() - Matches from string beginning
result = re.match(r'\d+', '123abc')
if result:
    print(result.group())  # "123"

# search() - Searches for first match in entire string
result = re.search(r'\d+', 'abc123def')
if result:
    print(result.group())  # "123"

# findall() - Finds all matches
numbers = re.findall(r'\d+', '12 cats, 34 dogs, 56 birds')
print(numbers)  # ['12', '34', '56']

# finditer() - Returns iterator of match objects
for match in re.finditer(r'\d+', '12 cats, 34 dogs'):
    print(f"Match: {match.group()}, Position: {match.start()}-{match.end()}")

# sub() - Replaces matched content
text = re.sub(r'\d+', '[number]', '12 cats, 34 dogs')
print(text)  # "[number] cats, [number] dogs"

# split() - Splits string by pattern
parts = re.split(r'[,;:]', 'a,b;c:d')
print(parts)  # ['a', 'b', 'c', 'd']
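
The difference between match() and search() is easy to miss, so here is a minimal sketch that runs both on the same inputs. It also includes fullmatch(), which the listing above does not cover but which belongs to the same re module. The helper name describe is made up for illustration.

```python
import re

DIGITS = re.compile(r'\d+')

def describe(text):
    """Report which of match/search/fullmatch succeed for the given text."""
    return {
        'match': DIGITS.match(text) is not None,          # anchored at position 0
        'search': DIGITS.search(text) is not None,        # anywhere in the string
        'fullmatch': DIGITS.fullmatch(text) is not None,  # must cover the whole string
    }

print(describe('123abc'))  # {'match': True, 'search': True, 'fullmatch': False}
print(describe('abc123'))  # {'match': False, 'search': True, 'fullmatch': False}
print(describe('123'))     # {'match': True, 'search': True, 'fullmatch': True}
```

For validation, fullmatch() is usually the safer choice because it rejects trailing characters without needing explicit anchors.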

Groups and Captures

# Group capturing
text = "John Doe, 25 years old"
pattern = r'(\w+) (\w+), (\d+) years old'
match = re.search(pattern, text)

if match:
    print(match.group(0))  # Complete match
    print(match.group(1))  # "John"
    print(match.group(2))  # "Doe"
    print(match.group(3))  # "25"
    print(match.groups())  # ('John', 'Doe', '25')

# Named groups
pattern = r'(?P<first>\w+) (?P<last>\w+), (?P<age>\d+) years old'
match = re.search(pattern, text)
if match:
    print(match.groupdict())  # {'first': 'John', 'last': 'Doe', 'age': '25'}
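
Named groups can also be referenced when replacing text. The sketch below reuses the pattern above; the helper names reformat and next_year are hypothetical and exist only for this example.

```python
import re

NAME_AGE = re.compile(r'(?P<first>\w+) (?P<last>\w+), (?P<age>\d+) years old')

def reformat(text):
    """Rewrite 'First Last, N years old' as 'Last, First (N)'."""
    # \g<name> refers to a named group inside the replacement string
    return NAME_AGE.sub(r'\g<last>, \g<first> (\g<age>)', text)

def next_year(text):
    """Increment the age, using a function as the replacement."""
    def bump(m):
        return f"{m['first']} {m['last']}, {int(m['age']) + 1} years old"
    return NAME_AGE.sub(bump, text)

print(reformat("John Doe, 25 years old"))   # Doe, John (25)
print(next_year("John Doe, 25 years old"))  # John Doe, 26 years old
```

A replacement function is the better fit when the new text has to be computed, as with the age arithmetic here.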

Compilation Options

# Common flags
flags = {
    're.IGNORECASE': re.IGNORECASE,  # or re.I, case insensitive
    're.MULTILINE': re.MULTILINE,    # or re.M, multiline mode
    're.DOTALL': re.DOTALL,          # or re.S, dot matches newline
    're.VERBOSE': re.VERBOSE,        # or re.X, verbose mode
    're.ASCII': re.ASCII,            # or re.A, ASCII mode
}

# Verbose mode example
verbose_pattern = re.compile(r'''
    ^                    # Start of string (start of each line only with re.MULTILINE)
    ([a-zA-Z0-9._%+-]+)  # Username part
    @                    # @ symbol
    ([a-zA-Z0-9.-]+)     # Domain part
    \.                   # Dot
    ([a-zA-Z]{2,})       # Top-level domain
    $                    # End of string (or just before a trailing newline)
''', re.VERBOSE)
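
To see what the flags listed above actually change, the following sketch runs the same patterns with and without them. The sample text and variable names are chosen only for illustration.

```python
import re

text = "Hello\nhello world\nHELLO"

# No flags: case-sensitive, and ^ matches only at the start of the string
no_flags = re.findall(r'^hello', text)
# re.I alone: case-insensitive, but still only the start of the string
ignore_case = re.findall(r'^hello', text, re.I)
# re.I | re.M: ^ now matches at the start of every line
both = re.findall(r'^hello', text, re.I | re.M)

print(no_flags)     # []
print(ignore_case)  # ['Hello']
print(both)         # ['Hello', 'hello', 'HELLO']

# By default the dot does not match a newline; re.S (DOTALL) allows it
dot_default = re.search(r'Hello.hello', text)
dot_all = re.search(r'Hello.hello', text, re.S)

print(dot_default)           # None
print(repr(dot_all.group())) # 'Hello\nhello'
```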

Practical Application Example

class LogAnalyzer:
    def __init__(self):
        # Log format: timestamp [level] message
        self.log_pattern = re.compile(
            r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
            r'\[(?P<level>\w+)\] '
            r'(?P<message>.*)'
        )

    def parse_log(self, log_line):
        match = self.log_pattern.match(log_line)
        if match:
            return match.groupdict()
        return None

    def filter_logs(self, log_content, level='ERROR'):
        lines = log_content.split('\n')
        filtered = []

        for line in lines:
            parsed = self.parse_log(line)
            if parsed and parsed['level'] == level:
                filtered.append(parsed)

        return filtered

# Usage example
analyzer = LogAnalyzer()
log_data = """2023-12-01 10:00:00 [INFO] Application started
2023-12-01 10:01:00 [ERROR] Database connection failed
2023-12-01 10:02:00 [WARN] Memory usage high"""

errors = analyzer.filter_logs(log_data, 'ERROR')
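print(errors)
# [{'timestamp': '2023-12-01 10:01:00', 'level': 'ERROR', 'message': 'Database connection failed'}]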
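
The same named-group pattern can drive other summaries. As a rough sketch, assuming the log format shown above, the hypothetical helper count_levels below tallies entries per level and skips lines that do not match.

```python
import re
from collections import Counter

LOG_LINE = re.compile(
    r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
    r'\[(?P<level>\w+)\] '
    r'(?P<message>.*)'
)

def count_levels(log_content):
    """Count parsed log lines per level; lines that do not match are skipped."""
    counts = Counter()
    for line in log_content.splitlines():
        match = LOG_LINE.match(line)
        if match:
            counts[match['level']] += 1
    return counts

sample = """2023-12-01 10:00:00 [INFO] Application started
2023-12-01 10:01:00 [ERROR] Database connection failed
not a log line
2023-12-01 10:02:00 [WARN] Memory usage high"""

print(dict(count_levels(sample)))  # {'INFO': 1, 'ERROR': 1, 'WARN': 1}
```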

9.3 Regular Expressions in Java

Java provides regular expression support through the java.util.regex package.

Pattern and Matcher Classes

import java.util.regex.*;

public class RegexExample {
    public static void main(String[] args) {
        // Compile pattern
        Pattern pattern = Pattern.compile("\\d+");

        // Create matcher
        String text = "12 cats, 34 dogs, 56 birds";
        Matcher matcher = pattern.matcher(text);

        // Find all matches
        while (matcher.find()) {
            System.out.println("Match: " + matcher.group() +
                             ", Position: " + matcher.start() + "-" + matcher.end());
        }

        // Replace matches
        String replaced = pattern.matcher(text).replaceAll("[number]");
        System.out.println(replaced);

        // Split string
        Pattern splitPattern = Pattern.compile("[,;:]");
        String[] parts = splitPattern.split("a,b;c:d");
        for (String part : parts) {
            System.out.println(part);
        }
    }
}

Group Operations

import java.util.regex.*;

public class GroupExample {
    public static void main(String[] args) {
        String text = "John Doe, 25 years old";
        Pattern pattern = Pattern.compile("(\\w+) (\\w+), (\\d+) years old");
        Matcher matcher = pattern.matcher(text);

        if (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            System.out.println("First name: " + matcher.group(1));
            System.out.println("Last name: " + matcher.group(2));
            System.out.println("Age: " + matcher.group(3));
            System.out.println("Group count: " + matcher.groupCount());
        }
    }
}

Utility Class

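import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexUtils {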
    // Email validation
    private static final Pattern EMAIL_PATTERN = Pattern.compile(
        "^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
    );

    // Phone validation (China)
    private static final Pattern PHONE_PATTERN = Pattern.compile("^1[3-9]\\d{9}$");

    public static boolean isValidEmail(String email) {
        return EMAIL_PATTERN.matcher(email).matches();
    }

    public static boolean isValidPhone(String phone) {
        return PHONE_PATTERN.matcher(phone).matches();
    }

    // Extract URLs
    public static List<String> extractUrls(String text) {
        Pattern urlPattern = Pattern.compile(
            "https?://[\\w\\.-]+(?:\\:[0-9]+)?(?:/[\\w\\._~:/?#\\[\\]@!$&'()*+,;=-]*)?",
            Pattern.CASE_INSENSITIVE
        );

        Matcher matcher = urlPattern.matcher(text);
        List<String> urls = new ArrayList<>();

        while (matcher.find()) {
            urls.add(matcher.group());
        }

        return urls;
    }
}

9.4 Regular Expressions in PHP

PHP provides regular expression support through its PCRE (Perl Compatible Regular Expressions) functions, all of which are prefixed with preg_.

Basic Functions

<?php
// preg_match - Execute match
$text = "Hello World 123";
if (preg_match('/\d+/', $text, $matches)) {
    echo "Match: " . $matches[0] . "\n";  // "123"
}

// preg_match_all - Execute global match
$text = "12 cats, 34 dogs, 56 birds";
preg_match_all('/\d+/', $text, $matches);
print_r($matches[0]);  // Array([0] => 12 [1] => 34 [2] => 56)

// preg_replace - Replace matches
$result = preg_replace('/\d+/', '[number]', $text);
echo $result . "\n";  // "[number] cats, [number] dogs, [number] birds"

// preg_split - Split string
$parts = preg_split('/[,;:]/', 'a,b;c:d');
print_r($parts);  // Array([0] => a [1] => b [2] => c [3] => d)

// preg_grep - Filter array
$items = array('apple', 'banana', 'cherry', 'date');
$filtered = preg_grep('/a/', $items);
print_r($filtered);  // Array([0] => apple [1] => banana [3] => date), original keys are preserved
?>

Groups and Captures

<?php
$text = "John Doe, 25 years old";
$pattern = '/(\w+) (\w+), (\d+) years old/';

if (preg_match($pattern, $text, $matches)) {
    echo "Full match: " . $matches[0] . "\n";
    echo "First name: " . $matches[1] . "\n";
    echo "Last name: " . $matches[2] . "\n";
    echo "Age: " . $matches[3] . "\n";
}

// Named capture groups (the (?<name>...) syntax has been available since PHP 5.2.2)
$pattern = '/(?<first>\w+) (?<last>\w+), (?<age>\d+) years old/';
if (preg_match($pattern, $text, $matches)) {
    echo "First name: " . $matches['first'] . "\n";
    echo "Last name: " . $matches['last'] . "\n";
    echo "Age: " . $matches['age'] . "\n";
}
?>

Utility Class

<?php
class RegexValidator {
    // Email validation
    public static function validateEmail($email) {
        $pattern = '/^[a-zA-Z0-9.!#$%&\'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/';
        return preg_match($pattern, $email) === 1;
    }

    // Phone validation (China)
    public static function validatePhone($phone) {
        $pattern = '/^1[3-9]\d{9}$/';
        return preg_match($pattern, $phone) === 1;
    }

    // Extract all links
    public static function extractLinks($text) {
        $pattern = '/https?:\/\/[\w\.-]+(?:\:[0-9]+)?(?:\/[\w\._~:\/?#\[\]@!$&\'()*+,;=-]*)?/i';
        preg_match_all($pattern, $text, $matches);
        return $matches[0];
    }

    // Format phone number
    public static function formatPhone($phone) {
        $clean = preg_replace('/\D/', '', $phone);
        if (preg_match('/^(\d{3})(\d{4})(\d{4})$/', $clean, $matches)) {
            return $matches[1] . '-' . $matches[2] . '-' . $matches[3];
        }
        return $phone;
    }
}

// Usage example
echo RegexValidator::validateEmail('test@example.com') ? 'Valid' : 'Invalid';
echo "\n";

$links = RegexValidator::extractLinks('Visit https://example.com or http://test.org');
print_r($links);

echo RegexValidator::formatPhone('13812345678') . "\n";  // 138-1234-5678
?>

9.5 Comparison of Language Differences

Syntax Differences

// JavaScript - Less escaping
const jsRegex = /\d+/g;

// Java - Requires double escaping
Pattern javaPattern = Pattern.compile("\\d+");

# Python - Use raw strings
pattern = re.compile(r'\d+')

// PHP - Requires delimiters
$phpPattern = '/\d+/';
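
The raw-string advice for Python matters most for escapes that Python itself interprets. A minimal sketch with \b, which is a backspace character in a normal string but a word boundary in a regular expression (the sample text is arbitrary):

```python
import re

text = "a word in a sentence"

# In a normal string, '\b' is the backspace character (U+0008)
plain = '\bword\b'
# In a raw string, the backslash reaches the regex engine unchanged
raw = r'\bword\b'

print(len(plain), len(raw))          # 6 8
print(re.search(plain, text))        # None
print(re.search(raw, text).group())  # word

# Doubling the backslashes, as Java requires, is equivalent to the raw string
print('\\bword\\b' == raw)           # True
```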

Modifier Differences

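// JavaScript - available flags: d, g, i, m, s, u, v, y
// (u and v are mutually exclusive, so a single pattern cannot use both)
const flags = 'gimsuy';

# Python
flags = re.I | re.M | re.S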

// Java
Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.MULTILINE)

// PHP
'/pattern/ims'
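
Python also accepts flags inline at the start of the pattern, which is closer to how PHP and JavaScript attach modifiers to the expression itself. A short sketch comparing the two spellings, using made-up sample text:

```python
import re

text = "Error: disk full\nerror: retry\nWARNING: ok"

# Flags passed as an argument
by_argument = re.compile(r'^error: (.*)$', re.I | re.M)
# The same flags written inline at the start of the pattern
by_inline = re.compile(r'(?im)^error: (.*)$')

print(by_argument.findall(text))  # ['disk full', 'retry']
print(by_inline.findall(text))    # ['disk full', 'retry']

# The compiled object reports the inline flags as well
print(bool(by_inline.flags & re.I), bool(by_inline.flags & re.M))  # True True
```

Inline flags are handy when the pattern is stored as plain text, for example in a configuration file, and there is no place to pass a flags argument.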

Feature Comparison

| Feature | JavaScript | Python | Java | PHP |
|---------|------------|--------|------|-----|
| Named groups | ✓ (ES2018) | ✓ | ✓ | ✓ |
| Lookbehind | ✓ (ES2018) | ✓ (fixed width) | ✓ | ✓ |
| Recursive patterns | ✗ | ✗ | ✗ | ✓ |
| Conditional patterns | ✗ | ✓ | ✗ | ✓ |
| Atomic groups | ✗ | ✓ (3.11+) | ✓ | ✓ |
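
A check mark in the table does not mean identical behavior everywhere. As one illustration, Python's re module supports lookbehind only when the lookbehind has a fixed width. The helper name compiles and the sample text are assumptions made for this sketch.

```python
import re

# Fixed-width lookbehind: amounts that directly follow a dollar sign
prices = re.findall(r'(?<=\$)\d+(?:\.\d{2})?', 'Cost: $15.99, qty 3, total $47')
print(prices)  # ['15.99', '47']

def compiles(pattern):
    """Return True if the re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r'(?<=\$)\d+'))     # True: the lookbehind is one character wide
print(compiles(r'(?<=\$\s*)\d+'))  # False: \s* makes the width variable
```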

9.6 Cross-Language Regular Expression Best Practices

Common Pattern Library

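// Create cross-language compatible patterns.
// Only the js entries are live RegExp objects. The other entries hold the
// pattern as text; String.raw keeps the backslashes unchanged, since a plain
// JavaScript string would silently turn '\s' into 's'.
const patterns = {
    email: {
        js: /^[^\s@]+@[^\s@]+\.[^\s@]+$/,
        python: String.raw`^[^\s@]+@[^\s@]+\.[^\s@]+$`,
        java: "^[^\\s@]+@[^\\s@]+\\.[^\\s@]+$",
        php: String.raw`/^[^\s@]+@[^\s@]+\.[^\s@]+$/`
    },

    phone: {
        js: /^\d{3}-?\d{3}-?\d{4}$/,
        python: String.raw`^\d{3}-?\d{3}-?\d{4}$`,
        java: "^\\d{3}-?\\d{3}-?\\d{4}$",
        php: String.raw`/^\d{3}-?\d{3}-?\d{4}$/`
    }
};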

Unified Validation Functions

// JavaScript version
function createValidator(patterns) {
    return {
        email: (value) => patterns.email.js.test(value),
        phone: (value) => patterns.phone.js.test(value)
    };
}

# Python version
import re

def create_validator(patterns):
    compiled = {
        key: re.compile(patterns[key]['python'])
        for key in patterns
    }

    return {
        'email': lambda value: compiled['email'].match(value) is not None,
        'phone': lambda value: compiled['phone'].match(value) is not None
    }
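
The Python validator needs the pattern table as a Python dictionary. The sketch below assumes a hypothetical PATTERNS dictionary that mirrors the table above. It also varies the function slightly so that one checker is built per key instead of hard-coding email and phone.

```python
import re

# Hypothetical Python-side copy of the pattern table
PATTERNS = {
    'email': {'python': r'^[^\s@]+@[^\s@]+\.[^\s@]+$'},
    'phone': {'python': r'^\d{3}-?\d{3}-?\d{4}$'},
}

def create_validator(patterns):
    """Compile each pattern once and return one checker function per key."""
    compiled = {key: re.compile(spec['python']) for key, spec in patterns.items()}
    # rx=rx binds the current pattern to each lambda at definition time
    return {
        key: (lambda value, rx=rx: rx.match(value) is not None)
        for key, rx in compiled.items()
    }

validator = create_validator(PATTERNS)
print(validator['email']('user@example.com'))  # True
print(validator['email']('not an email'))      # False
print(validator['phone']('555-123-4567'))      # True
print(validator['phone']('5551234'))           # False
```

With this shape, adding a new entry to the table automatically adds a matching checker.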

Summary

This chapter detailed the use of regular expressions in mainstream programming languages:

  1. JavaScript: RegExp object, string methods, ES2018 new features
  2. Python: re module, compilation options, named groups
  3. Java: Pattern/Matcher classes, group operations, precompiled static patterns
  4. PHP: PCRE functions, delimiters, rich advanced features
  5. Language Differences: Comparison of syntax, escaping, modifiers, and feature support
  6. Best Practices: Cross-language compatibility, common pattern design

Understanding these differences helps in correctly using regular expressions in different language environments and writing maintainable code.