Chapter 9: Regular Expressions in Programming Languages

Haiyue | 2025/9/1 | about 7 minutes

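Learning objectives

Use the RegExp object in JavaScript
Work with Python's re module
Get to know Java's Pattern and Matcher classes
Use PHP's PCRE functions
Understand the syntax differences between languages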

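9.1 Regular Expressions in JavaScript

JavaScript supports regular expressions through the RegExp object and a set of regex-aware string methods.

The RegExp object

// Two ways to create a regular expression
const regex1 = /pattern/gi;                 // literal syntax: /pattern/flags
const regex2 = new RegExp('pattern', 'gi'); // constructor syntax: new RegExp(pattern, flags)

// RegExp properties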
const pattern = /hello/gi;
console.log(pattern.source);    // "hello"
console.log(pattern.flags);     // "gi"
console.log(pattern.global);    // true
console.log(pattern.ignoreCase); // true
console.log(pattern.multiline);  // false
console.log(pattern.lastIndex);  // 0

String methods

const text = "Hello World, Hello Universe";

// match() - returns the match result
console.log(text.match(/hello/i));    // ["Hello"] (first match only, with index and input properties)
console.log(text.match(/hello/gi));   // ["Hello", "Hello"]

// search() - returns the index of the first match
console.log(text.search(/world/i));   // 6

// replace() - replaces matched text
console.log(text.replace(/hello/gi, 'Hi')); // "Hi World, Hi Universe"

// split() - splits on a pattern
console.log("a,b;c:d".split(/[,;:]/)); // ["a", "b", "c", "d"]

// matchAll() - added in ES2020; returns an iterator over all matches (the g flag is required)
const matches = [...text.matchAll(/(\w+)/g)];
matches.forEach(match => console.log(match[0]));

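RegExp methods

const pattern = /\d{2,4}/g;
const text = "12 345 6789";

// test() - check whether the pattern matches
console.log(pattern.test(text)); // true

// With the g flag, test() advances lastIndex (now 2).
// Reset it so that exec() starts scanning from the beginning again.
pattern.lastIndex = 0;

// exec() - run the match, returning one result per call
let match;
while ((match = pattern.exec(text)) !== null) {
    console.log(`Match: ${match[0]}, index: ${match.index}`);
}
// Match: 12, index: 0
// Match: 345, index: 3
// Match: 6789, index: 7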

Practical example

// Email validator
class EmailValidator {
    constructor() {
        this.pattern = /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;
    }
    
    validate(email) {
        return this.pattern.test(email);
    }
    
    extractDomain(email) {
        const match = email.match(/@(.+)$/);
        return match ? match[1] : null;
    }
}

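9.2 The re Module in Python

Python's re module covers compiling, matching, searching, substituting, and splitting with regular expressions.

Basic functions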

import re

# Compile a pattern once so it can be reused
pattern = re.compile(r'\d+')

# match() - matches only at the start of the string
result = re.match(r'\d+', '123abc')
if result:
    print(result.group())  # "123"

# search() - scans the whole string for the first match
result = re.search(r'\d+', 'abc123def')
if result:
    print(result.group())  # "123"

# findall() - returns all matches
numbers = re.findall(r'\d+', '12 cats, 34 dogs, 56 birds')
print(numbers)  # ['12', '34', '56']

# finditer() - returns an iterator of match objects
for match in re.finditer(r'\d+', '12 cats, 34 dogs'):
    print(f"Match: {match.group()}, span: {match.start()}-{match.end()}")

# sub() - replaces matched text
text = re.sub(r'\d+', '[number]', '12 cats, 34 dogs')
print(text)  # "[number] cats, [number] dogs"

# split() - splits a string on a pattern
parts = re.split(r'[,;:]', 'a,b;c:d')
print(parts)  # ['a', 'b', 'c', 'd']
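
The difference between match() and search() is easy to miss, so here is a minimal sketch that runs one compiled pattern through match(), search(), and fullmatch() on the same inputs. The helper name anchoring_demo is made up for this illustration.

```python
import re

DIGITS = re.compile(r'\d+')

def anchoring_demo(text: str) -> dict:
    """Report which of the three anchoring modes succeed on `text`."""
    return {
        'match': DIGITS.match(text) is not None,          # must start at index 0
        'search': DIGITS.search(text) is not None,        # may start anywhere
        'fullmatch': DIGITS.fullmatch(text) is not None,  # must cover the whole string
    }

print(anchoring_demo('123abc'))  # match and search succeed, fullmatch does not
print(anchoring_demo('abc123'))  # only search succeeds
print(anchoring_demo('123'))     # all three succeed
```

For validation tasks, fullmatch() is usually the safest of the three, because it rejects trailing text without needing explicit anchors.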

Groups and capturing

# Capturing groups
text = "John Doe, 25 years old"
pattern = r'(\w+) (\w+), (\d+) years old'
match = re.search(pattern, text)

if match:
    print(match.group(0))  # the whole match
    print(match.group(1))  # "John"
    print(match.group(2))  # "Doe"
    print(match.group(3))  # "25"
    print(match.groups())  # ('John', 'Doe', '25')

# Named groups
pattern = r'(?P<first>\w+) (?P<last>\w+), (?P<age>\d+) years old'
match = re.search(pattern, text)
if match:
    print(match.groupdict())  # {'first': 'John', 'last': 'Doe', 'age': '25'}
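
Named groups can also be used on the replacement side. The sketch below reuses the name/age pattern and rearranges the captured pieces with the \g<name> syntax. The function name reformat_person and the output layout are assumptions chosen for the example.

```python
import re

NAME_AGE = re.compile(r'(?P<first>\w+) (?P<last>\w+), (?P<age>\d+) years old')

def reformat_person(text: str) -> str:
    # \g<name> in the replacement string refers to a named group of the match
    return NAME_AGE.sub(r'\g<last>, \g<first> (\g<age>)', text)

print(reformat_person("John Doe, 25 years old"))  # Doe, John (25)
```

Text that does not match the pattern is returned unchanged, since sub() only rewrites the matched spans.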

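Compilation flags

# Common flags
flags = {
    're.IGNORECASE': re.IGNORECASE,  # or re.I, case-insensitive matching
    're.MULTILINE': re.MULTILINE,    # or re.M, ^ and $ match at every line
    're.DOTALL': re.DOTALL,          # or re.S, dot also matches newline
    're.VERBOSE': re.VERBOSE,        # or re.X, allows whitespace and comments
    're.ASCII': re.ASCII,            # or re.A, \w, \d, \s match ASCII only
}

# Verbose mode example
verbose_pattern = re.compile(r'''
    ^                    # start of string
    ([a-zA-Z0-9._%+-]+)  # user name part
    @                    # the @ sign
    ([a-zA-Z0-9.-]+)     # domain part
    \.                   # a dot
    ([a-zA-Z]{2,})       # top-level domain
    $                    # end of string
''', re.VERBOSE)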
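
Flags are combined with the bitwise OR operator. As a rough sketch, the following compiles a verbose email pattern with re.VERBOSE | re.IGNORECASE and returns the captured parts. The helper split_email is a hypothetical name, and the pattern is a simplified illustration rather than a full address validator.

```python
import re

EMAIL_PARTS = re.compile(r'''
    ^ ([a-z0-9._%+-]+)   # local part
    @
    ([a-z0-9.-]+)        # domain, without the final label
    \.
    ([a-z]{2,})          # top-level domain
    $
''', re.VERBOSE | re.IGNORECASE)

def split_email(address: str):
    """Return (local, domain, tld), or None if the address does not match."""
    m = EMAIL_PARTS.match(address)
    return m.groups() if m else None

print(split_email('User@Mail.Example.COM'))  # ('User', 'Mail.Example', 'COM')
print(split_email('not-an-email'))           # None
```

Because re.IGNORECASE is set, the character classes only need to list lowercase letters.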

Practical example

class LogAnalyzer:
    def __init__(self):
        # Log format: timestamp [level] message
        self.log_pattern = re.compile(
            r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
            r'\[(?P<level>\w+)\] '
            r'(?P<message>.*)'
        )
    
    def parse_log(self, log_line):
        match = self.log_pattern.match(log_line)
        if match:
            return match.groupdict()
        return None
    
    def filter_logs(self, log_content, level='ERROR'):
        lines = log_content.split('\n')
        filtered = []
        
        for line in lines:
            parsed = self.parse_log(line)
            if parsed and parsed['level'] == level:
                filtered.append(parsed)
        
        return filtered

# Usage example
analyzer = LogAnalyzer()
log_data = """2023-12-01 10:00:00 [INFO] Application started
2023-12-01 10:01:00 [ERROR] Database connection failed
2023-12-01 10:02:00 [WARN] Memory usage high"""

errors = analyzer.filter_logs(log_data, 'ERROR')
print(errors)
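
The same log pattern can feed simple aggregations as well as filtering. The sketch below counts lines per level with collections.Counter. The function name count_levels is an assumption for this example, and lines that do not fit the format are skipped.

```python
import re
from collections import Counter

LOG_LINE = re.compile(
    r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
    r'\[(?P<level>\w+)\] '
    r'(?P<message>.*)'
)

def count_levels(log_content: str) -> Counter:
    """Count well-formed log lines per level; malformed lines are ignored."""
    counts = Counter()
    for line in log_content.splitlines():
        m = LOG_LINE.match(line)
        if m:
            counts[m.group('level')] += 1
    return counts

sample = """2023-12-01 10:00:00 [INFO] Application started
2023-12-01 10:01:00 [ERROR] Database connection failed
this line is malformed
2023-12-01 10:02:00 [ERROR] Retry failed"""

print(count_levels(sample))
```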

9.3 Regular Expressions in Java

Java provides regular expression support through the java.util.regex package.

The Pattern and Matcher classes

import java.util.regex.*;

public class RegexExample {
    public static void main(String[] args) {
        // Compile the pattern
        Pattern pattern = Pattern.compile("\\d+");
        
        // Create a matcher
        String text = "12 cats, 34 dogs, 56 birds";
        Matcher matcher = pattern.matcher(text);
        
        // Find all matches
        while (matcher.find()) {
            System.out.println("Match: " + matcher.group() + 
                             ", span: " + matcher.start() + "-" + matcher.end());
        }
        
        // Replace matches
        String replaced = pattern.matcher(text).replaceAll("[number]");
        System.out.println(replaced);
        
        // Split a string
        Pattern splitPattern = Pattern.compile("[,;:]");
        String[] parts = splitPattern.split("a,b;c:d");
        for (String part : parts) {
            System.out.println(part);
        }
    }
}

Working with groups

public class GroupExample {
    public static void main(String[] args) {
        String text = "John Doe, 25 years old";
        Pattern pattern = Pattern.compile("(\\w+) (\\w+), (\\d+) years old");
        Matcher matcher = pattern.matcher(text);
        
        if (matcher.find()) {
            System.out.println("Whole match: " + matcher.group(0));
            System.out.println("First name: " + matcher.group(1));
            System.out.println("Last name: " + matcher.group(2));
            System.out.println("Age: " + matcher.group(3));
            System.out.println("Group count: " + matcher.groupCount());
        }
    }
}

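Utility class

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexUtils {
    // Email validation
    private static final Pattern EMAIL_PATTERN = Pattern.compile(
        "^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
    );
    
    // Mobile phone number validation (mainland China)
    private static final Pattern PHONE_PATTERN = Pattern.compile("^1[3-9]\\d{9}$");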
    
    public static boolean isValidEmail(String email) {
        return EMAIL_PATTERN.matcher(email).matches();
    }
    
    public static boolean isValidPhone(String phone) {
        return PHONE_PATTERN.matcher(phone).matches();
    }
    
    // Extract URLs
    public static List<String> extractUrls(String text) {
        Pattern urlPattern = Pattern.compile(
            "https?://[\\w\\.-]+(?:\\:[0-9]+)?(?:/[\\w\\._~:/?#\\[\\]@!$&'()*+,;=-]*)?",
            Pattern.CASE_INSENSITIVE
        );
        
        Matcher matcher = urlPattern.matcher(text);
        List<String> urls = new ArrayList<>();
        
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        
        return urls;
    }
}

9.4 Regular Expressions in PHP

PHP exposes regular expressions through its PCRE (Perl Compatible Regular Expressions) functions.

Basic functions

<?php
// preg_match - perform a single match
$text = "Hello World 123";
if (preg_match('/\d+/', $text, $matches)) {
    echo "Match: " . $matches[0] . "\n";  // "123"
}

// preg_match_all - perform a global match
$text = "12 cats, 34 dogs, 56 birds";
preg_match_all('/\d+/', $text, $matches);
print_r($matches[0]);  // Array([0] => 12 [1] => 34 [2] => 56)

// preg_replace - replace matches
$result = preg_replace('/\d+/', '[number]', $text);
echo $result . "\n";  // "[number] cats, [number] dogs, [number] birds"

// preg_split - split a string
$parts = preg_split('/[,;:]/', 'a,b;c:d');
print_r($parts);  // Array([0] => a [1] => b [2] => c [3] => d)

// preg_grep - filter an array
$items = array('apple', 'banana', 'cherry', 'date');
$filtered = preg_grep('/a/', $items);
print_r($filtered);  // items containing the letter 'a' (original keys are preserved)
?>

Groups and capturing

<?php
$text = "John Doe, 25 years old";
$pattern = '/(\w+) (\w+), (\d+) years old/';

if (preg_match($pattern, $text, $matches)) {
    echo "Whole match: " . $matches[0] . "\n";
    echo "First name: " . $matches[1] . "\n";
    echo "Last name: " . $matches[2] . "\n";
    echo "Age: " . $matches[3] . "\n";
}

// Named capture groups (the (?<name>...) syntax has been accepted since PHP 5.2.2)
$pattern = '/(?<first>\w+) (?<last>\w+), (?<age>\d+) years old/';
if (preg_match($pattern, $text, $matches)) {
    echo "First name: " . $matches['first'] . "\n";
    echo "Last name: " . $matches['last'] . "\n";
    echo "Age: " . $matches['age'] . "\n";
}
?>

Utility class

<?php
class RegexValidator {
    // Email validation
    public static function validateEmail($email) {
        $pattern = '/^[a-zA-Z0-9.!#$%&\'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/';
        return preg_match($pattern, $email) === 1;
    }
    
    // Mobile phone number validation (mainland China)
    public static function validatePhone($phone) {
        $pattern = '/^1[3-9]\d{9}$/';
        return preg_match($pattern, $phone) === 1;
    }
    
    // Extract all links
    public static function extractLinks($text) {
        $pattern = '/https?:\/\/[\w\.-]+(?:\:[0-9]+)?(?:\/[\w\._~:\/?#\[\]@!$&\'()*+,;=-]*)?/i';
        preg_match_all($pattern, $text, $matches);
        return $matches[0];
    }
    
    // Format a phone number
    public static function formatPhone($phone) {
        $clean = preg_replace('/\D/', '', $phone);
        if (preg_match('/^(\d{3})(\d{4})(\d{4})$/', $clean, $matches)) {
            return $matches[1] . '-' . $matches[2] . '-' . $matches[3];
        }
        return $phone;
    }
}

// Usage example
echo RegexValidator::validateEmail('test@example.com') ? 'valid' : 'invalid';
echo "\n";

$links = RegexValidator::extractLinks('Visit https://example.com or http://test.org');
print_r($links);

echo RegexValidator::formatPhone('13812345678') . "\n";  // 138-1234-5678
?>

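9.5 Comparing the Languages

Syntax differences

// JavaScript - regex literal, little escaping needed
const jsRegex = /\d+/g;

// Java - backslashes must be doubled inside string literals
Pattern javaPattern = Pattern.compile("\\d+");

# Python - use a raw string
pattern = re.compile(r'\d+')

// PHP - the pattern needs delimiters
$phpPattern = '/\d+/';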
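
To make the escaping difference concrete, here is a small sketch comparing a Python raw string with the doubled-backslash form that a Java string literal requires. It also shows re.escape() for treating user text literally. The function name find_literal is hypothetical.

```python
import re

raw_form = r'\d+'       # raw string: the backslash reaches the regex engine untouched
escaped_form = '\\d+'   # ordinary string: the backslash is doubled, as in a Java literal

# Both spellings produce the same three-character pattern: backslash, d, plus
print(raw_form == escaped_form)  # True
print(len(raw_form))             # 3

def find_literal(needle: str, haystack: str) -> list:
    """Find `needle` as plain text, escaping any regex metacharacters in it."""
    return re.findall(re.escape(needle), haystack)

print(find_literal('a.b', 'a.b axb a.b'))  # ['a.b', 'a.b']
```

Without re.escape(), the dot in 'a.b' would act as a wildcard and 'axb' would match too.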

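Flag differences

// JavaScript - the available flags are d, g, i, m, s, u, v, y (u and v cannot be combined)
const flags = 'gimsuy';

# Python
flags = re.I | re.M | re.S

// Java
Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.MULTILINE)

// PHP - modifiers follow the closing delimiter
'/pattern/ims'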
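
One way to reduce these differences is to put the flags inside the pattern. Inline flags such as (?i) and (?m) at the start of a pattern are accepted by Python, Java, and PCRE alike, while JavaScript has traditionally relied on trailing flags. A minimal Python sketch, with TEXT and line_starts as illustrative names:

```python
import re

TEXT = 'Hello\nHELLO\nsay hello'

def line_starts(pattern: str, flags: int = 0) -> list:
    """Return every match of `pattern` in TEXT."""
    return re.findall(pattern, TEXT, flags)

inline = line_starts(r'(?im)^hello')             # flags embedded in the pattern
explicit = line_starts(r'^hello', re.I | re.M)   # flags passed as an argument

print(inline)    # ['Hello', 'HELLO']
print(explicit)  # ['Hello', 'HELLO']
```

Python requires global inline flags to appear at the very start of the pattern.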

Feature comparison

| Feature | JavaScript | Python | Java | PHP |
|------|------------|--------|------|-----|
| Named groups | ✓ (ES2018) | ✓ | ✓ | ✓ |
| Lookbehind | ✓ (ES2018) | ✓ (fixed width) | ✓ | ✓ |
| Recursive patterns | ✗ | ✗ | ✗ | ✓ |
| Conditional patterns | ✗ | ✓ | ✗ | ✓ |
| Atomic groups | ✗ | ✓ (3.11+) | ✓ | ✓ |
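
Two of these features can be tried directly in Python's built-in re module, which accepts the conditional syntax (?(1)...) and fixed-width lookbehind. The sketch below is illustrative only, and the names is_balanced_number and prices are assumptions.

```python
import re

# Group 1 optionally captures "("; the conditional (?(1)\)) then demands
# a closing ")" only when group 1 actually took part in the match.
BALANCED = re.compile(r'(\()?\d+(?(1)\))')

def is_balanced_number(text: str) -> bool:
    return BALANCED.fullmatch(text) is not None

# Fixed-width lookbehind: digits that are directly preceded by a dollar sign
PRICE = re.compile(r'(?<=\$)\d+')

def prices(text: str) -> list:
    return PRICE.findall(text)

print(is_balanced_number('(123)'))        # True
print(is_balanced_number('123'))          # True
print(is_balanced_number('(123'))         # False
print(prices('cost $45, qty 3, fee $7'))  # ['45', '7']
```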

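9.6 Best Practices for Cross-Language Regular Expressions

A shared pattern library

// The same pattern as it is written in each language.
// The js entries are live RegExp objects; the others are kept as text via
// String.raw, showing the literal exactly as it appears in that language's source.
const patterns = {
    email: {
        js: /^[^\s@]+@[^\s@]+\.[^\s@]+$/,
        python: String.raw`r'^[^\s@]+@[^\s@]+\.[^\s@]+$'`,
        java: String.raw`"^[^\\s@]+@[^\\s@]+\\.[^\\s@]+$"`,
        php: String.raw`'/^[^\s@]+@[^\s@]+\.[^\s@]+$/'`
    },
    
    phone: {
        js: /^\d{3}-?\d{3}-?\d{4}$/,
        python: String.raw`r'^\d{3}-?\d{3}-?\d{4}$'`,
        java: String.raw`"^\\d{3}-?\\d{3}-?\\d{4}$"`,
        php: String.raw`'/^\d{3}-?\d{3}-?\d{4}$/'`
    }
};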

Unified validation functions

// JavaScript version
function createValidator(patterns) {
    return {
        email: (value) => patterns.email.js.test(value),
        phone: (value) => patterns.phone.js.test(value)
    };
}

# Python version
import re

def create_validator(patterns):
    compiled = {
        key: re.compile(patterns[key]['python']) 
        for key in patterns
    }
    
    return {
        'email': lambda value: compiled['email'].match(value) is not None,
        'phone': lambda value: compiled['phone'].match(value) is not None
    }
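
One subtlety in the Python version: with match() and a trailing $, a string that ends in a newline still passes, because $ also matches just before a final newline. A possible variant, sketched here with the assumed names PATTERNS and build_validators, drops the anchors and uses fullmatch() instead:

```python
import re

# Pattern sources without ^ and $; fullmatch() supplies the anchoring
PATTERNS = {
    'email': r'[^\s@]+@[^\s@]+\.[^\s@]+',
    'phone': r'\d{3}-?\d{3}-?\d{4}',
}

def build_validators(patterns: dict) -> dict:
    """Compile each pattern once and return one validator function per key."""
    compiled = {name: re.compile(source) for name, source in patterns.items()}
    # rx=rx binds the current regex to each lambda (avoids the late-binding pitfall)
    return {
        name: (lambda value, rx=rx: rx.fullmatch(value) is not None)
        for name, rx in compiled.items()
    }

validators = build_validators(PATTERNS)
print(validators['email']('test@example.com'))  # True
print(validators['phone']('555-123-4567'))      # True
print(validators['phone']('555-123-4567\n'))    # False
```

Building the validators from the dictionary keys also means that a new pattern gets a validator without further code changes.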

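Summary

This chapter covered how regular expressions are used in four mainstream programming languages:

JavaScript: the RegExp object, string methods, and the ES2018 and ES2020 additions
Python: the re module, compilation flags, named groups
Java: the Pattern and Matcher classes, grouping, precompiled static patterns
PHP: PCRE functions, delimiters, and advanced features such as recursive and conditional patterns
Language differences: syntax, escaping, flags, and feature support compared
Best practices: cross-language compatibility and shared pattern design

Understanding these differences helps you use regular expressions correctly in each language and write code that stays maintainable.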