Chapter 5: Grouping and Capturing

Haiyue | 2025/9/1 | about 9 min read

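Learning objectives

Understand the concept and syntax of grouping ()
Learn to use and reference capture groups
Learn to use non-capturing groups (?:)
Learn the syntax and uses of named capture groups
Understand what backreferences are and how to use them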

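5.1 Overview of Grouping

Grouping is a core concept in regular expressions. It lets us:

Treat several characters as a single unit
Apply a quantifier to a group of characters
Capture the matched text for later use
Build more complex patterns

Basic syntax of groups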

(pattern)   # capturing group
(?:pattern) # non-capturing group
(?<name>pattern) # named capturing group (supported by some engines)

5.2 Basic Groups ()

The most basic group uses parentheses to combine several characters into one unit.

Basic usage

// Treat "abc" as a single unit
const pattern1 = /(abc)/;
console.log(pattern1.test("abcdef")); // true

// Apply a quantifier to the group
const pattern2 = /(abc)+/;
console.log("abcabcabc".match(pattern2)[0]); // "abcabcabc"

// An optional group
const pattern3 = /(www\.)?example\.com/;
console.log(pattern3.test("example.com"));     // true
console.log(pattern3.test("www.example.com")); // true
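
One detail worth knowing when a quantifier is applied to a capturing group: the group keeps only the text from its last iteration. The sketch below uses a made-up sample string to show this, along with one possible way to collect every repetition instead.

```javascript
// A repeated capturing group is overwritten on every iteration.
const repeated = "abcabcabd".match(/(ab[cd])+/);
console.log(repeated[0]); // "abcabcabd" (the whole run)
console.log(repeated[1]); // "abd" (only the last iteration)

// To keep every repetition, match the unit globally instead.
const pieces = "abcabcabd".match(/ab[cd]/g);
console.log(pieces); // ["abc", "abc", "abd"]
```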

Nested groups

// Nested group example
const datePattern = /(\d{4})-(\d{2})-(\d{2})/;
const timePattern = /(\d{2}):(\d{2}):(\d{2})/;

// Combine date and time
const datetimePattern = /((\d{4})-(\d{2})-(\d{2})) ((\d{2}):(\d{2}):(\d{2}))/;

const datetime = "2023-12-01 14:30:45";
const match = datetime.match(datetimePattern);

console.log(match[0]); // "2023-12-01 14:30:45" (full match)
console.log(match[1]); // "2023-12-01" (date group)
console.log(match[5]); // "14:30:45" (time group)
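
The time group is number 5 because groups are numbered by the position of their opening parenthesis, counted from left to right. As a rough illustration (the variable names are chosen for this sketch), listing every group of the same pattern makes the numbering visible:

```javascript
// Groups are numbered by the position of their opening parenthesis,
// counting from left to right, regardless of nesting depth.
const nested = /((\d{4})-(\d{2})-(\d{2})) ((\d{2}):(\d{2}):(\d{2}))/;
const parts = "2023-12-01 14:30:45".match(nested);

// slice(1) drops the full match and leaves groups 1 through 8.
const numbered = parts.slice(1).map((value, i) => `${i + 1}: ${value}`);
console.log(numbered);
// [
//   "1: 2023-12-01", "2: 2023", "3: 12", "4: 01",
//   "5: 14:30:45", "6: 14", "7: 30", "8: 45"
// ]
```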

5.3 Using Capture Groups

A capture group does more than group: it also saves the text it matched.

Accessing capture groups

const text = "John Doe, 25 years old";
const pattern = /(\w+) (\w+), (\d+) years old/;
const match = text.match(pattern);

console.log(match[0]); // "John Doe, 25 years old" (full match)
console.log(match[1]); // "John" (first capture group)
console.log(match[2]); // "Doe" (second capture group)
console.log(match[3]); // "25" (third capture group)

Using destructuring assignment

const emailPattern = /^([^\s@]+)@([^\s@]+)\.([^\s@]+)$/;
const email = "user@example.com";
const match = email.match(emailPattern);

if (match) {
    const [fullMatch, username, domain, extension] = match;
    console.log({
        fullMatch,  // "user@example.com"
        username,   // "user"
        domain,     // "example"
        extension   // "com"
    });
}

Using capture groups in replacements

// Swap the order of first and last names
const names = "John Doe, Jane Smith, Bob Johnson";
const swapped = names.replace(/(\w+) (\w+)/g, "$2, $1");
console.log(swapped); // "Doe, John, Smith, Jane, Johnson, Bob"

// Format a phone number
const phone = "1234567890";
const formatted = phone.replace(/(\d{3})(\d{3})(\d{4})/, "($1) $2-$3");
console.log(formatted); // "(123) 456-7890"

// Use a callback function for more complex replacements
const html = "<p>Hello</p><div>World</div>";
const result = html.replace(/<(\w+)>(.*?)<\/\1>/g, (match, tag, content) => {
    return `[${tag.toUpperCase()}]${content.toUpperCase()}[/${tag.toUpperCase()}]`;
});
console.log(result); // "[P]HELLO[/P][DIV]WORLD[/DIV]"
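
As one more sketch of the two replacement styles, the snippet below reorders ISO-style dates. A replacement string is enough when the captures are only rearranged; a callback is needed once the captured text has to be transformed. The sample text and variable names are made up for illustration.

```javascript
const isoDates = "Start: 2023-12-01, end: 2024-01-15";

// Rearranging captures: a replacement string with $1..$3 is enough.
const european = isoDates.replace(/(\d{4})-(\d{2})-(\d{2})/g, "$3/$2/$1");
console.log(european); // "Start: 01/12/2023, end: 15/01/2024"

// Transforming captures (here: dropping leading zeros) needs a callback.
const noPadding = isoDates.replace(
    /(\d{4})-(\d{2})-(\d{2})/g,
    (match, year, month, day) => `${Number(day)}/${Number(month)}/${year}`
);
console.log(noPadding); // "Start: 1/12/2023, end: 15/1/2024"
```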

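5.4 Non-Capturing Groups (?:)

A non-capturing group groups its contents without saving the matched text. This keeps group numbering uncluttered and spares the engine from recording a capture that is never used, although the speed difference is usually small.

Basic usage

// Capturing version (the slashes inside a regex literal must be escaped)
const withCapture = /(https?):\/\/([\w.-]+)/;
const url1 = "https://example.com";
const match1 = url1.match(withCapture);
console.log(match1.length); // 3 (full match + 2 capture groups)

// Non-capturing version
const withoutCapture = /(?:https?):\/\/([\w.-]+)/;
const match2 = url1.match(withoutCapture);
console.log(match2.length); // 2 (full match + 1 capture group)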

Practical uses

// Match several file extensions, but capture only the file name
const filePattern = /([\w-]+)\.(?:jpg|png|gif|bmp)/i;
const filename = "photo.jpg";
const match = filename.match(filePattern);

console.log(match[1]); // "photo" (only the file name is captured)

// Match the URL protocol without capturing it
const urlPattern = /(?:https?|ftp):\/\/([\w.-]+)/;
const urls = ["http://example.com", "https://test.org", "ftp://files.com"];

urls.forEach(url => {
    const match = url.match(urlPattern);
    if (match) {
        console.log(`Domain: ${match[1]}`);
    }
});

Performance considerations

// If the captured text is not needed, a non-capturing group avoids storing it
const inefficient = /(red|green|blue)/g;
const efficient = /(?:red|green|blue)/g;

const text = "red car, blue sky, green grass";
console.log(text.match(efficient)); // ["red", "blue", "green"]
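
Beyond any speed difference, the choice between capturing and non-capturing groups can change results. With String.prototype.split, text matched by a capturing group in the separator is spliced into the output array. A small sketch with a made-up input string:

```javascript
const csv = "a, b;c";

// Capturing group in the separator: the separators show up in the result.
const withSeparators = csv.split(/(,\s*|;)/);
console.log(withSeparators); // ["a", ", ", "b", ";", "c"]

// Non-capturing group: only the fields are returned.
const fieldsOnly = csv.split(/(?:,\s*|;)/);
console.log(fieldsOnly); // ["a", "b", "c"]
```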

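5.5 Named Capture Groups

ES2018 introduced named capture groups, which make patterns and the code around them easier to read.

Basic syntax

// Named capture group syntax
const pattern = /(?<name>pattern)/;

// A concrete example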
const datePattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const date = "2023-12-01";
const match = date.match(datePattern);

console.log(match.groups.year);  // "2023"
console.log(match.groups.month); // "12"
console.log(match.groups.day);   // "01"

// Access by index still works
console.log(match[1]); // "2023"
console.log(match[2]); // "12"
console.log(match[3]); // "01"

Practical uses

// Parse an email address
const emailPattern = /^(?<username>[^\s@]+)@(?<domain>[^\s@]+)\.(?<extension>[^\s@]+)$/;
const email = "user@example.com";
const match = email.match(emailPattern);

if (match) {
    const { username, domain, extension } = match.groups;
    console.log({ username, domain, extension });
    // { username: "user", domain: "example", extension: "com" }
}

// Parse a URL
const urlPattern = /^(?<protocol>https?):\/\/(?<host>[\w.-]+)(?<port>:\d+)?(?<path>\/.*)?$/;
const url = "https://example.com:8080/api/users";
const urlMatch = url.match(urlPattern);

if (urlMatch) {
    console.log(urlMatch.groups);
    // { protocol: "https", host: "example.com", port: ":8080", path: "/api/users" }
}
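
When an optional named group does not take part in the match, its entry in groups is undefined, so callers usually need a default. The sketch below is a hypothetical variant of the URL pattern in which the colon sits outside the port group; the function name describeUrl and the chosen defaults are assumptions made for illustration.

```javascript
// Hypothetical variant of the URL pattern: the colon sits outside the
// named group, so "port" holds digits only.
const urlParts = /^(?<protocol>https?):\/\/(?<host>[\w.-]+)(?::(?<port>\d+))?(?<path>\/.*)?$/;

function describeUrl(url) {
    const match = url.match(urlParts);
    if (!match) return null;
    const { protocol, host, port, path } = match.groups;
    return {
        protocol,
        host,
        // A named group that did not participate in the match is undefined.
        port: port === undefined ? null : Number(port),
        path: path === undefined ? "/" : path
    };
}

console.log(describeUrl("https://example.com:8080/api/users"));
// { protocol: "https", host: "example.com", port: 8080, path: "/api/users" }
console.log(describeUrl("http://example.com"));
// { protocol: "http", host: "example.com", port: null, path: "/" }
```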

Using named capture groups in replacements

// Use the $<name> syntax
const text = "John Doe";
const namePattern = /(?<first>\w+) (?<last>\w+)/;
const reversed = text.replace(namePattern, "$<last>, $<first>");
console.log(reversed); // "Doe, John"

// Use them in a callback function (groups is passed as the last argument)
const formatted = text.replace(namePattern, (match, p1, p2, offset, string, groups) => {
    return `${groups.last}, ${groups.first}`;
});
console.log(formatted); // "Doe, John"

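5.6 Backreferences

A backreference matches the same text that an earlier capture group matched.

Basic syntax

// \1 refers to the first capture group, \2 to the second, and so on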
const repeatedPattern = /(\w+)\s+\1/; // 匹配重复的单词
console.log(repeatedPattern.test("hello hello")); // true
console.log(repeatedPattern.test("hello world")); // false

// Match a pair of HTML tags
const htmlTagPattern = /<(\w+)>(.*?)<\/\1>/;
console.log(htmlTagPattern.test("<p>content</p>"));   // true
console.log(htmlTagPattern.test("<p>content</div>")); // false
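
Note that a backreference repeats the captured text, not the group's pattern. A minimal sketch of the difference, with pattern names chosen for illustration:

```javascript
// (\d)\1 requires the same digit twice; (\d)\d accepts any two digits.
const sameDigit = /^(\d)\1$/;
const anyTwoDigits = /^(\d)\d$/;

console.log(sameDigit.test("77"));    // true
console.log(sameDigit.test("78"));    // false
console.log(anyTwoDigits.test("78")); // true
```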

Practical uses

// Find repeated words
const text = "This is is a test test sentence";
const duplicateWords = /\b(\w+)\s+\1\b/g;
const duplicates = [];
let match;

while ((match = duplicateWords.exec(text)) !== null) {
    duplicates.push(match[1]);
}
console.log(duplicates); // ["is", "test"]

// Match quoted content (double or single quotes)
const quotePattern = /(['"])(.*?)\1/g;
const textWithQuotes = 'He said "Hello" and she replied \'Hi there\'';
const quotes = [...textWithQuotes.matchAll(quotePattern)];

quotes.forEach(match => {
    console.log(`Quote type: ${match[1]}, content: ${match[2]}`);
});
// Quote type: ", content: Hello
// Quote type: ', content: Hi there

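Backreferences to named capture groups

// \k<name> refers back to a named group. JavaScript has supported it
// since ES2018, together with named groups themselves.
// ($<name> is the separate syntax used in replacement strings.)

const htmlPattern = /<(?<tag>\w+)>(?<content>.*?)<\/\k<tag>>/;
console.log(htmlPattern.test("<p>content</p>"));   // true
console.log(htmlPattern.test("<p>content</div>")); // false

// Without a backreference the closing tag needs its own group, and the
// pattern alone no longer guarantees that the two tags agree
const htmlPatternLoose = /<(?<tag>\w+)>(?<content>.*?)<\/(?<tag2>\w+)>/;

// In that case, compare the two captures yourself, for example in a replace callback
function validateHtmlTags(html) {
    return html.replace(/<(\w+)>(.*?)<\/(\w+)>/g, (match, openTag, content, closeTag) => {
        if (openTag !== closeTag) {
            throw new Error(`Mismatched tags: <${openTag}> and </${closeTag}>`);
        }
        return match;
    });
}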
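
As a further sketch, a named backreference written with \k<name> can also enforce matching quote characters. The group names quote and body and the sample string are chosen for illustration; the snippet assumes an engine with ES2018 named groups and String.prototype.matchAll.

```javascript
// \k<quote> must match the same quote character that opened the string.
const quoted = /(?<quote>['"])(?<body>.*?)\k<quote>/g;
const sample = `say "hi" or 'bye', but not "mixed'`;

const found = [...sample.matchAll(quoted)].map(m => m.groups.body);
console.log(found); // ["hi", "bye"]
```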

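5.7 Conditional Groups (Supported by Some Engines)

A conditional group lets the pattern choose between alternatives depending on whether a condition holds.

Syntax

// (?(condition)yes|no) - if condition holds, match yes; otherwise match no
// (?(condition)yes) - if condition holds, match yes; otherwise match nothing

// Note: JavaScript has no native support for conditional groups; the syntax is shown only to illustrate the concept
// A similar effect can be achieved in other ways

Alternatives in JavaScript

// Try several patterns and combine the results with a logical OR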
const patterns = [
    /pattern1/,
    /pattern2/,
    /pattern3/
];

function testMultiplePatterns(text) {
    return patterns.some(pattern => pattern.test(text));
}

// Implement the conditional logic in a function
function conditionalMatch(text, condition) {
    if (condition) {
        return /pattern1/.exec(text);
    } else {
        return /pattern2/.exec(text);
    }
}
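
Many conditionals can also be rewritten as a plain alternation with one branch per outcome of the condition. The sketch below takes a PCRE-style conditional that requires a closing parenthesis only when an opening one was matched, and spells out both cases; the area-code example itself is an assumption made for illustration.

```javascript
// PCRE-style conditional (not valid in JavaScript):
//   ^(\()?\d{3}(?(1)\))$
// Equivalent alternation: one branch per outcome of the condition.
const areaCode = /^(?:\(\d{3}\)|\d{3})$/;

console.log(areaCode.test("(555)")); // true
console.log(areaCode.test("555"));   // true
console.log(areaCode.test("(555"));  // false
console.log(areaCode.test("555)"));  // false
```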

5.8 Practical Examples

Parsing log files

const logPattern = /^(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?<level>\w+)\] (?<message>.*)$/;

const logLines = [
    "2023-12-01 10:30:45 [INFO] User logged in",
    "2023-12-01 10:31:00 [ERROR] Database connection failed",
    "2023-12-01 10:31:15 [WARN] Low memory warning"
];

const parsedLogs = logLines.map(line => {
    const match = line.match(logPattern);
    return match ? match.groups : null;
}).filter(Boolean);

console.log(parsedLogs);
// [
//   { timestamp: "2023-12-01 10:30:45", level: "INFO", message: "User logged in" },
//   { timestamp: "2023-12-01 10:31:00", level: "ERROR", message: "Database connection failed" },
//   { timestamp: "2023-12-01 10:31:15", level: "WARN", message: "Low memory warning" }
// ]
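
Once each line is parsed into named groups, aggregating is straightforward. The following sketch tallies entries per level and counts lines that do not match; the sample lines and the names entryPattern, countsByLevel, and skipped are made up for illustration.

```javascript
const entryPattern = /^(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?<level>\w+)\] (?<message>.*)$/;

const sampleLines = [
    "2023-12-01 10:30:45 [INFO] User logged in",
    "2023-12-01 10:31:00 [ERROR] Database connection failed",
    "not a log line",
    "2023-12-01 10:31:20 [ERROR] Retry failed"
];

const countsByLevel = {};
let skipped = 0;

for (const line of sampleLines) {
    const match = entryPattern.exec(line);
    if (!match) {
        skipped += 1; // keep track of lines the pattern could not parse
        continue;
    }
    const { level } = match.groups;
    countsByLevel[level] = (countsByLevel[level] || 0) + 1;
}

console.log(countsByLevel); // { INFO: 1, ERROR: 2 }
console.log(skipped);       // 1
```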

Formatting data

// Format a credit card number
function formatCreditCard(cardNumber) {
    const pattern = /(\d{4})(\d{4})(\d{4})(\d{4})/;
    return cardNumber.replace(pattern, "$1-$2-$3-$4");
}

console.log(formatCreditCard("1234567890123456")); // "1234-5678-9012-3456"

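// Format a phone number
function formatPhoneNumber(phone) {
    // Patterns are tried in order; the first one that matches wins
    const patterns = [
        { regex: /^(\d{3})(\d{3})(\d{4})$/, format: "($1) $2-$3" },           // US format (10 digits)
        { regex: /^(\d{3})(\d{4})(\d{4})$/, format: "$1-$2-$3" },             // Chinese mobile number (11 digits)
        { regex: /^(\d{4})(\d{3})(\d{3})$/, format: "$1-$2-$3" }              // other 10-digit format; never reached, because the US pattern already matches every 10-digit input
    ];

    for (const { regex, format } of patterns) {
        if (regex.test(phone)) {
            return phone.replace(regex, format);
        }
    }
    return phone; // no pattern matched, so return the input unchanged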
}

Extracting and validating

// Extract all links from a piece of text
function extractLinks(text) {
    const linkPattern = /\[(?<text>[^\]]+)\]\((?<url>https?:\/\/[^\)]+)\)/g;
    const links = [];
    let match;

    while ((match = linkPattern.exec(text)) !== null) {
        links.push({
            text: match.groups.text,
            url: match.groups.url
        });
    }

    return links;
}

const markdown = "See [Google](https://google.com) and [GitHub](https://github.com)";
console.log(extractLinks(markdown));
// [
//   { text: "Google", url: "https://google.com" },
//   { text: "GitHub", url: "https://github.com" }
// ]
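
The exec loop can also be written with String.prototype.matchAll (ES2020), which avoids managing lastIndex by hand. A minimal sketch, assuming an environment that supports matchAll; the function name extractLinksWithMatchAll is hypothetical.

```javascript
function extractLinksWithMatchAll(text) {
    // matchAll requires the g flag and yields one match object per link.
    const linkPattern = /\[(?<text>[^\]]+)\]\((?<url>https?:\/\/[^)]+)\)/g;
    return Array.from(text.matchAll(linkPattern), m => ({
        text: m.groups.text,
        url: m.groups.url
    }));
}

const sampleMarkdown = "See [Google](https://google.com) and [GitHub](https://github.com)";
console.log(extractLinksWithMatchAll(sampleMarkdown));
// [
//   { text: "Google", url: "https://google.com" },
//   { text: "GitHub", url: "https://github.com" }
// ]
```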

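5.9 Common Mistakes and Best Practices

Mistake 1: Too many capture groups

// Not ideal: creates a capture group that is never used
const inefficient = /(red|green|blue) (car|bike|plane)/;

// Better: capture only the part you need
const efficient = /(red|green|blue) (?:car|bike|plane)/;

// Or use named capture groups for readability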
const readable = /(?<color>red|green|blue) (?<vehicle>car|bike|plane)/;

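Mistake 2: Misusing backreferences

// Wrong: tries to reference a capture group that does not exist
const wrong = /(\w+) \2/; // \2 refers to a second capture group that does not exist

// Correct: make sure every referenced group exists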
const correct = /(\w+) (\w+) \1 \2/; // 引用存在的捕获组
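
What a dangling reference such as \2 actually does depends on the flags. The sketch below assumes a standard ECMAScript engine such as Node.js: without the u flag the pattern compiles but silently means something else, while with the u flag it is rejected at compile time.

```javascript
// Without the "u" flag, \2 with only one group is read as a legacy
// octal escape (the character U+0002), so the pattern changes meaning.
const legacy = /(\w+) \2/;
console.log(legacy.test("hello hello"));  // false
console.log(legacy.test("hello \u0002")); // true

// With the "u" flag the same pattern is rejected when it is compiled.
let unicodeError = null;
try {
    new RegExp("(\\w+) \\2", "u");
} catch (error) {
    unicodeError = error;
}
console.log(unicodeError instanceof SyntaxError); // true
```

Adding the u flag is therefore one way to surface this mistake early.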

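Best practices

// 1. Use named capture groups for readability
const readable = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;

// 2. Use non-capturing groups when no capture is needed
const efficient = /(?:Mr|Mrs|Ms)\.? (\w+)/;

// 3. Use backreferences where they fit
const htmlTags = /<(\w+)>(.*?)<\/\1>/;

// 4. Combine these techniques for flexibility (slashes in a regex literal must be escaped)
const flexible = /(?<protocol>https?):\/\/(?<domain>[\w.-]+)(?<port>:\d+)?/;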

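5.10 Exercises

Exercise 1: Basic grouping

Write regular expressions that:

Match a repeated word pair (such as "the the")
Match an HTML tag together with its content
Match a URL whose protocol part is optional

// Answers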
const duplicateWords = /\b(\w+)\s+\1\b/g;
const htmlTags = /<(\w+)>(.*?)<\/\1>/g;
const urlWithOptionalProtocol = /(https?:\/\/)?[\w.-]+/;

Exercise 2: Named capture groups

Parse a timestamp in the following format: "2023-12-01T14:30:45Z"

// Answer
const timestampPattern = /^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})T(?<hour>\d{2}):(?<minute>\d{2}):(?<second>\d{2})Z$/;

function parseTimestamp(timestamp) {
    const match = timestamp.match(timestampPattern);
    if (match) {
        return match.groups;
    }
    return null;
}
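
The captured groups are strings, so a natural follow-up is converting them to numbers. As a rough sketch (the function name toUtcDate is hypothetical), the parts can be passed to Date.UTC, keeping in mind that its month argument is zero-based:

```javascript
function toUtcDate(timestamp) {
    const pattern = /^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})T(?<hour>\d{2}):(?<minute>\d{2}):(?<second>\d{2})Z$/;
    const match = pattern.exec(timestamp);
    if (!match) return null;

    const { year, month, day, hour, minute, second } = match.groups;
    // Date.UTC expects numbers and a zero-based month (0 = January).
    return new Date(Date.UTC(
        Number(year), Number(month) - 1, Number(day),
        Number(hour), Number(minute), Number(second)
    ));
}

console.log(toUtcDate("2023-12-01T14:30:45Z").toISOString());
// "2023-12-01T14:30:45.000Z"
console.log(toUtcDate("2023-12-01 14:30:45")); // null
```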

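Exercise 3: Practical application

Write a function that extracts every email address from a piece of text and returns the username and domain parts separately.

// Answer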
function extractEmails(text) {
    const emailPattern = /(?<username>[^\s@]+)@(?<domain>[^\s@]+\.[^\s@]+)/g;
    const emails = [];
    let match;

    while ((match = emailPattern.exec(text)) !== null) {
        emails.push({
            full: match[0],
            username: match.groups.username,
            domain: match.groups.domain
        });
    }

    return emails;
}

const text = "Contact us: admin@company.com or support@help.org";
console.log(extractEmails(text));

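Summary

Grouping and capturing are core features of regular expressions:

Basic groups (): treat several characters as a unit that a quantifier can apply to
Capture groups: save the matched text so it can be referenced later
Non-capturing groups (?:): group without saving the text, which keeps numbering clean and avoids unneeded captures
Named capture groups (?<name>): make code easier to read and maintain
Backreferences: match the text of an earlier capture group again, enabling more complex patterns
Practical uses: parsing, formatting, and validating data

Mastering grouping and capturing lets us write more powerful and flexible regular expressions.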