Chapter 5: Grouping and Capturing

Haiyue

Learning Objectives

  1. Understand the concept and syntax of grouping ()
  2. Master the use and reference of capturing groups
  3. Learn to use non-capturing groups (?:)
  4. Master the syntax and application of named capturing groups
  5. Understand the concept and use of backreferences

5.1 Overview of Grouping

Grouping is an important concept in regular expressions that allows us to:

  • Treat multiple characters as a whole
  • Apply quantifiers to a group of characters
  • Capture matched content for later use
  • Create complex matching patterns

Basic Grouping Syntax

(pattern)   # Capturing group
(?:pattern) # Non-capturing group
(?<name>pattern) # Named capturing group (supported by some engines)

5.2 Basic Grouping ()

The most basic grouping uses parentheses to combine multiple characters into a single unit.

Basic Usage

// Treat "abc" as a whole
const pattern1 = /(abc)/;
console.log(pattern1.test("abcdef")); // true

// Apply quantifier to a group
const pattern2 = /(abc)+/;
console.log("abcabcabc".match(pattern2)[0]); // "abcabcabc"

// Optional group
const pattern3 = /(www\.)?example\.com/;
console.log(pattern3.test("example.com"));     // true
console.log(pattern3.test("www.example.com")); // true

Nested Grouping

// Nested grouping example
const datePattern = /(\d{4})-(\d{2})-(\d{2})/;
const timePattern = /(\d{2}):(\d{2}):(\d{2})/;

// Combined date and time
const datetimePattern = /((\d{4})-(\d{2})-(\d{2})) ((\d{2}):(\d{2}):(\d{2}))/;

const datetime = "2023-12-01 14:30:45";
const match = datetime.match(datetimePattern);

console.log(match[0]); // "2023-12-01 14:30:45" (full match)
console.log(match[1]); // "2023-12-01" (date group)
console.log(match[5]); // "14:30:45" (time group)
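
The index 5 for the time group follows from how groups are numbered: by the position of their opening parenthesis, counted from left to right, regardless of nesting. As a rough sketch, the snippet below prints every group of the combined pattern next to a label; the labels array is an assumption added here for illustration.

```javascript
const nestedPattern = /((\d{4})-(\d{2})-(\d{2})) ((\d{2}):(\d{2}):(\d{2}))/;

// Illustrative labels, one per index, in order of opening parenthesis.
const labels = [
    "full match", "date", "year", "month", "day",
    "time", "hour", "minute", "second"
];

const nestedMatch = "2023-12-01 14:30:45".match(nestedPattern);
const numbered = nestedMatch.map(
    (value, index) => `${index}: ${value} (${labels[index]})`
);

console.log(numbered.join("\n"));
// 0: 2023-12-01 14:30:45 (full match)
// 1: 2023-12-01 (date)
// 2: 2023 (year)
// 3: 12 (month)
// 4: 01 (day)
// 5: 14:30:45 (time)
// 6: 14 (hour)
// 7: 30 (minute)
// 8: 45 (second)
```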

5.3 Using Capturing Groups

Capturing groups do two things: they group a subpattern, and they save the text it matched so you can use it later.

Accessing Capturing Groups

const text = "John Doe, 25 years old";
const pattern = /(\w+) (\w+), (\d+) years old/;
const match = text.match(pattern);

console.log(match[0]); // "John Doe, 25 years old" (full match)
console.log(match[1]); // "John" (first capturing group)
console.log(match[2]); // "Doe" (second capturing group)
console.log(match[3]); // "25" (third capturing group)

Using Destructuring Assignment

const emailPattern = /^([^\s@]+)@([^\s@]+)\.([^\s@]+)$/;
const email = "user@example.com";
const match = email.match(emailPattern);

if (match) {
    const [fullMatch, username, domain, extension] = match;
    console.log({
        fullMatch,  // "user@example.com"
        username,   // "user"
        domain,     // "example"
        extension   // "com"
    });
}

Using Capturing Groups in Replacements

// Swap name format
const names = "John Doe, Jane Smith, Bob Johnson";
const swapped = names.replace(/(\w+) (\w+)/g, "$2, $1");
console.log(swapped); // "Doe, John, Smith, Jane, Johnson, Bob"

// Format phone number
const phone = "1234567890";
const formatted = phone.replace(/(\d{3})(\d{3})(\d{4})/, "($1) $2-$3");
console.log(formatted); // "(123) 456-7890"

// Use callback function for complex replacements
const html = "<p>Hello</p><div>World</div>";
const result = html.replace(/<(\w+)>(.*?)<\/\1>/g, (match, tag, content) => {
    return `[${tag.toUpperCase()}]${content.toUpperCase()}[/${tag.toUpperCase()}]`;
});
console.log(result); // "[P]HELLO[/P][DIV]WORLD[/DIV]"
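
The callback form is most useful when the replacement has to be computed from the captured text rather than just rearranged. As a small sketch (the function name snakeToCamel is hypothetical), here is a converter from snake_case to camelCase that uppercases the letter captured after each underscore:

```javascript
// Hypothetical helper: convert snake_case identifiers to camelCase.
function snakeToCamel(identifier) {
    // Group 1 captures the lowercase letter that follows an underscore.
    return identifier.replace(/_([a-z])/g, (match, letter) => letter.toUpperCase());
}

console.log(snakeToCamel("user_first_name")); // "userFirstName"
console.log(snakeToCamel("id"));              // "id" (nothing to replace)
```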

5.4 Non-Capturing Groups (?:)

Non-capturing groups provide grouping without saving the matched content. This keeps the numbering of the remaining capturing groups simple and avoids the small cost of storing captures you never use.

Basic Usage

// Version with capturing groups
const withCapture = /(https?):\/\/([\w.-]+)/;
const url1 = "https://example.com";
const match1 = url1.match(withCapture);
console.log(match1.length); // 3 (full match + 2 capturing groups)

// Version with non-capturing groups
const withoutCapture = /(?:https?):\/\/([\w.-]+)/;
const match2 = url1.match(withoutCapture);
console.log(match2.length); // 2 (full match + 1 capturing group)

Practical Applications

// Match different file extensions, but only capture filename
const filePattern = /([\w-]+)\.(?:jpg|png|gif|bmp)/i;
const filename = "photo.jpg";
const match = filename.match(filePattern);

console.log(match[1]); // "photo" (only filename is captured)

// Match URL protocol, but don't capture protocol part
const urlPattern = /(?:https?|ftp):\/\/([\w.-]+)/;
const urls = ["http://example.com", "https://test.org", "ftp://files.com"];

urls.forEach(url => {
    const match = url.match(urlPattern);
    if (match) {
        console.log(`Domain: ${match[1]}`);
    }
});

Performance Considerations

// If you don't need the captured text, a non-capturing group states that intent
// and skips storing the capture (the speed difference is usually small)
const inefficient = /(red|green|blue)/g;
const efficient = /(?:red|green|blue)/g;

const text = "red car, blue sky, green grass";
console.log(text.match(efficient)); // ["red", "blue", "green"]
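
Beyond performance, the choice between a capturing and a non-capturing group can change the result of an operation. One case where this is visible is String.prototype.split, which includes captured separator text in its output. A short sketch, with illustrative variable names:

```javascript
const list = "a,b;c";

// Capturing group: the matched separators are spliced into the result.
const splitWithCapture = list.split(/(,|;)/);
console.log(splitWithCapture); // ["a", ",", "b", ";", "c"]

// Non-capturing group: only the pieces between separators are returned.
const splitWithoutCapture = list.split(/(?:,|;)/);
console.log(splitWithoutCapture); // ["a", "b", "c"]
```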

5.5 Named Capturing Groups

ES2018 introduced named capturing groups, making code more readable.

Basic Syntax

// Named capturing group syntax
const pattern = /(?<name>pattern)/;

// Practical example
const datePattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const date = "2023-12-01";
const match = date.match(datePattern);

console.log(match.groups.year);  // "2023"
console.log(match.groups.month); // "12"
console.log(match.groups.day);   // "01"

// Still accessible via index
console.log(match[1]); // "2023"
console.log(match[2]); // "12"
console.log(match[3]); // "01"

Practical Applications

// Parse email address
const emailPattern = /^(?<username>[^\s@]+)@(?<domain>[^\s@]+)\.(?<extension>[^\s@]+)$/;
const email = "user@example.com";
const match = email.match(emailPattern);

if (match) {
    const { username, domain, extension } = match.groups;
    console.log({ username, domain, extension });
    // { username: "user", domain: "example", extension: "com" }
}

// Parse URL
const urlPattern = /^(?<protocol>https?):\/\/(?<host>[\w.-]+)(?<port>:\d+)?(?<path>\/.*)?$/;
const url = "https://example.com:8080/api/users";
const urlMatch = url.match(urlPattern);

if (urlMatch) {
    console.log(urlMatch.groups);
    // { protocol: "https", host: "example.com", port: ":8080", path: "/api/users" }
}
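
When an optional named group does not take part in a match, its entry in groups is undefined, so code that reads it should handle that case. The sketch below reuses the URL pattern from above; the parseUrl function and its default values (port 80 or 443, path "/") are assumptions made for illustration.

```javascript
const optionalUrlPattern = /^(?<protocol>https?):\/\/(?<host>[\w.-]+)(?<port>:\d+)?(?<path>\/.*)?$/;

// Hypothetical helper that fills in defaults for the optional groups.
function parseUrl(url) {
    const match = url.match(optionalUrlPattern);
    if (!match) return null;

    const { protocol, host, port, path } = match.groups;
    return {
        protocol,
        host,
        // port is undefined when the (?<port>...) group did not participate.
        // The fallback values are assumptions for this sketch.
        port: port ? Number(port.slice(1)) : (protocol === "https" ? 443 : 80),
        path: path ?? "/"
    };
}

console.log(parseUrl("http://example.com"));
// { protocol: "http", host: "example.com", port: 80, path: "/" }
console.log(parseUrl("https://example.com:8080/api/users"));
// { protocol: "https", host: "example.com", port: 8080, path: "/api/users" }
```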

Using Named Capturing Groups in Replacements

// Using $<name> syntax
const text = "John Doe";
const namePattern = /(?<first>\w+) (?<last>\w+)/;
const reversed = text.replace(namePattern, "$<last>, $<first>");
console.log(reversed); // "Doe, John"

// Using in callback function
const formatted = text.replace(namePattern, (match, p1, p2, offset, string, groups) => {
    return `${groups.last}, ${groups.first}`;
});
console.log(formatted); // "Doe, John"
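
The $<name> syntax also works with the g flag, which makes it convenient for reformatting every occurrence in a longer string. A brief sketch, assuming dates written as YYYY-MM-DD that should be rewritten as DD/MM/YYYY (the sample text is made up):

```javascript
const isoDate = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/g;
const report = "Start: 2023-12-01, end: 2024-01-15";

// Each $<name> is replaced by the text captured by the group of that name.
const european = report.replace(isoDate, "$<day>/$<month>/$<year>");
console.log(european); // "Start: 01/12/2023, end: 15/01/2024"
```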

5.6 Backreferences

Backreferences allow you to reference the content of previously captured groups.

Basic Syntax

// \1 references the first capturing group, \2 the second, etc.
const repeatedPattern = /(\w+)\s+\1/; // Match repeated words
console.log(repeatedPattern.test("hello hello")); // true
console.log(repeatedPattern.test("hello world")); // false

// Match HTML tag pairs
const htmlTagPattern = /<(\w+)>(.*?)<\/\1>/;
console.log(htmlTagPattern.test("<p>content</p>"));   // true
console.log(htmlTagPattern.test("<p>content</div>")); // false
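
A backreference matches the same text that the group captured, not the group's pattern again. The following sketch contrasts the two; the pattern names are illustrative.

```javascript
// Backreference: the second digit must be identical to the first.
const sameDigitTwice = /^(\d)\1$/;
// Repeated pattern: any two digits are accepted.
const anyTwoDigits = /^(\d)\d$/;

console.log(sameDigitTwice.test("11")); // true
console.log(sameDigitTwice.test("12")); // false
console.log(anyTwoDigits.test("12"));   // true

// With the i flag the backreference is compared case-insensitively.
const repeatedWordAnyCase = /^(\w+) \1$/i;
console.log(repeatedWordAnyCase.test("Hello hello")); // true
```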

Practical Applications

// Find duplicate words
const text = "This is is a test test sentence";
const duplicateWords = /\b(\w+)\s+\1\b/g;
const duplicates = [];
let match;

while ((match = duplicateWords.exec(text)) !== null) {
    duplicates.push(match[1]);
}
console.log(duplicates); // ["is", "test"]

// Match quoted content (double or single quotes)
const quotePattern = /(['"])(.*?)\1/g;
const textWithQuotes = 'He said "Hello" and she replied \'Hi there\'';
const quotes = [...textWithQuotes.matchAll(quotePattern)];

quotes.forEach(match => {
    console.log(`Quote type: ${match[1]}, Content: ${match[2]}`);
});
// Quote type: ", Content: Hello
// Quote type: ', Content: Hi there

Backreferences with Named Capturing Groups

// \k<name> refers back to a named group inside the pattern (JavaScript supports it since ES2018)
// $<name> is the counterpart used in replacement strings

const htmlPattern = /<(?<tag>\w+)>(?<content>.*?)<\/\k<tag>>/;
console.log(htmlPattern.test("<p>content</p>"));   // true
console.log(htmlPattern.test("<p>content</div>")); // false

// Alternative: capture both tags and compare them in code,
// which lets you report a mismatch instead of silently failing to match
function validateHtmlTags(html) {
    return html.replace(/<(\w+)>(.*?)<\/(\w+)>/g, (match, openTag, content, closeTag) => {
        if (openTag !== closeTag) {
            throw new Error(`Tag mismatch: <${openTag}> and </${closeTag}>`);
        }
        return match;
    });
}
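
JavaScript regular expressions that contain named groups accept \k<name> inside the pattern (ES2018 and later). As a small sketch, the quoted-content pattern from the previous section can be written with a named backreference; the group names quote and content and the sample string are chosen here for illustration.

```javascript
// \k<quote> must match the same quote character that opened the string.
const quotedPattern = /(?<quote>['"])(?<content>.*?)\k<quote>/g;
const sample = `say "hi" or 'bye' but not "mixed'`;

const found = [...sample.matchAll(quotedPattern)].map(m => m.groups.content);
console.log(found); // ["hi", "bye"]
```

The last fragment is skipped because its opening double quote is never closed by another double quote.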

5.7 Conditional Grouping (Supported by Some Engines)

Conditional grouping allows selecting different matching patterns based on conditions.

Syntax

// (?(condition)yes|no) - if condition matches, use yes, otherwise use no
// (?(condition)yes) - if condition matches, use yes, otherwise match empty

// Note: JavaScript does not support conditional groups; the syntax is shown here only as a concept
// (engines such as PCRE and .NET support it). Similar effects can be achieved in other ways

Alternative Solutions in JavaScript

// Using multiple patterns and logical OR
const patterns = [
    /pattern1/,
    /pattern2/,
    /pattern3/
];

function testMultiplePatterns(text) {
    return patterns.some(pattern => pattern.test(text));
}

// Using functions to implement conditional logic
function conditionalMatch(text, condition) {
    if (condition) {
        return /pattern1/.exec(text);
    } else {
        return /pattern2/.exec(text);
    }
}
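
Many conditionals can also be rewritten as a plain alternation inside one pattern. A typical case is "if there was an opening parenthesis, require a closing one". The sketch below is one possible rewrite, not the only one; the function name parseAreaCode and the group names are hypothetical.

```javascript
// Conditional form in an engine that supports it (not valid in JavaScript):
//   ^(\()?\d{3}(?(1)\))$
// JavaScript alternative: spell out both branches with alternation.
const areaCodePattern = /^(?:\((?<inParens>\d{3})\)|(?<bare>\d{3}))$/;

function parseAreaCode(text) {
    const match = text.match(areaCodePattern);
    if (!match) return null;
    // Only one of the two groups participates in any given match.
    return match.groups.inParens ?? match.groups.bare;
}

console.log(parseAreaCode("(415)")); // "415"
console.log(parseAreaCode("415"));   // "415"
console.log(parseAreaCode("(415"));  // null (unbalanced)
console.log(parseAreaCode("415)"));  // null (unbalanced)
```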

5.8 Practical Use Cases

Parsing Log Files

const logPattern = /^(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?<level>\w+)\] (?<message>.*)$/;

const logLines = [
    "2023-12-01 10:30:45 [INFO] User logged in",
    "2023-12-01 10:31:00 [ERROR] Database connection failed",
    "2023-12-01 10:31:15 [WARN] Low memory warning"
];

const parsedLogs = logLines.map(line => {
    const match = line.match(logPattern);
    return match ? match.groups : null;
}).filter(Boolean);

console.log(parsedLogs);
// [
//   { timestamp: "2023-12-01 10:30:45", level: "INFO", message: "User logged in" },
//   { timestamp: "2023-12-01 10:31:00", level: "ERROR", message: "Database connection failed" },
//   { timestamp: "2023-12-01 10:31:15", level: "WARN", message: "Low memory warning" }
// ]

Formatting Data

// Format credit card number
function formatCreditCard(cardNumber) {
    const pattern = /(\d{4})(\d{4})(\d{4})(\d{4})/;
    return cardNumber.replace(pattern, "$1-$2-$3-$4");
}

console.log(formatCreditCard("1234567890123456")); // "1234-5678-9012-3456"

// Format phone number
function formatPhoneNumber(phone) {
    const patterns = [
        { regex: /^(\d{3})(\d{3})(\d{4})$/, format: "($1) $2-$3" },           // US format
        { regex: /^(\d{3})(\d{4})(\d{4})$/, format: "$1-$2-$3" },             // China mobile
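        // Note: the first matching pattern wins, so this 10-digit entry is never reached
        // as written (the US pattern above already matches every 10-digit input)
        { regex: /^(\d{4})(\d{3})(\d{3})$/, format: "$1-$2-$3" }              // Other format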
    ];

    for (const { regex, format } of patterns) {
        if (regex.test(phone)) {
            return phone.replace(regex, format);
        }
    }
    return phone; // Return original if no match
}

Extraction and Validation

// Extract all links from text
function extractLinks(text) {
    const linkPattern = /\[(?<text>[^\]]+)\]\((?<url>https?:\/\/[^\)]+)\)/g;
    const links = [];
    let match;

    while ((match = linkPattern.exec(text)) !== null) {
        links.push({
            text: match.groups.text,
            url: match.groups.url
        });
    }

    return links;
}

const markdown = "Check out [Google](https://google.com) and [GitHub](https://github.com)";
console.log(extractLinks(markdown));
// [
//   { text: "Google", url: "https://google.com" },
//   { text: "GitHub", url: "https://github.com" }
// ]
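
The same extraction can be written without a manual exec loop by using matchAll, which requires the g flag and yields one match object per link. This is a sketch of an equivalent variant; the function name extractLinksWithMatchAll is made up for this example.

```javascript
// Variant of extractLinks built on String.prototype.matchAll.
function extractLinksWithMatchAll(text) {
    const linkPattern = /\[(?<text>[^\]]+)\]\((?<url>https?:\/\/[^)]+)\)/g;
    // Array.from accepts the iterator plus a mapping function.
    return Array.from(text.matchAll(linkPattern), ({ groups }) => ({
        text: groups.text,
        url: groups.url
    }));
}

const sampleMarkdown = "Check out [Google](https://google.com) and [GitHub](https://github.com)";
console.log(extractLinksWithMatchAll(sampleMarkdown));
// [
//   { text: "Google", url: "https://google.com" },
//   { text: "GitHub", url: "https://github.com" }
// ]
```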

5.9 Common Mistakes and Best Practices

Mistake 1: Too Many Capturing Groups

// Bad: Creates unnecessary capturing groups
const inefficient = /(red|green|blue) (car|bike|plane)/;

// Better: Only capture what's needed
const efficient = /(red|green|blue) (?:car|bike|plane)/;

// Or use named capturing groups for better readability
const readable = /(?<color>red|green|blue) (?<vehicle>car|bike|plane)/;

Mistake 2: Misusing Backreferences

// Wrong: referencing a capturing group that doesn't exist
// There is no group 2: without the u flag, \2 is read as a legacy octal escape;
// with the u flag, the pattern is a SyntaxError
const wrong = /(\w+) \2/;

// Correct: Ensure referenced capturing group exists
const correct = /(\w+) (\w+) \1 \2/; // References existing capturing groups
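
What happens with a backreference to a missing group depends on the regex mode. The sketch below shows the behavior in JavaScript as I understand it: without the u flag the pattern compiles but \2 is treated as an octal escape for the character U+0002, while with the u flag the pattern is rejected. The patterns are built with the RegExp constructor so the error can be caught at runtime.

```javascript
// Legacy (non-Unicode) mode: no error, but \2 means the character U+0002.
const legacy = new RegExp("(\\w+) \\2");
console.log(legacy.test("hello hello"));  // false
console.log(legacy.test("hello \u0002")); // true

// Unicode mode: the invalid backreference is rejected when the pattern is compiled.
let unicodeError = null;
try {
    new RegExp("(\\w+) \\2", "u");
} catch (error) {
    unicodeError = error;
}
console.log(unicodeError instanceof SyntaxError); // true
```

Using the u flag is a simple way to have this kind of mistake reported instead of silently matching something unintended.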

Best Practices

// 1. Use named capturing groups for better readability
const readable = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;

// 2. Use non-capturing groups when you don't need to capture
const efficient = /(?:Mr|Mrs|Ms)\.? (\w+)/;

// 3. Use backreferences appropriately
const htmlTags = /<(\w+)>(.*?)<\/\1>/;

// 4. Combine for flexibility
const flexible = /(?<protocol>https?):\/\/(?<domain>[\w.-]+)(?<port>:\d+)?/;

5.10 Practice Exercises

Exercise 1: Basic Grouping

Write regular expressions:

  1. Match repeated word pairs (like “the the”)
  2. Match HTML tags and their content
  3. Match URLs with optional protocol part

// Answers
const duplicateWords = /\b(\w+)\s+\1\b/g;
const htmlTags = /<(\w+)>(.*?)<\/\1>/g;
const urlWithOptionalProtocol = /(https?:\/\/)?[\w.-]+/;

Exercise 2: Named Capturing Groups

Parse a UTC timestamp in the format “YYYY-MM-DDThh:mm:ssZ” into its year, month, day, hour, minute, and second parts.

// Answer
const timestampPattern = /^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})T(?<hour>\d{2}):(?<minute>\d{2}):(?<second>\d{2})Z$/;

function parseTimestamp(timestamp) {
    const match = timestamp.match(timestampPattern);
    if (match) {
        return match.groups;
    }
    return null;
}
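
The groups returned by parseTimestamp are strings. If numeric values or a Date object are needed, they have to be converted explicitly. A possible follow-up sketch, where the function name timestampToDate and the sample input are assumptions for illustration:

```javascript
const isoTimestampPattern = /^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})T(?<hour>\d{2}):(?<minute>\d{2}):(?<second>\d{2})Z$/;

// Hypothetical helper: turn the captured strings into a Date (UTC).
function timestampToDate(timestamp) {
    const match = timestamp.match(isoTimestampPattern);
    if (!match) return null;

    const { year, month, day, hour, minute, second } = match.groups;
    // Date.UTC expects a zero-based month, hence the "- 1".
    return new Date(Date.UTC(
        Number(year), Number(month) - 1, Number(day),
        Number(hour), Number(minute), Number(second)
    ));
}

const parsedDate = timestampToDate("2023-12-01T14:30:45Z");
console.log(parsedDate.toISOString()); // "2023-12-01T14:30:45.000Z"
console.log(timestampToDate("2023-12-01 14:30:45")); // null (wrong format)
```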

Exercise 3: Practical Application

Write a function to extract all email addresses from text and get username and domain separately.

// Answer
function extractEmails(text) {
    const emailPattern = /(?<username>[^\s@]+)@(?<domain>[^\s@]+\.[^\s@]+)/g;
    const emails = [];
    let match;

    while ((match = emailPattern.exec(text)) !== null) {
        emails.push({
            full: match[0],
            username: match.groups.username,
            domain: match.groups.domain
        });
    }

    return emails;
}

const text = "Contact us: admin@company.com or support@help.org";
console.log(extractEmails(text));

Summary

Grouping and capturing are core features of regular expressions:

  1. Basic Grouping (): Treats multiple characters as a whole, can apply quantifiers
  2. Capturing Groups: Save matched content, can be referenced later
  3. Non-Capturing Groups (?:): Provide grouping without saving content, which keeps group numbering simple and avoids unnecessary captures
  4. Named Capturing Groups (?<name>): Improve code readability and maintainability
  5. Backreferences: Reference previously captured group content for complex matching
  6. Practical Applications: Data parsing, formatting, validation, etc.

Mastering grouping and capturing techniques allows us to write more powerful and flexible regular expressions.