Chapter 5: Grouping and Capturing
Learning Objectives
- Understand the concept and syntax of grouping ()
- Master the use and reference of capturing groups
- Learn to use non-capturing groups (?:)
- Master the syntax and application of named capturing groups
- Understand the concept and use of backreferences
5.1 Overview of Grouping
Grouping is an important concept in regular expressions that allows us to:
- Treat multiple characters as a whole
- Apply quantifiers to a group of characters
- Capture matched content for later use
- Create complex matching patterns
Basic Grouping Syntax
(pattern) # Capturing group
(?:pattern) # Non-capturing group
(?<name>pattern) # Named capturing group (supported by some engines)
5.2 Basic Grouping ()
The most basic grouping uses parentheses to combine multiple characters into a single unit.
Basic Usage
// Treat "abc" as a whole
const pattern1 = /(abc)/;
console.log(pattern1.test("abcdef")); // true
// Apply quantifier to a group
const pattern2 = /(abc)+/;
console.log("abcabcabc".match(pattern2)[0]); // "abcabcabc"
// Optional group
const pattern3 = /(www\.)?example\.com/;
console.log(pattern3.test("example.com")); // true
console.log(pattern3.test("www.example.com")); // true
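One detail worth seeing in code: when a quantifier repeats a capturing group, the group keeps only the text from its last iteration. The sketch below uses made-up input and the non-capturing syntax `(?:...)` from the overview to capture the whole run instead.

```javascript
// A quantified capturing group keeps only its last iteration
const repeated = "abcabcabd".match(/(ab[cd])+/);
console.log(repeated[0]); // "abcabcabd" (full match)
console.log(repeated[1]); // "abd" (last iteration only)

// Wrapping the repetition in an outer group captures the whole run
const wholeRun = "abcabcabd".match(/((?:ab[cd])+)/);
console.log(wholeRun[1]); // "abcabcabd"
```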
Nested Grouping
// Nested grouping example
const datePattern = /(\d{4})-(\d{2})-(\d{2})/;
const timePattern = /(\d{2}):(\d{2}):(\d{2})/;
// Combined date and time
const datetimePattern = /((\d{4})-(\d{2})-(\d{2})) ((\d{2}):(\d{2}):(\d{2}))/;
const datetime = "2023-12-01 14:30:45";
const match = datetime.match(datetimePattern);
console.log(match[0]); // "2023-12-01 14:30:45" (full match)
console.log(match[1]); // "2023-12-01" (date group)
console.log(match[5]); // "14:30:45" (time group)
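Groups are numbered by the position of their opening parenthesis, counted from left to right, which is why the time group above is number 5 rather than 2. As a quick check, this sketch lists every group of the same pattern with its number (the variable names are illustrative):

```javascript
const nestedPattern = /((\d{4})-(\d{2})-(\d{2})) ((\d{2}):(\d{2}):(\d{2}))/;
const nestedMatch = "2023-12-01 14:30:45".match(nestedPattern);

// slice(1) drops the full match and leaves groups 1 through 8
const numberedGroups = nestedMatch.slice(1).map((value, i) => `${i + 1}: ${value}`);
console.log(numberedGroups);
// ["1: 2023-12-01", "2: 2023", "3: 12", "4: 01", "5: 14:30:45", "6: 14", "7: 30", "8: 45"]
```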
5.3 Using Capturing Groups
Capturing groups not only group but also save matched content.
Accessing Capturing Groups
const text = "John Doe, 25 years old";
const pattern = /(\w+) (\w+), (\d+) years old/;
const match = text.match(pattern);
console.log(match[0]); // "John Doe, 25 years old" (full match)
console.log(match[1]); // "John" (first capturing group)
console.log(match[2]); // "Doe" (second capturing group)
console.log(match[3]); // "25" (third capturing group)
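`match()` returns `null` when the pattern does not match, so indexing the result directly can throw. A minimal sketch of a guarded version follows; `parsePerson` is a hypothetical helper name, not part of any library.

```javascript
const personPattern = /(\w+) (\w+), (\d+) years old/;

// Hypothetical helper: returns the captured fields, or null when there is no match
function parsePerson(text) {
  const m = text.match(personPattern);
  if (m === null) return null;
  return { first: m[1], last: m[2], age: Number(m[3]) };
}

console.log(parsePerson("John Doe, 25 years old")); // { first: "John", last: "Doe", age: 25 }
console.log(parsePerson("no match here")); // null
```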
Using Destructuring Assignment
const emailPattern = /^([^\s@]+)@([^\s@]+)\.([^\s@]+)$/;
const email = "user@example.com";
const match = email.match(emailPattern);
if (match) {
const [fullMatch, username, domain, extension] = match;
console.log({
fullMatch, // "user@example.com"
username, // "user"
domain, // "example"
extension // "com"
});
}
Using Capturing Groups in Replacements
// Swap name format
const names = "John Doe, Jane Smith, Bob Johnson";
const swapped = names.replace(/(\w+) (\w+)/g, "$2, $1");
console.log(swapped); // "Doe, John, Smith, Jane, Johnson, Bob"
// Format phone number
const phone = "1234567890";
const formatted = phone.replace(/(\d{3})(\d{3})(\d{4})/, "($1) $2-$3");
console.log(formatted); // "(123) 456-7890"
// Use callback function for complex replacements
const html = "<p>Hello</p><div>World</div>";
const result = html.replace(/<(\w+)>(.*?)<\/\1>/g, (match, tag, content) => {
return `[${tag.toUpperCase()}]${content.toUpperCase()}[/${tag.toUpperCase()}]`;
});
console.log(result); // "[P]HELLO[/P][DIV]WORLD[/DIV]"
5.4 Non-Capturing Groups (?:)
Non-capturing groups provide grouping without saving the matched content. This keeps group numbering stable and spares the engine from storing captures you never read, which can slightly improve performance.
Basic Usage
// Version with capturing groups
const withCapture = /(https?):\/\/([\w.-]+)/;
const url1 = "https://example.com";
const match1 = url1.match(withCapture);
console.log(match1.length); // 3 (full match + 2 capturing groups)
// Version with non-capturing groups
const withoutCapture = /(?:https?):\/\/([\w.-]+)/;
const match2 = url1.match(withoutCapture);
console.log(match2.length); // 2 (full match + 1 capturing group)
Practical Applications
// Match different file extensions, but only capture filename
const filePattern = /([\w-]+)\.(?:jpg|png|gif|bmp)/i;
const filename = "photo.jpg";
const match = filename.match(filePattern);
console.log(match[1]); // "photo" (only filename is captured)
// Match URL protocol, but don't capture protocol part
const urlPattern = /(?:https?|ftp):\/\/([\w.-]+)/;
const urls = ["http://example.com", "https://test.org", "ftp://files.com"];
urls.forEach(url => {
const match = url.match(urlPattern);
if (match) {
console.log(`Domain: ${match[1]}`);
}
});
Performance Considerations
// If you don't need the captured text, prefer a non-capturing group (any speed gain is engine-dependent)
const inefficient = /(red|green|blue)/g;
const efficient = /(?:red|green|blue)/g;
const text = "red car, blue sky, green grass";
console.log(text.match(efficient)); // ["red", "blue", "green"]
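Beyond performance, the choice can change results. One place where it does is `split()`, which includes captured text in its output array. A short sketch with made-up input:

```javascript
const csvLike = "a,b;c";

// Capturing group: the separators appear in the result
const withSeparators = csvLike.split(/(,|;)/);
console.log(withSeparators); // ["a", ",", "b", ";", "c"]

// Non-capturing group: only the fields are returned
const fieldsOnly = csvLike.split(/(?:,|;)/);
console.log(fieldsOnly); // ["a", "b", "c"]
```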
5.5 Named Capturing Groups
ES2018 introduced named capturing groups, making code more readable.
Basic Syntax
// Named capturing group syntax
const pattern = /(?<name>pattern)/;
// Practical example
const datePattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const date = "2023-12-01";
const match = date.match(datePattern);
console.log(match.groups.year); // "2023"
console.log(match.groups.month); // "12"
console.log(match.groups.day); // "01"
// Still accessible via index
console.log(match[1]); // "2023"
console.log(match[2]); // "12"
console.log(match[3]); // "01"
Practical Applications
// Parse email address
const emailPattern = /^(?<username>[^\s@]+)@(?<domain>[^\s@]+)\.(?<extension>[^\s@]+)$/;
const email = "user@example.com";
const match = email.match(emailPattern);
if (match) {
const { username, domain, extension } = match.groups;
console.log({ username, domain, extension });
// { username: "user", domain: "example", extension: "com" }
}
// Parse URL
const urlPattern = /^(?<protocol>https?):\/\/(?<host>[\w.-]+)(?<port>:\d+)?(?<path>\/.*)?$/;
const url = "https://example.com:8080/api/users";
const urlMatch = url.match(urlPattern);
if (urlMatch) {
console.log(urlMatch.groups);
// { protocol: "https", host: "example.com", port: ":8080", path: "/api/users" }
}
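When an optional named group does not participate in the match, its entry in `groups` is `undefined`. The sketch below reuses the URL pattern on a URL without a port or path; falling back to port 80 for `http` is an assumption made for the example.

```javascript
const urlPartsPattern = /^(?<protocol>https?):\/\/(?<host>[\w.-]+)(?<port>:\d+)?(?<path>\/.*)?$/;
const { groups } = "http://example.com".match(urlPartsPattern);

console.log(groups.port); // undefined
console.log(groups.path); // undefined

// Strip the leading ":" when a port is present; otherwise use an assumed default
const port = groups.port ? Number(groups.port.slice(1)) : 80;
console.log(port); // 80
```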
Using Named Capturing Groups in Replacements
// Using $<name> syntax
const text = "John Doe";
const namePattern = /(?<first>\w+) (?<last>\w+)/;
const reversed = text.replace(namePattern, "$<last>, $<first>");
console.log(reversed); // "Doe, John"
// Using in callback function
const formatted = text.replace(namePattern, (match, p1, p2, offset, string, groups) => {
return `${groups.last}, ${groups.first}`;
});
console.log(formatted); // "Doe, John"
5.6 Backreferences
Backreferences allow you to reference the content of previously captured groups.
Basic Syntax
// \1 references the first capturing group, \2 the second, etc.
const repeatedPattern = /(\w+)\s+\1/; // Match repeated words
console.log(repeatedPattern.test("hello hello")); // true
console.log(repeatedPattern.test("hello world")); // false
// Match HTML tag pairs
const htmlTagPattern = /<(\w+)>(.*?)<\/\1>/;
console.log(htmlTagPattern.test("<p>content</p>")); // true
console.log(htmlTagPattern.test("<p>content</div>")); // false
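The repeated-word pattern above has no word boundaries, so it can also match when the end of one word happens to equal the next word. The following sketch compares it with a `\b`-anchored variant on made-up sentences:

```javascript
const loose = /(\w+)\s+\1/;
const strict = /\b(\w+)\s+\1\b/;

// "this is" contains "is is" once the leading "th" is skipped
console.log(loose.test("this is fine"));   // true (false positive)
console.log(strict.test("this is fine"));  // false
console.log(strict.test("it is is fine")); // true (a real repeated word)
```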
Practical Applications
// Find duplicate words
const text = "This is is a test test sentence";
const duplicateWords = /\b(\w+)\s+\1\b/g;
const duplicates = [];
let match;
while ((match = duplicateWords.exec(text)) !== null) {
duplicates.push(match[1]);
}
console.log(duplicates); // ["is", "test"]
// Match quoted content (double or single quotes)
const quotePattern = /(['"])(.*?)\1/g;
const textWithQuotes = 'He said "Hello" and she replied \'Hi there\'';
const quotes = [...textWithQuotes.matchAll(quotePattern)];
quotes.forEach(match => {
console.log(`Quote type: ${match[1]}, Content: ${match[2]}`);
});
// Quote type: ", Content: Hello
// Quote type: ', Content: Hi there
Backreferences with Named Capturing Groups
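// \k<name> references a named group inside the pattern (JavaScript supports it since ES2018)
const htmlPattern = /<(?<tag>\w+)>(?<content>.*?)<\/\k<tag>>/;
console.log(htmlPattern.test("<p>content</p>")); // true
console.log(htmlPattern.test("<p>content</div>")); // false
// Without a backreference, the closing tag is captured separately and is not required to match
const htmlPatternLoose = /<(?<tag>\w+)>(?<content>.*?)<\/(?<tag2>\w+)>/;
// Capturing both tags lets you report a mismatch instead of silently failing to match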
function validateHtmlTags(html) {
return html.replace(/<(\w+)>(.*?)<\/(\w+)>/g, (match, openTag, content, closeTag) => {
if (openTag !== closeTag) {
throw new Error(`Tag mismatch: <${openTag}> and </${closeTag}>`);
}
return match;
});
}
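As a further illustration of `\k<name>`, which is part of ES2018 alongside named groups, here is a named version of the quote-matching idea. The group names `quote` and `body` are chosen for the example.

```javascript
// \k<quote> must match the same quote character that opened the string
const quoted = /(?<quote>['"])(?<body>.*?)\k<quote>/g;
const sample = `say "hi" or 'bye'`;

const bodies = [...sample.matchAll(quoted)].map(m => m.groups.body);
console.log(bodies); // ["hi", "bye"]
```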
5.7 Conditional Grouping (Supported by Some Engines)
Conditional grouping allows selecting different matching patterns based on conditions.
Syntax
// (?(condition)yes|no) - if condition matches, use yes, otherwise use no
// (?(condition)yes) - if condition matches, use yes, otherwise match empty
// Note: JavaScript does not support conditional groups; the syntax above is shown for concept only
// (engines such as PCRE and .NET support it). Similar effects can be achieved in other ways.
Alternative Solutions in JavaScript
// Using multiple patterns and logical OR
const patterns = [
/pattern1/,
/pattern2/,
/pattern3/
];
function testMultiplePatterns(text) {
return patterns.some(pattern => pattern.test(text));
}
// Using functions to implement conditional logic
function conditionalMatch(text, condition) {
if (condition) {
return /pattern1/.exec(text);
} else {
return /pattern2/.exec(text);
}
}
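Many conditionals can also be rewritten as plain alternation inside a single pattern. A common case is "require a closing parenthesis only if an opening one was matched", which an engine with conditionals might write as `^(\()?\d{3}(?(1)\))-\d{4}$`. The sketch below is one possible JavaScript equivalent, with a made-up phone-like format:

```javascript
// Either "(ddd)" or "ddd", then "-dddd"; the two branches replace the conditional
const areaCode = /^(?:\(\d{3}\)|\d{3})-\d{4}$/;

console.log(areaCode.test("(555)-1234")); // true
console.log(areaCode.test("555-1234"));   // true
console.log(areaCode.test("(555-1234"));  // false (unclosed parenthesis)
console.log(areaCode.test("555)-1234"));  // false (unopened parenthesis)
```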
5.8 Practical Use Cases
Parsing Log Files
const logPattern = /^(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?<level>\w+)\] (?<message>.*)$/;
const logLines = [
"2023-12-01 10:30:45 [INFO] User logged in",
"2023-12-01 10:31:00 [ERROR] Database connection failed",
"2023-12-01 10:31:15 [WARN] Low memory warning"
];
const parsedLogs = logLines.map(line => {
const match = line.match(logPattern);
return match ? match.groups : null;
}).filter(Boolean);
console.log(parsedLogs);
// [
// { timestamp: "2023-12-01 10:30:45", level: "INFO", message: "User logged in" },
// { timestamp: "2023-12-01 10:31:00", level: "ERROR", message: "Database connection failed" },
// { timestamp: "2023-12-01 10:31:15", level: "WARN", message: "Low memory warning" }
// ]
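Once each line is parsed into named groups, simple aggregation becomes straightforward. This sketch counts entries per level and keeps the lines that did not match rather than dropping them; the sample lines are invented for illustration.

```javascript
const entryPattern = /^(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?<level>\w+)\] (?<message>.*)$/;

// Hypothetical input, including one malformed line
const sampleLines = [
  "2023-12-01 10:30:45 [INFO] User logged in",
  "2023-12-01 10:31:00 [ERROR] Database connection failed",
  "not a log line",
  "2023-12-01 10:32:10 [ERROR] Retry failed"
];

const countsByLevel = {};
const rejected = [];
for (const line of sampleLines) {
  const m = line.match(entryPattern);
  if (!m) {
    rejected.push(line); // keep unparsed lines for inspection
    continue;
  }
  const { level } = m.groups;
  countsByLevel[level] = (countsByLevel[level] ?? 0) + 1;
}

console.log(countsByLevel); // { INFO: 1, ERROR: 2 }
console.log(rejected);      // ["not a log line"]
```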
Formatting Data
// Format credit card number
function formatCreditCard(cardNumber) {
const pattern = /(\d{4})(\d{4})(\d{4})(\d{4})/;
return cardNumber.replace(pattern, "$1-$2-$3-$4");
}
console.log(formatCreditCard("1234567890123456")); // "1234-5678-9012-3456"
// Format phone number
function formatPhoneNumber(phone) {
const patterns = [
{ regex: /^(\d{3})(\d{3})(\d{4})$/, format: "($1) $2-$3" }, // US format
{ regex: /^(\d{3})(\d{4})(\d{4})$/, format: "$1-$2-$3" }, // China mobile
{ regex: /^(\d{4})(\d{3})(\d{3})$/, format: "$1-$2-$3" } // Other 10-digit format; never reached here, because the US pattern already matches every 10-digit input
];
for (const { regex, format } of patterns) {
if (regex.test(phone)) {
return phone.replace(regex, format);
}
}
return phone; // Return original if no match
}
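The anchored patterns above expect digits only, so input such as "123-456-7890" falls through unchanged. One possible approach is to strip non-digits first; `normalizeAndFormat` is a hypothetical helper that handles only the 10-digit US-style case.

```javascript
// Hypothetical helper: remove separators, then apply the capturing-group format
function normalizeAndFormat(input) {
  const digits = input.replace(/\D/g, "");
  // If the digit count is not 10, the pattern does not match and digits are returned as-is
  return digits.replace(/^(\d{3})(\d{3})(\d{4})$/, "($1) $2-$3");
}

console.log(normalizeAndFormat("123-456-7890")); // "(123) 456-7890"
console.log(normalizeAndFormat("123.456.7890")); // "(123) 456-7890"
console.log(normalizeAndFormat("12345"));        // "12345"
```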
Extraction and Validation
// Extract all links from text
function extractLinks(text) {
const linkPattern = /\[(?<text>[^\]]+)\]\((?<url>https?:\/\/[^\)]+)\)/g;
const links = [];
let match;
while ((match = linkPattern.exec(text)) !== null) {
links.push({
text: match.groups.text,
url: match.groups.url
});
}
return links;
}
const markdown = "Check out [Google](https://google.com) and [GitHub](https://github.com)";
console.log(extractLinks(markdown));
// [
// { text: "Google", url: "https://google.com" },
// { text: "GitHub", url: "https://github.com" }
// ]
5.9 Common Mistakes and Best Practices
Mistake 1: Too Many Capturing Groups
// Bad: Creates unnecessary capturing groups
const inefficient = /(red|green|blue) (car|bike|plane)/;
// Better: Only capture what's needed
const efficient = /(red|green|blue) (?:car|bike|plane)/;
// Or use named capturing groups for better readability
const readable = /(?<color>red|green|blue) (?<vehicle>car|bike|plane)/;
Mistake 2: Misusing Backreferences
// Wrong: Trying to reference non-existent capturing group
const wrong = /(\w+) \2/; // No second group exists; without the u flag, \2 is silently parsed as an octal escape instead
// Correct: Ensure referenced capturing group exists
const correct = /(\w+) (\w+) \1 \2/; // References existing capturing groups
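In JavaScript this mistake is easy to miss because a dangling reference is not an error by default. The sketch below shows the legacy behavior and how the `u` flag turns it into a `SyntaxError` when the pattern is constructed.

```javascript
// Legacy (non-unicode) mode: \2 is parsed as an octal escape for U+0002
const legacy = /(\w+) \2/;
console.log(legacy.test("hello hello"));  // false
console.log(legacy.test("hello \u0002")); // true (matches the control character)

// Unicode mode rejects the dangling reference when the pattern is built
let unicodeError = null;
try {
  new RegExp("(\\w+) \\2", "u");
} catch (err) {
  unicodeError = err;
}
console.log(unicodeError instanceof SyntaxError); // true
```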
Best Practices
// 1. Use named capturing groups for better readability
const readable = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
// 2. Use non-capturing groups when you don't need to capture
const efficient = /(?:Mr|Mrs|Ms)\.? (\w+)/;
// 3. Use backreferences appropriately
const htmlTags = /<(\w+)>(.*?)<\/\1>/;
// 4. Combine for flexibility
const flexible = /(?<protocol>https?):\/\/(?<domain>[\w.-]+)(?<port>:\d+)?/;
5.10 Practice Exercises
Exercise 1: Basic Grouping
Write regular expressions that:
- Match repeated word pairs (like “the the”)
- Match HTML tags and their content
- Match URLs with optional protocol part
// Answers
const duplicateWords = /\b(\w+)\s+\1\b/g;
const htmlTags = /<(\w+)>(.*?)<\/\1>/g;
const urlWithOptionalProtocol = /(https?:\/\/)?[\w.-]+/;
Exercise 2: Named Capturing Groups
Parse a timestamp in ISO 8601 UTC form (YYYY-MM-DDTHH:MM:SSZ), for example one beginning “2023-12-01T14:30”.
// Answer
const timestampPattern = /^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})T(?<hour>\d{2}):(?<minute>\d{2}):(?<second>\d{2})Z$/;
function parseTimestamp(timestamp) {
const match = timestamp.match(timestampPattern);
if (match) {
return match.groups;
}
return null;
}
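The captured values are strings, so a typical next step is converting them. This sketch builds a `Date` from the named groups; `toDate` is a hypothetical helper and the sample timestamp is an assumed input.

```javascript
const isoPattern = /^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})T(?<hour>\d{2}):(?<minute>\d{2}):(?<second>\d{2})Z$/;

// Hypothetical helper: returns a Date in UTC, or null if the format does not match
function toDate(timestamp) {
  const m = timestamp.match(isoPattern);
  if (!m) return null;
  const { year, month, day, hour, minute, second } = m.groups;
  // Date.UTC expects a zero-based month
  return new Date(Date.UTC(+year, +month - 1, +day, +hour, +minute, +second));
}

console.log(toDate("2023-12-01T14:30:45Z").toISOString()); // "2023-12-01T14:30:45.000Z"
console.log(toDate("2023-12-01 14:30:45")); // null
```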
Exercise 3: Practical Application
Write a function to extract all email addresses from text and get username and domain separately.
// Answer
function extractEmails(text) {
const emailPattern = /(?<username>[^\s@]+)@(?<domain>[^\s@]+\.[^\s@]+)/g;
const emails = [];
let match;
while ((match = emailPattern.exec(text)) !== null) {
emails.push({
full: match[0],
username: match.groups.username,
domain: match.groups.domain
});
}
return emails;
}
const text = "Contact us: admin@company.com or support@help.org";
console.log(extractEmails(text));
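One caveat with the answer above: `[^\s@]+` accepts any non-space character, so punctuation that follows an address in running text ends up in the domain. The stricter pattern below is only a sketch with assumed character classes, not a complete email grammar.

```javascript
const naive = /(?<username>[^\s@]+)@(?<domain>[^\s@]+\.[^\s@]+)/;
// Assumption: domain labels consist of word characters and hyphens, separated by dots
const stricter = /(?<username>[\w.+-]+)@(?<domain>[\w-]+(?:\.[\w-]+)+)/;

const sentence = "Write to admin@company.com.";
console.log(sentence.match(naive).groups.domain);    // "company.com." (trailing period included)
console.log(sentence.match(stricter).groups.domain); // "company.com"
```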
Summary
Grouping and capturing are core features of regular expressions:
- Basic Grouping (): Treats multiple characters as a whole, can apply quantifiers
- Capturing Groups: Save matched content, can be referenced later
- Non-Capturing Groups (?:): Provide grouping without saving content, which keeps group numbering clean and can slightly reduce overhead
- Named Capturing Groups (?<name>): Improve code readability and maintainability
- Backreferences: Reference previously captured group content for complex matching
- Practical Applications: Data parsing, formatting, validation, etc.
Mastering grouping and capturing techniques allows us to write more powerful and flexible regular expressions.