Chapter 9: Regular Expressions in Programming Languages
Haiyue
Learning Objectives
- Master the use of the RegExp object in JavaScript
- Learn Python’s re module operations
- Understand Java’s Pattern and Matcher classes
- Master PHP’s PCRE functions
- Understand syntax differences between languages
9.1 Regular Expressions in JavaScript
JavaScript supports regular expressions through the RegExp object and through the string methods that accept a pattern: match(), matchAll(), search(), replace(), and split().
RegExp Object
// Two ways to create regular expressions
const regex1 = /pattern/flags; // Literal syntax
const regex2 = new RegExp('pattern', 'flags'); // Constructor syntax
// RegExp properties
const pattern = /hello/gi;
console.log(pattern.source); // "hello"
console.log(pattern.flags); // "gi"
console.log(pattern.global); // true
console.log(pattern.ignoreCase); // true
console.log(pattern.multiline); // false
console.log(pattern.lastIndex); // 0
String Methods
const text = "Hello World, Hello Universe";
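// match() - Without the g flag, returns the first match (with index and capture groups)
console.log(text.match(/hello/i)); // ["Hello"], plus index and input properties
// match() - With the g flag, returns every matched string and no capture groups
console.log(text.match(/hello/gi)); // ["Hello", "Hello"]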
// search() - Returns match position
console.log(text.search(/world/i)); // 6
// replace() - Replaces matched content
console.log(text.replace(/hello/gi, 'Hi')); // "Hi World, Hi Universe"
// split() - Splits by pattern
console.log("a,b;c:d".split(/[,;:]/)); // ["a", "b", "c", "d"]
// matchAll() - ES2020 addition, returns iterator of all matches
const matches = [...text.matchAll(/(\w+)/g)];
matches.forEach(match => console.log(match[0]));
RegExp Methods
const pattern = /\d{2,4}/g;
const text = "12 345 6789";
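// test() - Tests if matches
console.log(pattern.test(text)); // true
// With the g flag, test() advances pattern.lastIndex (now 2). Reset it before
// reusing the same RegExp, otherwise exec() would skip the first match "12".
pattern.lastIndex = 0;
// exec() - Executes match, one result per call
let match;
while ((match = pattern.exec(text)) !== null) {
  console.log(`Match: ${match[0]}, Position: ${match.index}`);
}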
// Match: 12, Position: 0
// Match: 345, Position: 3
// Match: 6789, Position: 7
Practical Application Example
// Email Validator
class EmailValidator {
  constructor() {
    this.pattern = /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;
  }
  validate(email) {
    return this.pattern.test(email);
  }
  extractDomain(email) {
    const match = email.match(/@(.+)$/);
    return match ? match[1] : null;
  }
}
9.2 Python’s re Module
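Python’s re module, part of the standard library, provides functions for matching, searching, replacing, and splitting, along with compiled pattern objects that can be reused.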
Basic Functions
import re
# Compile regular expression
pattern = re.compile(r'\d+')
# match() - Matches from string beginning
result = re.match(r'\d+', '123abc')
if result:
    print(result.group()) # "123"
# search() - Searches for first match in entire string
result = re.search(r'\d+', 'abc123def')
if result:
    print(result.group()) # "123"
# findall() - Finds all matches
numbers = re.findall(r'\d+', '12 cats, 34 dogs, 56 birds')
print(numbers) # ['12', '34', '56']
# finditer() - Returns iterator of match objects
for match in re.finditer(r'\d+', '12 cats, 34 dogs'):
    print(f"Match: {match.group()}, Position: {match.start()}-{match.end()}")
# sub() - Replaces matched content
text = re.sub(r'\d+', '[number]', '12 cats, 34 dogs')
print(text) # "[number] cats, [number] dogs"
# split() - Splits string by pattern
parts = re.split(r'[,;:]', 'a,b;c:d')
print(parts) # ['a', 'b', 'c', 'd']
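The compiled pattern created at the top of the listing is never used afterwards, so the following minimal sketch shows how a compiled pattern exposes the same operations as methods, and how match(), search(), and fullmatch() differ in where they anchor. The variable names and sample strings are illustrative assumptions, not part of the original example.

```python
import re

# Compile once, then reuse the pattern object's own methods.
number = re.compile(r'\d+')

sample = 'abc123def'  # hypothetical input

at_start = number.match(sample)    # None: the string does not start with a digit
anywhere = number.search(sample)   # finds '123' in the middle of the string
whole = number.fullmatch(sample)   # None: the whole string is not made of digits

print(at_start, anywhere.group(), whole)   # None 123 None
print(number.fullmatch('2024').group())    # 2024
print(number.findall('12 cats, 34 dogs'))  # ['12', '34']
```

Compiling is mainly worthwhile when the same pattern is applied many times, for example inside a loop over input lines.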
Groups and Captures
# Group capturing
text = "John Doe, 25 years old"
pattern = r'(\w+) (\w+), (\d+) years old'
match = re.search(pattern, text)
if match:
    print(match.group(0)) # Complete match: "John Doe, 25 years old"
    print(match.group(1)) # "John"
    print(match.group(2)) # "Doe"
    print(match.group(3)) # "25"
    print(match.groups()) # ('John', 'Doe', '25')
# Named groups
pattern = r'(?P<first>\w+) (?P<last>\w+), (?P<age>\d+) years old'
match = re.search(pattern, text)
if match:
    print(match.groupdict()) # {'first': 'John', 'last': 'Doe', 'age': '25'}
Compilation Options
# Common flags
flags = {
    're.IGNORECASE': re.IGNORECASE, # or re.I, case insensitive
    're.MULTILINE': re.MULTILINE, # or re.M, ^ and $ match at every line
    're.DOTALL': re.DOTALL, # or re.S, dot matches newline
    're.VERBOSE': re.VERBOSE, # or re.X, whitespace and comments allowed in pattern
    're.ASCII': re.ASCII, # or re.A, \w, \d, \s match ASCII only
}
# Verbose mode example
verbose_pattern = re.compile(r'''
    ^                     # Start of string (no re.MULTILINE here)
    ([a-zA-Z0-9._%+-]+)   # Username part
    @                     # @ symbol
    ([a-zA-Z0-9.-]+)      # Domain part
    \.                    # Dot
    ([a-zA-Z]{2,})        # Top-level domain
    $                     # End of string
''', re.VERBOSE)
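As a usage sketch for verbose mode, the block below applies a pattern with the same structure to a sample address and unpacks the three groups. The address is a made-up example, and the pattern is repeated only so the snippet runs on its own. It also illustrates one consequence of re.VERBOSE that is easy to miss: unescaped spaces in the pattern are ignored, so a literal space has to be written as `[ ]` or `\ `.

```python
import re

# Same structure as the verbose pattern above, repeated so the sketch is self-contained.
email_pattern = re.compile(r'''
    ^([a-zA-Z0-9._%+-]+)   # username
    @
    ([a-zA-Z0-9.-]+)       # domain
    \.
    ([a-zA-Z]{2,})$        # top-level domain
''', re.VERBOSE)

m = email_pattern.match('jane.doe@mail.example.org')  # hypothetical address
user, domain, tld = m.groups()
print(user, domain, tld)  # jane.doe mail.example org

# In verbose mode the spaces around [ ] are ignored; the class itself matches one space.
spaced = re.compile(r'(\w+) [ ] (\w+)', re.VERBOSE)
print(spaced.match('John Doe').groups())  # ('John', 'Doe')
```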
Practical Application Example
import re

class LogAnalyzer:
    def __init__(self):
        # Log format: timestamp [level] message
        self.log_pattern = re.compile(
            r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
            r'\[(?P<level>\w+)\] '
            r'(?P<message>.*)'
        )

    def parse_log(self, log_line):
        # Returns a dict with timestamp, level, and message, or None if the line does not match
        match = self.log_pattern.match(log_line)
        if match:
            return match.groupdict()
        return None

    def filter_logs(self, log_content, level='ERROR'):
        # Keeps only the parsed lines whose level equals the requested one
        lines = log_content.split('\n')
        filtered = []
        for line in lines:
            parsed = self.parse_log(line)
            if parsed and parsed['level'] == level:
                filtered.append(parsed)
        return filtered

# Usage example
analyzer = LogAnalyzer()
log_data = """2023-12-01 10:00:00 [INFO] Application started
2023-12-01 10:01:00 [ERROR] Database connection failed
2023-12-01 10:02:00 [WARN] Memory usage high"""
errors = analyzer.filter_logs(log_data, 'ERROR')
print(errors)
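An alternative to splitting the text into lines first is to let the regex engine find the line boundaries with re.MULTILINE. The sketch below is one possible variant, not the approach used by LogAnalyzer: it anchors the same log format with ^ and $ and counts entries per level using finditer(). The name LOG_LINE is a hypothetical choice for this illustration.

```python
import re
from collections import Counter

# With re.MULTILINE, ^ and $ match at the start and end of every line.
LOG_LINE = re.compile(
    r'^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
    r'\[(?P<level>\w+)\] '
    r'(?P<message>.*)$',
    re.MULTILINE,
)

log_data = """2023-12-01 10:00:00 [INFO] Application started
2023-12-01 10:01:00 [ERROR] Database connection failed
2023-12-01 10:02:00 [WARN] Memory usage high"""

# Count how many entries were logged at each level.
level_counts = Counter(m['level'] for m in LOG_LINE.finditer(log_data))
print(level_counts['ERROR'])  # 1
print(sorted(level_counts))   # ['ERROR', 'INFO', 'WARN']
```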
9.3 Regular Expressions in Java
Java provides regular expression support through the java.util.regex package.
Pattern and Matcher Classes
import java.util.regex.*;
public class RegexExample {
    public static void main(String[] args) {
        // Compile pattern
        Pattern pattern = Pattern.compile("\\d+");
        // Create matcher
        String text = "12 cats, 34 dogs, 56 birds";
        Matcher matcher = pattern.matcher(text);
        // Find all matches
        while (matcher.find()) {
            System.out.println("Match: " + matcher.group() +
                ", Position: " + matcher.start() + "-" + matcher.end());
        }
        // Replace matches
        String replaced = pattern.matcher(text).replaceAll("[number]");
        System.out.println(replaced);
        // Split string
        Pattern splitPattern = Pattern.compile("[,;:]");
        String[] parts = splitPattern.split("a,b;c:d");
        for (String part : parts) {
            System.out.println(part);
        }
    }
}
Group Operations
import java.util.regex.*;

public class GroupExample {
    public static void main(String[] args) {
        String text = "John Doe, 25 years old";
        Pattern pattern = Pattern.compile("(\\w+) (\\w+), (\\d+) years old");
        Matcher matcher = pattern.matcher(text);
        if (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            System.out.println("First name: " + matcher.group(1));
            System.out.println("Last name: " + matcher.group(2));
            System.out.println("Age: " + matcher.group(3));
            // groupCount() excludes group 0, so this prints 3
            System.out.println("Group count: " + matcher.groupCount());
        }
    }
}
Utility Class
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexUtils {
    // Email validation
    private static final Pattern EMAIL_PATTERN = Pattern.compile(
        "^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
    );
    // Phone validation (China)
    private static final Pattern PHONE_PATTERN = Pattern.compile("^1[3-9]\\d{9}$");
    // URL extraction, compiled once instead of on every call
    private static final Pattern URL_PATTERN = Pattern.compile(
        "https?://[\\w\\.-]+(?:\\:[0-9]+)?(?:/[\\w\\._~:/?#\\[\\]@!$&'()*+,;=-]*)?",
        Pattern.CASE_INSENSITIVE
    );

    public static boolean isValidEmail(String email) {
        // matches() requires the entire input to match the pattern
        return EMAIL_PATTERN.matcher(email).matches();
    }

    public static boolean isValidPhone(String phone) {
        return PHONE_PATTERN.matcher(phone).matches();
    }

    // Extract URLs
    public static List<String> extractUrls(String text) {
        Matcher matcher = URL_PATTERN.matcher(text);
        List<String> urls = new ArrayList<>();
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }
}
9.4 Regular Expressions in PHP
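PHP exposes regular expressions through its PCRE (Perl Compatible Regular Expressions) functions, all of which carry the preg_ prefix and take the pattern as a string enclosed in delimiters.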
Basic Functions
<?php
// preg_match - Execute match
$text = "Hello World 123";
if (preg_match('/\d+/', $text, $matches)) {
    echo "Match: " . $matches[0] . "\n"; // "123"
}
// preg_match_all - Execute global match
$text = "12 cats, 34 dogs, 56 birds";
preg_match_all('/\d+/', $text, $matches);
print_r($matches[0]); // Array([0] => 12 [1] => 34 [2] => 56)
// preg_replace - Replace matches
$result = preg_replace('/\d+/', '[number]', $text);
echo $result . "\n"; // "[number] cats, [number] dogs, [number] birds"
// preg_split - Split string
$parts = preg_split('/[,;:]/', 'a,b;c:d');
print_r($parts); // Array([0] => a [1] => b [2] => c [3] => d)
// preg_grep - Filter array
$items = array('apple', 'banana', 'cherry', 'date');
$filtered = preg_grep('/a/', $items);
print_r($filtered); // Array([0] => apple [1] => banana [3] => date), original keys are preserved
?>
Groups and Captures
<?php
$text = "John Doe, 25 years old";
$pattern = '/(\w+) (\w+), (\d+) years old/';
if (preg_match($pattern, $text, $matches)) {
    echo "Full match: " . $matches[0] . "\n";
    echo "First name: " . $matches[1] . "\n";
    echo "Last name: " . $matches[2] . "\n";
    echo "Age: " . $matches[3] . "\n";
}
// Named capture groups (supported by all currently maintained PHP versions;
// the (?P<name>...) spelling is accepted as well)
$pattern = '/(?<first>\w+) (?<last>\w+), (?<age>\d+) years old/';
if (preg_match($pattern, $text, $matches)) {
    echo "First name: " . $matches['first'] . "\n";
    echo "Last name: " . $matches['last'] . "\n";
    echo "Age: " . $matches['age'] . "\n";
}
?>
Utility Class
<?php
class RegexValidator {
    // Email validation
    public static function validateEmail($email) {
        $pattern = '/^[a-zA-Z0-9.!#$%&\'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/';
        return preg_match($pattern, $email) === 1;
    }

    // Phone validation (China)
    public static function validatePhone($phone) {
        $pattern = '/^1[3-9]\d{9}$/';
        return preg_match($pattern, $phone) === 1;
    }

    // Extract all links
    public static function extractLinks($text) {
        $pattern = '/https?:\/\/[\w\.-]+(?:\:[0-9]+)?(?:\/[\w\._~:\/?#\[\]@!$&\'()*+,;=-]*)?/i';
        preg_match_all($pattern, $text, $matches);
        return $matches[0];
    }

    // Format phone number
    public static function formatPhone($phone) {
        // Strip everything that is not a digit, then group as 3-4-4
        $clean = preg_replace('/\D/', '', $phone);
        if (preg_match('/^(\d{3})(\d{4})(\d{4})$/', $clean, $matches)) {
            return $matches[1] . '-' . $matches[2] . '-' . $matches[3];
        }
        return $phone;
    }
}
// Usage example
echo RegexValidator::validateEmail('test@example.com') ? 'Valid' : 'Invalid';
echo "\n";
$links = RegexValidator::extractLinks('Visit https://example.com or http://test.org');
print_r($links);
echo RegexValidator::formatPhone('13812345678') . "\n"; // 138-1234-5678
?>
9.5 Comparison of Language Differences
Syntax Differences
// JavaScript - Regex literal, no string-level escaping needed
const jsRegex = /\d+/g;
// Java - Backslashes must be doubled inside string literals
Pattern javaPattern = Pattern.compile("\\d+");
# Python - Raw strings keep backslashes as written
pattern = re.compile(r'\d+')
// PHP - Pattern string must be wrapped in delimiters
$phpPattern = '/\d+/';
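To make the escaping difference concrete for the Python case, the short sketch below compares a doubled-backslash string (the style Java requires) with a raw string, and shows what goes wrong when `\b` is written in an ordinary string. The sample strings are assumptions chosen for illustration.

```python
import re

escaped = '\\d+\\.\\d+'  # backslashes doubled, as in a Java string literal
raw = r'\d+\.\d+'        # raw string: written exactly as the regex engine sees it

print(escaped == raw)                          # True
print(re.search(raw, 'version 3.14').group())  # 3.14

# In an ordinary string '\b' is a backspace character, not a word boundary.
print(len('\b'), len(r'\b'))                   # 1 2
print(re.search('\bcat\b', 'a cat'))           # None
print(re.search(r'\bcat\b', 'a cat').group())  # cat
```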
Modifier Differences
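// JavaScript - Available flags: d, g, i, m, s, u, v, y (u and v cannot be combined)
const flags = 'gimsuy';
# Python - Flags are combined with bitwise OR
flags = re.I | re.M | re.S
// Java - Flags are int constants combined with bitwise OR
Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.MULTILINE)
// PHP - Modifiers follow the closing delimiter
'/pattern/ims'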
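For Python, the same flags can also be written inline at the start of the pattern, which is convenient when the pattern is stored as plain data. The following sketch, using a made-up two-line string, shows that the two spellings behave the same and what changes when a flag is left out.

```python
import re

text = "first line\nSECOND line"  # hypothetical two-line input

combined = re.compile(r'^second', re.I | re.M)  # flags passed as an argument
inline = re.compile(r'(?im)^second')            # same flags written inline

print(combined.search(text).group())  # SECOND
print(inline.search(text).group())    # SECOND

# Without re.M, ^ only matches at the very start of the string.
print(re.search(r'^second', text, re.I))  # None

# Without re.S, the dot does not cross the newline.
print(re.search(r'line.SECOND', text))                # None
print(re.search(r'line.SECOND', text, re.S) is None)  # False
```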
Feature Comparison
| Feature | JavaScript | Python | Java | PHP |
|---------|------------|--------|------|-----|
| Named groups | ✓ (ES2018) | ✓ | ✓ | ✓ |
| Lookbehind | ✓ (ES2018) | ✓ (fixed width) | ✓ | ✓ |
| Recursive patterns | ✗ | ✗ | ✗ | ✓ |
| Conditional patterns | ✗ | ✓ (group-existence test only) | ✗ | ✓ |
| Atomic groups | ✗ | ✓ (3.11+) | ✓ | ✓ |
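Two Python entries deserve a concrete illustration. Python's re module accepts a limited conditional of the form `(?(id)yes|no)`, which tests whether a numbered or named group took part in the match, and its lookbehind assertions must have a fixed width. The sketch below demonstrates both; the pattern follows the bracketed-address example style from the Python documentation, and the inputs are made up.

```python
import re

# Conditional: if group 1 matched an opening '<', require a closing '>',
# otherwise require the end of the string.
bracketed = re.compile(r'(<)?\w+@\w+(?:\.\w+)+(?(1)>|$)')

print(bool(bracketed.fullmatch('<user@host.com>')))  # True
print(bool(bracketed.fullmatch('user@host.com')))    # True
print(bool(bracketed.fullmatch('<user@host.com')))   # False

# Fixed-width lookbehind: digits that directly follow a dollar sign.
print(re.findall(r'(?<=\$)\d+', 'cost: $45, qty 3'))  # ['45']

# A variable-width lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r'(?<=\$+)\d+')
    lookbehind_rejected = False
except re.error:
    lookbehind_rejected = True
print(lookbehind_rejected)  # True
```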
9.6 Cross-Language Regular Expression Best Practices
Common Pattern Library
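// Create cross-language compatible patterns.
// Each entry shows the pattern as it is written in that language's source code;
// the non-JavaScript entries are kept verbatim as text with String.raw.
const patterns = {
  email: {
    js: /^[^\s@]+@[^\s@]+\.[^\s@]+$/,
    python: String.raw`r'^[^\s@]+@[^\s@]+\.[^\s@]+$'`,
    java: String.raw`"^[^\\s@]+@[^\\s@]+\\.[^\\s@]+$"`,
    php: String.raw`'/^[^\s@]+@[^\s@]+\.[^\s@]+$/'`
  },
  phone: {
    js: /^\d{3}-?\d{3}-?\d{4}$/,
    python: String.raw`r'^\d{3}-?\d{3}-?\d{4}$'`,
    java: String.raw`"^\\d{3}-?\\d{3}-?\\d{4}$"`,
    php: String.raw`'/^\d{3}-?\d{3}-?\d{4}$/'`
  }
};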
Unified Validation Functions
// JavaScript version
function createValidator(patterns) {
  return {
    email: (value) => patterns.email.js.test(value),
    phone: (value) => patterns.phone.js.test(value)
  };
}
# Python version
import re
def create_validator(patterns):
    # Expects patterns[key]['python'] to hold a regex string for each key
    compiled = {
        key: re.compile(patterns[key]['python'])
        for key in patterns
    }
    return {
        'email': lambda value: compiled['email'].match(value) is not None,
        'phone': lambda value: compiled['phone'].match(value) is not None
    }
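The Python validator can be exercised with a small patterns dictionary, as in the sketch below. The dictionary contents mirror the email and phone patterns from the pattern library, and the sample inputs are assumptions. The sketch swaps match() for fullmatch() to highlight one portability detail: in Python, `$` also matches just before a trailing newline, so match() accepts input that JavaScript's test() would reject.

```python
import re

# Hypothetical Python-side pattern table, mirroring the library above.
patterns = {
    'email': {'python': r'^[^\s@]+@[^\s@]+\.[^\s@]+$'},
    'phone': {'python': r'^\d{3}-?\d{3}-?\d{4}$'},
}

def create_validator(patterns):
    compiled = {key: re.compile(spec['python']) for key, spec in patterns.items()}
    # fullmatch() must consume the whole string, so a trailing newline is rejected.
    return {
        key: (lambda value, rx=rx: rx.fullmatch(value) is not None)
        for key, rx in compiled.items()
    }

validate = create_validator(patterns)
print(validate['email']('user@example.com'))  # True
print(validate['phone']('555-123-4567'))      # True
print(validate['phone']('555-123-4567\n'))    # False

# With match(), the trailing newline slips through because $ matches before it.
print(re.match(patterns['phone']['python'], '555-123-4567\n') is not None)  # True
```

Binding `rx` as a default argument keeps each lambda tied to its own compiled pattern rather than to the loop variable.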
Summary
This chapter detailed the use of regular expressions in mainstream programming languages:
- JavaScript: RegExp object, string methods, ES2018 new features
- Python: re module, compilation options, named groups
- Java: Pattern/Matcher classes, group access, precompiled patterns in static utility methods
- PHP: PCRE functions, delimiters, advanced features such as recursive patterns
- Language Differences: Comparison of syntax, escaping, modifiers, and feature support
- Best Practices: Cross-language compatibility, common pattern design
Understanding these differences helps in correctly using regular expressions in different language environments and writing maintainable code.