Chapter 4: Position Matching and Boundaries
Learning Objectives
- Master beginning-of-line (^) and end-of-line ($) anchors
- Learn to use word boundaries (\b) and non-word boundaries (\B)
- Understand string start (\A) and end (\Z) positions
- Master basic concepts of lookahead and lookbehind assertions
4.1 Overview of Anchors
Anchors do not match any characters, but rather match positions. They are used to ensure that a pattern appears at a specific location, which is crucial for precise matching.
Characteristics of Anchors
- Zero-width: do not consume characters
- Positional: match positions rather than characters
- Boundary: define boundary conditions for matching
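The zero-width property can be observed directly by inspecting what a match returns. The following is a minimal sketch; the variable names and the sample string are chosen here only for illustration.

```javascript
// Hypothetical sample input, used only for illustration.
const sample = "hello world";

// ^ alone matches the position before "h": an empty match at index 0.
const anchorOnly = sample.match(/^/);
console.log(anchorOnly[0].length, anchorOnly.index); // 0 0

// The anchor contributes nothing to the matched text: only "hello" is consumed.
const anchored = sample.match(/^hello/);
console.log(anchored[0], anchored[0].length); // hello 5

// $ alone matches the position after the last character.
const endAnchor = sample.match(/$/);
console.log(endAnchor.index === sample.length); // true
```

In each case the anchor constrains where the match may occur without adding any characters to the result.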
4.2 Beginning-of-Line Anchor (^)
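The beginning-of-line anchor ^ matches the start of the input. By default that is the start of the whole string; with the multiline flag (m) it also matches the start of each line.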
Basic Usage
^hello # Matches lines starting with "hello"
^\d+ # Matches lines starting with one or more digits
^[A-Z] # Matches lines starting with an uppercase letter
Practical Applications
// Validate if input starts with specific characters
const startsWithHello = /^hello/i;
console.log(startsWithHello.test("Hello world")); // true
console.log(startsWithHello.test("Say hello")); // false
// Match lines starting with digits
const text = `123 is a number
abc is letters
456 is also a number`;
const numberLines = text.split('\n').filter(line => /^\d/.test(line));
console.log(numberLines); // ["123 is a number", "456 is also a number"]
Beginning-of-Line in Multiline Mode
const multilineText = `First line
Second line
Third line`;
// Without multiline mode - only matches the beginning of the entire string
const singleMode = /^Second/;
console.log(singleMode.test(multilineText)); // false
// With multiline mode - matches the beginning of each line
const multiMode = /^Second/gm;
console.log(multilineText.match(multiMode)); // ["Second"]
4.3 End-of-Line Anchor ($)
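The end-of-line anchor $ matches the end of the input. By default that is the end of the whole string; with the multiline flag (m) it also matches the end of each line.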
Basic Usage
world$ # Matches lines ending with "world"
\d+$ # Matches lines ending with digits
[.!?]$ # Matches lines ending with period, exclamation mark, or question mark
Practical Applications
// Validate file extension
const isTextFile = /\.txt$/;
console.log(isTextFile.test("document.txt")); // true
console.log(isTextFile.test("document.pdf")); // false
// Validate email format (simple version)
const emailPattern = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
console.log(emailPattern.test("user@example.com")); // true
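console.log(emailPattern.test("user@example")); // false (no dot-separated suffix)
console.log(emailPattern.test("user@example.com.")); // true (this simple pattern accepts a trailing dot)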
// Find lines ending with specific characters
const text = `file1.txt
file2.pdf
file3.txt
file4.doc`;
const txtFiles = text.split('\n').filter(line => /\.txt$/.test(line));
console.log(txtFiles); // ["file1.txt", "file3.txt"]
4.4 Combining Beginning and End Anchors
Exact Line Matching
// Match lines that are exactly "hello"
const exactMatch = /^hello$/;
console.log(exactMatch.test("hello")); // true
console.log(exactMatch.test("hello world")); // false
console.log(exactMatch.test("say hello")); // false
// Match empty lines
const emptyLine = /^$/;
// Match lines containing only whitespace
const blankLine = /^\s*$/;
Common Validation Patterns
// Phone number validation (11 digits)
const phonePattern = /^\d{11}$/;
// Postal code validation (6 digits)
const zipPattern = /^\d{6}$/;
// Username validation (3-16 alphanumeric or underscore, starting with letter)
const usernamePattern = /^[a-zA-Z][a-zA-Z0-9_]{2,15}$/;
// IP address validation (simplified)
const ipPattern = /^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/;
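The simplified IP pattern only checks the shape of the input, so it also accepts values such as "999.999.999.999". One possible way to tighten it, sketched here with a hypothetical helper name `isValidIPv4`, is to keep the anchored pattern and check each captured octet numerically.

```javascript
// Hypothetical helper, not part of the original chapter.
function isValidIPv4(input) {
  // Same anchored shape as the simplified pattern, with each octet captured.
  const match = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(input);
  if (match === null) return false;
  // Every captured octet must be in the range 0-255.
  // Note: leading zeros such as "01" are still accepted by this sketch.
  return match.slice(1).every(octet => Number(octet) <= 255);
}

console.log(isValidIPv4("192.168.1.100")); // true
console.log(isValidIPv4("999.1.1.1"));     // false
console.log(isValidIPv4("1.2.3"));         // false
```

The anchors still do the structural work; the numeric check only covers what a short pattern cannot express conveniently.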
4.5 Word Boundary (\b)
The word boundary \b matches positions between word characters and non-word characters.
What are Word Characters
- Word characters: letters, digits, underscore [a-zA-Z0-9_]
- Non-word characters: spaces, punctuation, special characters, etc.
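Because the word-character set is limited to [a-zA-Z0-9_] in JavaScript, accented letters count as non-word characters. The short sketch below illustrates the consequence for \b; the sample word is an assumption chosen for illustration, and the accented letter is written as an escape (\u00e9) to avoid encoding issues.

```javascript
// \w accepts letters, digits, and the underscore.
const wordChars = "a_1-!".match(/\w/g);
console.log(wordChars); // ["a", "_", "1"]

// "caf\u00e9" is "cafe" with an accented final letter.
const accented = "caf\u00e9";

// The accented letter is not a word character, so a boundary falls right after "f".
console.log(/\bcaf\b/.test(accented)); // true

// No boundary exists after the accented letter at the end of the string,
// because neither side of that position is a word character.
console.log(/\bcaf\u00e9\b/.test(accented)); // false
```

For text that is not plain ASCII, \b may therefore split words at unexpected places.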
Basic Usage
\bword\b # Matches complete word "word"
\bcat # Matches "cat" at the beginning of a word
cat\b # Matches "cat" at the end of a word
Practical Applications
// Exact word matching, avoiding partial matches
const text = "I have a cat and a caterpillar";
// Without word boundary - will match "cat" in "caterpillar"
const withoutBoundary = /cat/g;
console.log(text.match(withoutBoundary)); // ["cat", "cat"]
// With word boundary - only matches complete "cat"
const withBoundary = /\bcat\b/g;
console.log(text.match(withBoundary)); // ["cat"]
// Find words starting with specific string
const startsWithPre = /\bpre\w*/g;
const text2 = "prefix, prepare, represent, pretty";
console.log(text2.match(startsWithPre)); // ["prefix", "prepare", "pretty"]
Word Boundary Positions
const text = "hello world";
// \b matches at four positions in this string:
// Position 1: String start, before h
// Position 2: Between o and the space
// Position 3: Between the space and w
// Position 4: After d, string end
const boundaries = /\b/g;
console.log(text.replace(boundaries, "|")); // "|hello| |world|"
4.6 Non-Word Boundary (\B)
The non-word boundary \B matches every position where \b does not: between two word characters, or between two non-word characters.
Basic Usage
\Bword\B # Matches "word" surrounded by word characters
\Bcat # Matches "cat" not at the beginning of a word
cat\B # Matches "cat" not at the end of a word
Practical Applications
// Match patterns inside words
const text = "JavaScript and Java are different";
// Match "ava" inside words
const internal = /\Bava\B/g;
console.log(text.match(internal)); // ["ava"] (from JavaScript)
// Compare: match complete word "Java"
const wholeWord = /\bJava\b/g;
console.log(text.match(wholeWord)); // ["Java"]
// Replace specific pattern inside words
const result = text.replace(/\Bava\B/g, "XXX");
console.log(result); // "JXXXScript and Java are different"
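To see where \B matches, the same marker technique used for \b can be applied. This is a small sketch with an illustrative sample string; it also checks that every position in the string is either a \b position or a \B position.

```javascript
// Hypothetical sample input, used only for illustration.
const sample = "hello world";

// Mark every non-word-boundary position with "|".
console.log(sample.replace(/\B/g, "|")); // "h|e|l|l|o w|o|r|l|d"

// A string of length n has n + 1 positions; each one is either \b or \B.
const boundaryCount = sample.match(/\b/g).length;    // 4
const nonBoundaryCount = sample.match(/\B/g).length; // 8
console.log(boundaryCount + nonBoundaryCount === sample.length + 1); // true
```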
4.7 String Start and End (\A, \z, \Z)
Some regex engines provide more precise string boundaries.
\A - String Start
// \A matches only at the start of the string, even in multiline mode
// Note: JavaScript doesn't support \A; use ^ without the m flag instead
\z and \Z - String End
// \z - Absolute end of string
// \Z - End of string, or just before a final newline
// Note: JavaScript doesn't support these; $ without the m flag behaves like \z
Implementation in JavaScript
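// Simulate \A behavior: anchor the whole pattern at the string start
function stringStart(pattern) {
  // The non-capturing group keeps alternations such as a|b under the anchor
  return new RegExp('^(?:' + pattern.source + ')', pattern.flags.replace('m', ''));
}
// Simulate \z behavior: anchor the whole pattern at the string end
function stringEnd(pattern) {
  return new RegExp('(?:' + pattern.source + ')$', pattern.flags.replace('m', ''));
}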
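The difference between string anchors and line anchors shows up most clearly when the input ends with a newline. The following sketch uses an illustrative input string and only built-in JavaScript behavior; the \Z emulation with an optional final newline is one possible approach, not a built-in feature.

```javascript
// Hypothetical sample input ending with a newline.
const withTrailingNewline = "line one\nline two\n";

// Without the m flag, $ matches only at the very end of the string (like \z).
console.log(/two$/.test(withTrailingNewline));   // false
console.log(/two\n$/.test(withTrailingNewline)); // true

// With the m flag, $ also matches just before each newline.
console.log(/two$/m.test(withTrailingNewline));  // true

// A \Z-like check: allow one optional newline before the end of the string.
console.log(/two\n?$/.test(withTrailingNewline)); // true

// Without the m flag, ^ matches only at the start of the string (like \A).
console.log(/^line two/.test(withTrailingNewline));  // false
console.log(/^line two/m.test(withTrailingNewline)); // true
```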
4.8 Preview of Assertions
Assertions are an advanced form of position matching. We’ll introduce them briefly here and cover them in detail in Chapter 6.
Lookahead Assertions
foo(?=bar) # Matches "foo" followed by "bar"
foo(?!bar) # Matches "foo" not followed by "bar"
Lookbehind Assertions
(?<=foo)bar # Matches "bar" preceded by "foo"
(?<!foo)bar # Matches "bar" not preceded by "foo"
Simple Examples
// Lookahead assertion example
const text = "foo123 foobar foobaz";
// Match "foo" followed by digits
const followedByDigit = /foo(?=\d)/g;
console.log(text.match(followedByDigit)); // ["foo"]
// Match "foo" not followed by "bar"
const notFollowedByBar = /foo(?!bar)/g;
console.log(text.match(notFollowedByBar)); // ["foo", "foo"]
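The examples above cover lookahead only. A comparable sketch for lookbehind follows; the sample string and variable names are assumptions made for illustration, and lookbehind requires an engine with ES2018 support.

```javascript
// Hypothetical sample input, used only for illustration.
const priceText = "Price: $42, qty 7";

// Positive lookbehind: digits preceded by "$". The "$" is not part of the match.
const afterDollar = /(?<=\$)\d+/g;
console.log(priceText.match(afterDollar)); // ["42"]

// Negative lookbehind: whole numbers not preceded by "$".
// The \b prevents a match from starting in the middle of "42".
const notAfterDollar = /(?<!\$)\b\d+/g;
console.log(priceText.match(notAfterDollar)); // ["7"]
```

As with anchors, the assertion checks the surrounding text without consuming it.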
4.9 Practical Use Cases
Password Validation
// Password must contain uppercase, lowercase letters and numbers, 8-16 characters
function validatePassword(password) {
  const hasUpper = /[A-Z]/.test(password);
  const hasLower = /[a-z]/.test(password);
  const hasNumber = /\d/.test(password);
  const validLength = /^.{8,16}$/.test(password);
  return hasUpper && hasLower && hasNumber && validLength;
}
// More concise approach using lookahead assertions
const passwordPattern = /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,16}$/;
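In the lookahead version, each (?=...) is checked from the start of the string and consumes nothing, so the three conditions are tested independently before .{8,16} checks the length. A quick sketch with illustrative sample passwords (the sample values are assumptions) shows which condition rejects each input.

```javascript
const passwordPattern = /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,16}$/;

// Hypothetical sample inputs, one per failure mode.
const samples = [
  "Passw0rd",           // meets every requirement
  "password1",          // no uppercase letter
  "PASSWORD1",          // no lowercase letter
  "Pw0",                // too short
  "Abcdefg1hijklmnopq"  // 18 characters, too long
];

const results = samples.map(sample => passwordPattern.test(sample));
console.log(results); // [true, false, false, false, false]
```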
Data Extraction
// Extract IP addresses and timestamps from logs
const logEntry = "2023-12-01 10:30:45 192.168.1.100 GET /api/users";
// Extract IP address (word boundaries ensure complete match)
const ipPattern = /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/;
const ip = logEntry.match(ipPattern)[0]; // "192.168.1.100"
// Extract timestamp (beginning anchor ensures correct position)
const timePattern = /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}/;
const timestamp = logEntry.match(timePattern)[0]; // "2023-12-01 10:30:45"
Text Cleaning
// Clean excess whitespace
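function cleanText(text) {
  // [ \t] is used instead of \s because \s also matches newlines,
  // which would swallow blank lines while trimming
  // To remove blank lines between text lines entirely, use
  // .replace(/\n{2,}/g, '\n') as the last step instead
  return text
    .replace(/^[ \t]+/gm, '') // Remove leading spaces and tabs on each line
    .replace(/[ \t]+$/gm, '') // Remove trailing spaces and tabs on each line
    .replace(/\n{3,}/g, '\n\n'); // Collapse runs of blank lines into one blank line
}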
// Remove HTML tags but keep content
const removeHtmlTags = /<[^>]*>/g;
const htmlText = "<p>Hello <strong>world</strong>!</p>";
const plainText = htmlText.replace(removeHtmlTags, ''); // "Hello world!"
4.10 Common Pitfalls and Notes
Pitfall 1: ^ and $ in Multiline Mode
const text = "first line\nsecond line";
// Without multiline mode
console.log(/^second/.test(text)); // false
// With multiline mode
console.log(/^second/m.test(text)); // true
Pitfall 2: Word Boundary Definition
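const text = "hello-world";
// The hyphen is not a word character, so \b matches on both sides of it
console.log(text.match(/\bhello\b/)); // ["hello"]
console.log(text.match(/\bworld\b/)); // ["world"]
// This may not be what we want if "hello-world" should count as one word:
// \bhello\b also matches inside hyphenated words
// The full hyphenated form still matches, because \b holds at both ends
console.log(text.match(/\bhello-world\b/)); // ["hello-world"]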
Pitfall 3: Empty Strings and Boundaries
// Empty string matches beginning and end
console.log(/^$/.test("")); // true
// But note \b behavior with empty strings
console.log(/\b/.test("")); // false (no word characters)
4.11 Practice Exercises
Exercise 1: Format Validation
Write regular expressions to validate the following formats:
- Chinese mobile phone number (11 digits, starting with 1)
- Email address (username@domain.suffix)
- URL (starting with http:// or https://)
// Answers
const phonePattern = /^1\d{10}$/;
const emailPattern = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
const urlPattern = /^https?:\/\/.+/;
Exercise 2: Text Processing
- Extract the first word of each line
- Match files ending with specific suffixes
- Remove line number prefixes
// Answers
const firstWords = /^\b\w+\b/gm;
const imageFiles = /\.(jpg|png|gif|bmp)$/i;
const removeNumbers = /^\d+\.\s*/gm;
const text = "1. First item\n2. Second item\n3. Third item";
const cleaned = text.replace(removeNumbers, '');
Exercise 3: Advanced Application
Write a function to validate if a password meets the following requirements:
- 8-20 characters long
- Contains uppercase letters
- Contains lowercase letters
- Contains numbers
- Contains special characters
// Answer
function validateStrongPassword(password) {
  const patterns = [
    /^.{8,20}$/, // Length check
    /[A-Z]/, // Uppercase letter
    /[a-z]/, // Lowercase letter
    /\d/, // Number
    /[!@#$%^&*(),.?":{}|<>]/ // Special character
  ];
  return patterns.every(pattern => pattern.test(password));
}
// Or a more concise approach using lookahead assertions
const strongPasswordPattern = /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*(),.?":{}|<>]).{8,20}$/;
Summary
Position matching and boundaries are key tools for precise regex matching:
- Beginning and End Anchors: ^ and $ are used to match the start and end of lines
- Word Boundaries: \b matches word boundaries, \B matches non-word boundaries
- Combined Use: Anchors can be combined for precise format validation
- Multiline Mode: Affects the behavior of ^ and $
- Assertion Preview: Preparing for advanced position matching
Mastering these concepts is crucial for writing accurate regular expressions, as they help us precisely control the position and boundaries of matches.