Chapter 4: Position Matching and Boundaries

Haiyue

Learning Objectives

  1. Master beginning-of-line (^) and end-of-line ($) anchors
  2. Learn to use word boundaries (\b) and non-word boundaries (\B)
  3. Understand string start (\A) and end (\z, \Z) positions
  4. Master basic concepts of lookahead and lookbehind assertions

4.1 Overview of Anchors

Anchors do not match any characters, but rather match positions. They are used to ensure that a pattern appears at a specific location, which is crucial for precise matching.

Characteristics of Anchors

  • Zero-width: do not consume characters
  • Positional: match positions rather than characters
  • Boundary: define boundary conditions for matching
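
One way to see the zero-width property is to inspect what an anchor actually matches. The following is a minimal sketch; the sample string and variable names are illustrative assumptions.

```javascript
// Anchors match a position, so the matched text is always the empty string.
const sample = "hello";

const startMatch = /^/.exec(sample);
console.log(startMatch[0].length); // 0
console.log(startMatch.index);     // 0

const endMatch = /$/.exec(sample);
console.log(endMatch[0].length);   // 0
console.log(endMatch.index);       // 5

// Replacing a position inserts text without removing any characters.
const wrapped = sample.replace(/^/, "[").replace(/$/, "]");
console.log(wrapped); // "[hello]"
```

Because nothing is consumed, the original five characters are still present in the wrapped result.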

4.2 Beginning-of-Line Anchor (^)

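The beginning-of-line anchor ^ matches the start position of a line. In JavaScript, it matches only the start of the whole string unless multiline mode (the m flag) is enabled.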

Basic Usage

^hello      # Matches lines starting with "hello"
^\d+        # Matches lines starting with one or more digits
^[A-Z]      # Matches lines starting with an uppercase letter

Practical Applications

// Validate if input starts with specific characters
const startsWithHello = /^hello/i;
console.log(startsWithHello.test("Hello world"));  // true
console.log(startsWithHello.test("Say hello"));    // false

// Match lines starting with digits
const text = `123 is a number
abc is letters
456 is also a number`;

const numberLines = text.split('\n').filter(line => /^\d/.test(line));
console.log(numberLines); // ["123 is a number", "456 is also a number"]

Beginning-of-Line in Multiline Mode

const multilineText = `First line
Second line
Third line`;

// Without multiline mode - only matches the beginning of the entire string
const singleMode = /^Second/;
console.log(singleMode.test(multilineText)); // false

// With multiline mode - matches the beginning of each line
const multiMode = /^Second/gm;
console.log(multilineText.match(multiMode)); // ["Second"]

4.3 End-of-Line Anchor ($)

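The end-of-line anchor $ matches the end position of a line. In JavaScript, it matches only the end of the whole string unless multiline mode (the m flag) is enabled.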

Basic Usage

world$      # Matches lines ending with "world"
\d+$        # Matches lines ending with digits
[.!?]$      # Matches lines ending with period, exclamation mark, or question mark

Practical Applications

// Validate file extension
const isTextFile = /\.txt$/;
console.log(isTextFile.test("document.txt"));  // true
console.log(isTextFile.test("document.pdf"));  // false

// Validate email format (simple version)
const emailPattern = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
console.log(emailPattern.test("user@example.com"));     // true
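console.log(emailPattern.test("user@example"));         // false (no dot after the @)
console.log(emailPattern.test("user@example.com."));    // true (this simple pattern allows a trailing dot)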

// Find lines ending with specific characters
const text = `file1.txt
file2.pdf
file3.txt
file4.doc`;

const txtFiles = text.split('\n').filter(line => /\.txt$/.test(line));
console.log(txtFiles); // ["file1.txt", "file3.txt"]
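
The multiline flag affects $ in the same way it affects ^. As a small sketch (the file listing below is made-up sample data), the same pattern can be run against a multi-line string with and without the m flag:

```javascript
// Hypothetical sample data: one file name per line.
const listing = "notes.txt\nreport.pdf\ntodo.txt";

// Without the m flag, $ matches only at the very end of the string.
const lastOnly = listing.match(/\w+\.txt$/g);
console.log(lastOnly); // ["todo.txt"]

// With the m flag, $ also matches just before each line break.
const everyLine = listing.match(/\w+\.txt$/gm);
console.log(everyLine); // ["notes.txt", "todo.txt"]
```

This avoids splitting the text into lines first, at the cost of having to remember the flag.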

4.4 Combining Beginning and End Anchors

Exact Line Matching

// Match lines that are exactly "hello"
const exactMatch = /^hello$/;
console.log(exactMatch.test("hello"));        // true
console.log(exactMatch.test("hello world"));  // false
console.log(exactMatch.test("say hello"));    // false

// Match empty lines
const emptyLine = /^$/;

// Match lines containing only whitespace
const blankLine = /^\s*$/;
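
The difference between the two patterns shows up when a line contains only spaces. A short sketch, assuming the input has already been split into lines (the array and variable names are illustrative):

```javascript
// Hypothetical input with one empty line and one whitespace-only line.
const lines = ["alpha", "", "   ", "beta"];

const emptyLinePattern = /^$/;
const blankLinePattern = /^\s*$/;

const emptyCount = lines.filter(line => emptyLinePattern.test(line)).length;
const blankCount = lines.filter(line => blankLinePattern.test(line)).length;

console.log(emptyCount); // 1
console.log(blankCount); // 2
```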

Common Validation Patterns

// Phone number validation (11 digits)
const phonePattern = /^\d{11}$/;

// Postal code validation (6 digits)
const zipPattern = /^\d{6}$/;

// Username validation (3-16 alphanumeric or underscore, starting with letter)
const usernamePattern = /^[a-zA-Z][a-zA-Z0-9_]{2,15}$/;

// IP address validation (simplified)
const ipPattern = /^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/;
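
The simplified IP pattern checks only the shape of the address, so it also accepts values such as "999.999.999.999". One possible way to tighten it is to keep the anchored pattern and add a numeric range check on each captured octet. The helper name isValidIPv4 is a hypothetical choice for this sketch, and leading zeros are still accepted.

```javascript
// Hypothetical helper: the anchored pattern checks the shape,
// and the numeric test checks that each octet is in the range 0-255.
function isValidIPv4(input) {
    const shape = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/;
    const match = shape.exec(input);
    if (match === null) {
        return false;
    }
    // match[0] is the whole address; the four octets are in match[1..4].
    return match.slice(1).every(octet => Number(octet) <= 255);
}

console.log(isValidIPv4("192.168.1.100"));   // true
console.log(isValidIPv4("999.999.999.999")); // false
console.log(isValidIPv4("192.168.1"));       // false
console.log(isValidIPv4(" 192.168.1.100"));  // false
```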

4.5 Word Boundary (\b)

The word boundary \b matches a position that has a word character on one side and a non-word character, or the start or end of the string, on the other.

What are Word Characters

  • Word characters: letters, digits, underscore [a-zA-Z0-9_]
  • Non-word characters: spaces, punctuation, special characters, etc.
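
In JavaScript this definition is ASCII-based, which has consequences for accented letters and for underscores. The checks below are a small sketch with made-up sample words:

```javascript
// In JavaScript, \w covers only [A-Za-z0-9_], so "é" counts as a non-word character.
const plainIsWord = /^\w+$/.test("cafe");
const accentedIsWord = /^\w+$/.test("café");
console.log(plainIsWord);    // true
console.log(accentedIsWord); // false

// As a result, \b sees a boundary between "f" and "é".
const boundaryBeforeAccent = /\bcaf\b/.test("café");
console.log(boundaryBeforeAccent); // true

// The underscore is a word character, so there is no boundary inside "snake_case".
const boundaryAtUnderscore = /\bsnake\b/.test("snake_case");
console.log(boundaryAtUnderscore); // false
```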

Basic Usage

\bword\b    # Matches complete word "word"
\bcat       # Matches words starting with "cat"
cat\b       # Matches words ending with "cat"

Practical Applications

// Exact word matching, avoiding partial matches
const text = "I have a cat and a caterpillar";

// Without word boundary - will match "cat" in "caterpillar"
const withoutBoundary = /cat/g;
console.log(text.match(withoutBoundary)); // ["cat", "cat"]

// With word boundary - only matches complete "cat"
const withBoundary = /\bcat\b/g;
console.log(text.match(withBoundary)); // ["cat"]

// Find words starting with specific string
const startsWithPre = /\bpre\w*/g;
const text2 = "prefix, prepare, represent, pretty";
console.log(text2.match(startsWithPre)); // ["prefix", "prepare", "pretty"]

Word Boundary Positions

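const text = "hello world";
//            ^    ^^    ^
//            1    23    4
// Each caret marks the character that follows the boundary
// Position 1: String start, before h (word boundary)
// Position 2: Between o and space (word boundary)
// Position 3: Between space and w (word boundary)
// Position 4: After d, string end (word boundary)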

const boundaries = /\b/g;
console.log(text.replace(boundaries, "|")); // "|hello| |world|"
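
To inspect boundary positions as numbers rather than as inserted characters, the match indexes can be collected directly. This is a sketch only; wordBoundaryIndexes is a hypothetical helper name, and it relies on String.prototype.matchAll, which requires the g flag.

```javascript
// List the index of every word boundary in a string.
// wordBoundaryIndexes is a hypothetical helper name.
function wordBoundaryIndexes(input) {
    return Array.from(input.matchAll(/\b/g), match => match.index);
}

console.log(wordBoundaryIndexes("hello world")); // [0, 5, 6, 11]
console.log(wordBoundaryIndexes("a-b"));         // [0, 1, 2, 3]
console.log(wordBoundaryIndexes("   "));         // []
```

A string with no word characters has no word boundaries at all, which is why the last call returns an empty array.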

4.6 Non-Word Boundary (\B)

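The non-word boundary \B matches every position where \b does not: between two word characters, between two non-word characters, or at the start or end of the string when the adjacent character is not a word character.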

Basic Usage

\Bword\B    # Matches "word" surrounded by word characters
\Bcat       # Matches "cat" not at the beginning of a word
cat\B       # Matches "cat" not at the end of a word

Practical Applications

// Match patterns inside words
const text = "JavaScript and Java are different";

// Match "ava" inside words
const internal = /\Bava\B/g;
console.log(text.match(internal)); // ["ava"] (from JavaScript)

// Compare: match complete word "Java"
const wholeWord = /\bJava\b/g;
console.log(text.match(wholeWord)); // ["Java"]

// Replace specific pattern inside words
const result = text.replace(/\Bava\B/g, "XXX");
console.log(result); // "JXXXScript and Java are different"
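
Since \B is the complement of \b, marking every \B position makes the gaps inside words visible. A brief sketch with illustrative sample strings:

```javascript
// \B matches wherever \b does not, so it marks the gaps inside words.
const phrase = "hello world";
const hyphenated = phrase.replace(/\B/g, "-");
console.log(hyphenated); // "h-e-l-l-o w-o-r-l-d"

// Between two non-word characters there is no word boundary either.
const punctuation = "a, b".replace(/\B/g, "|");
console.log(punctuation); // "a,| b"
```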

4.7 String Start and End (\A, \z, \Z)

Some regex engines provide more precise string boundaries.

\A - String Start

// Similar to ^, but never matches line beginnings, even in multiline mode
// Note: JavaScript doesn't support \A; use ^ without the m flag instead

\z and \Z - String End

// \z - Absolute end of string
// \Z - End of string, or just before a final newline
// Note: JavaScript doesn't support these; $ without the m flag matches only at the absolute end, like \z

Implementation in JavaScript

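// Simulate \A behavior: anchor at the string start and drop the m flag
function stringStart(pattern) {
    // The non-capturing group keeps ^ applied to the whole pattern, including alternations
    return new RegExp('^(?:' + pattern.source + ')', pattern.flags.replace('m', ''));
}

// Simulate \z behavior: anchor at the string end and drop the m flag
function stringEnd(pattern) {
    return new RegExp('(?:' + pattern.source + ')$', pattern.flags.replace('m', ''));
}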
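
The practical difference between these anchors shows up with line breaks in the input. The sketch below uses JavaScript's ^ and $ only, with made-up sample strings, to show when they behave like \A and \z and when they do not:

```javascript
// A value with a trailing newline, for example read from a file.
const input = "abc\n";

// Without the m flag, $ matches only at the very end of the string,
// so it behaves like \z rather than \Z.
const strictEnd = /^abc$/.test(input);
console.log(strictEnd); // false

// With the m flag, $ also matches before the line break.
const lineEnd = /^abc$/m.test(input);
console.log(lineEnd); // true

// Without the m flag, ^ matches only at the string start, like \A.
const strictStart = /^def/.test("abc\ndef");
console.log(strictStart); // false

// With the m flag, ^ also matches after a line break, unlike \A.
const lineStart = /^def/m.test("abc\ndef");
console.log(lineStart); // true
```

For validation of a whole value, leaving the m flag off is usually the safer choice, because a second line cannot satisfy the anchors.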

4.8 Preview of Assertions

Assertions are an advanced form of position matching. We’ll introduce them briefly here and cover them in detail in Chapter 6.

Lookahead Assertions

foo(?=bar)  # Matches "foo" followed by "bar"
foo(?!bar)  # Matches "foo" not followed by "bar"

Lookbehind Assertions

(?<=foo)bar # Matches "bar" preceded by "foo"
(?<!foo)bar # Matches "bar" not preceded by "foo"

Simple Examples

// Lookahead assertion example
const text = "foo123 foobar foobaz";

// Match "foo" followed by digits
const followedByDigit = /foo(?=\d)/g;
console.log(text.match(followedByDigit)); // ["foo"]

// Match "foo" not followed by "bar"
const notFollowedByBar = /foo(?!bar)/g;
console.log(text.match(notFollowedByBar)); // ["foo", "foo"]
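
Lookbehind works the same way in the other direction. The following is a minimal sketch with made-up sample text; it assumes a JavaScript engine with lookbehind support (ES2018 or later).

```javascript
// Hypothetical sample text mixing prices and plain numbers.
const order = "cost $30, 45 items, $7 shipping";

// Positive lookbehind: digits preceded by "$".
const prices = order.match(/(?<=\$)\d+/g);
console.log(prices); // ["30", "7"]

// Negative lookbehind: whole numbers not preceded by "$".
// The \b stops the engine from matching the "0" inside "30".
const counts = order.match(/(?<!\$)\b\d+/g);
console.log(counts); // ["45"]
```

In both cases the "$" is checked but not included in the match, which is what distinguishes an assertion from an ordinary character in the pattern.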

4.9 Practical Use Cases

Password Validation

// Password must contain uppercase, lowercase letters and numbers, 8-16 characters
function validatePassword(password) {
    const hasUpper = /[A-Z]/.test(password);
    const hasLower = /[a-z]/.test(password);
    const hasNumber = /\d/.test(password);
    const validLength = /^.{8,16}$/.test(password);

    return hasUpper && hasLower && hasNumber && validLength;
}

// More concise approach using lookahead assertions
const passwordPattern = /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,16}$/;
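
In the lookahead version, each (?=...) is evaluated at the position right after ^ and consumes nothing, so the three checks are independent of character order; .{8,16}$ then enforces the length. A quick sketch with made-up sample passwords:

```javascript
// Same pattern as above, repeated so the sketch is self-contained.
const passwordCheck = /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,16}$/;

// Hypothetical sample passwords.
const samples = ["Passw0rdOK", "password1", "PASSWORD1", "Pw1", "NoDigitsHere"];

const accepted = samples.filter(candidate => passwordCheck.test(candidate));
console.log(accepted); // ["Passw0rdOK"]
```

The separate-checks function is easier to extend with specific error messages, while the single pattern is convenient where only one regex can be supplied.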

Data Extraction

// Extract IP addresses and timestamps from logs
const logEntry = "2023-12-01 10:30:45 192.168.1.100 GET /api/users";

// Extract IP address (word boundaries ensure complete match)
const ipPattern = /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/;
const ip = logEntry.match(ipPattern)[0]; // "192.168.1.100"

// Extract timestamp (beginning anchor ensures correct position)
const timePattern = /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}/;
const timestamp = logEntry.match(timePattern)[0]; // "2023-12-01 10:30:45"
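
Indexing match(...)[0] directly throws a TypeError when the pattern does not occur, because match returns null in that case. One possible guard is a small wrapper; firstMatch is a hypothetical helper name, and the log lines are made-up samples.

```javascript
// firstMatch is a hypothetical helper: it returns the matched text,
// or null when the pattern does not occur in the input.
function firstMatch(input, pattern) {
    const match = input.match(pattern);
    return match === null ? null : match[0];
}

const ipv4Shape = /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/;

const found = firstMatch("10:30:45 192.168.1.100 GET /api/users", ipv4Shape);
const missing = firstMatch("10:30:45 localhost GET /api/users", ipv4Shape);

console.log(found);   // "192.168.1.100"
console.log(missing); // null
```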

Text Cleaning

// Clean excess whitespace
// Note: \s also matches newlines, so per-line trimming uses [ \t] instead
function cleanText(text) {
    return text
        .replace(/^[ \t]+/gm, '')   // Remove leading spaces and tabs on each line
        .replace(/[ \t]+$/gm, '')   // Remove trailing spaces and tabs on each line
        .replace(/^$\n/gm, '');     // Remove empty lines
}

// Remove HTML tags but keep content
const removeHtmlTags = /<[^>]*>/g;
const htmlText = "<p>Hello <strong>world</strong>!</p>";
const plainText = htmlText.replace(removeHtmlTags, ''); // "Hello world!"

4.10 Common Pitfalls and Notes

Pitfall 1: ^ and $ in Multiline Mode

const text = "first line\nsecond line";

// Without multiline mode
console.log(/^second/.test(text)); // false

// With multiline mode
console.log(/^second/m.test(text)); // true

Pitfall 2: Word Boundary Definition

const text = "hello-world";

// \b is a boundary at the hyphen (because - is not a word character)
console.log(text.match(/\bhello\b/)); // ["hello"]
console.log(text.match(/\bworld\b/)); // ["world"]

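// But this may not be what we want: "hello" matches even though it is only part of "hello-world"
console.log(text.match(/\bhello-world\b/)); // ["hello-world"]

// To treat a hyphenated compound as one word, exclude hyphens explicitly
console.log(text.match(/(?<![\w-])hello(?![\w-])/)); // null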

Pitfall 3: Empty Strings and Boundaries

// Empty string matches beginning and end
console.log(/^$/.test("")); // true

// But note \b behavior with empty strings
console.log(/\b/.test("")); // false (no word characters)

4.11 Practice Exercises

Exercise 1: Format Validation

Write regular expressions to validate the following formats:

  1. Chinese mobile phone number (11 digits, starting with 1)
  2. Email address (username@domain.suffix)
  3. URL (starting with http:// or https://)

// Answers
const phonePattern = /^1\d{10}$/;
const emailPattern = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
const urlPattern = /^https?:\/\/.+/;

Exercise 2: Text Processing

  1. Extract the first word of each line
  2. Match files ending with specific suffixes
  3. Remove line number prefixes

// Answers
const firstWords = /^\w+/gm;
const imageFiles = /\.(jpg|png|gif|bmp)$/i;
const removeNumbers = /^\d+\.\s*/gm;

const text = "1. First item\n2. Second item\n3. Third item";
const cleaned = text.replace(removeNumbers, '');

Exercise 3: Advanced Application

Write a function to validate if a password meets the following requirements:

  • 8-20 characters long
  • Contains uppercase letters
  • Contains lowercase letters
  • Contains numbers
  • Contains special characters

// Answer
function validateStrongPassword(password) {
    const patterns = [
        /^.{8,20}$/,           // Length check
        /[A-Z]/,               // Uppercase letter
        /[a-z]/,               // Lowercase letter
        /\d/,                  // Number
        /[!@#$%^&*(),.?":{}|<>]/ // Special character
    ];

    return patterns.every(pattern => pattern.test(password));
}

// Or a more concise approach using lookahead assertions
const strongPasswordPattern = /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*(),.?":{}|<>]).{8,20}$/;

Summary

Position matching and boundaries are key tools for precise regex matching:

  1. Beginning and End Anchors: ^ and $ are used to match the start and end of lines
  2. Word Boundaries: \b matches word boundaries, \B matches non-word boundaries
  3. Combined Use: Anchors can be combined for precise format validation
  4. Multiline Mode: Affects the behavior of ^ and $
  5. Assertion Preview: Preparing for advanced position matching

Mastering these concepts is crucial for writing accurate regular expressions, as they help us precisely control the position and boundaries of matches.