Chapter 2: Basic Character Matching

Haiyue

Learning Objectives

  1. Master literal character matching
  2. Understand and use character classes [abc]
  3. Learn to use predefined character classes (\d, \w, \s, etc.)
  4. Master negation operations for character classes [^abc]
  5. Understand the special meaning of the dot (.)

2.1 Literal Character Matching

Literal character matching is the most basic feature of regular expressions: each ordinary character in the pattern matches that same character in the text.

Basic Examples

hello       # Matches "hello"
123         # Matches "123"
Hello       # Matches "Hello" (case-sensitive)
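
As a minimal sketch of the examples above, the following JavaScript snippet checks literal patterns with RegExp.prototype.test. The variable name greeting and the sample strings are illustrative assumptions, not part of the original text.

```javascript
// Literal patterns match the exact characters, anywhere in the input.
const greeting = /hello/;

console.log(greeting.test("say hello"));  // true: "hello" appears in the text
console.log(greeting.test("Hello"));      // false: "H" is not "h"
console.log(/123/.test("abc123def"));     // true: digits are literals too
```

Note that an unanchored literal pattern succeeds if the text contains it anywhere, not only when the whole string is equal to it.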

Special Characters That Need Escaping

Some characters have special meanings in regular expressions. If you want to match their literal meaning, you need to escape them with a backslash:

\.          # Matches the dot "."
\*          # Matches the asterisk "*"
\+          # Matches the plus "+"
\?          # Matches the question mark "?"
\^          # Matches the caret "^"
\$          # Matches the dollar sign "$"
\|          # Matches the pipe "|"
\\          # Matches the backslash "\"
\(          # Matches the left parenthesis "("
\)          # Matches the right parenthesis ")"
\[          # Matches the left bracket "["
\]          # Matches the right bracket "]"
\{          # Matches the left brace "{"
\}          # Matches the right brace "}"
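
The difference between an escaped and an unescaped metacharacter can be sketched as follows. The helper escapeRegExp is a hypothetical utility written for this illustration (it is not a built-in), and the sample strings are assumptions.

```javascript
// Unescaped, the dot is a metacharacter and matches any character.
const loose = /3.14/;
// Escaped, it matches only a literal dot.
const strict = /3\.14/;

console.log(loose.test("3x14"));   // true
console.log(strict.test("3x14"));  // false
console.log(strict.test("3.14"));  // true

// Hypothetical helper: escape every metacharacter in user-supplied text
// so it can be embedded in a pattern as a literal.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const price = new RegExp(escapeRegExp("$5.00 (USD)"));
console.log(price.test("Total: $5.00 (USD)"));  // true
console.log(price.test("Total: $5x00 (USD)"));  // false
```

A helper of this kind is mainly useful when a pattern is built from text that is not known in advance, such as a search term typed by a user.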

2.2 Character Classes [abc]

Character classes are defined with square brackets and match any single character within the brackets.

Basic Character Classes

[abc]       # Matches 'a', 'b', or 'c'
[123]       # Matches '1', '2', or '3'
[aeiou]     # Matches any lowercase vowel
[,.!?]      # Matches comma, period, exclamation mark, or question mark

Character Ranges

Use hyphens to specify character ranges:

[a-z]       # Matches any lowercase letter
[A-Z]       # Matches any uppercase letter
[0-9]       # Matches any digit
[a-zA-Z]    # Matches any letter
[0-9a-f]    # Matches lowercase hexadecimal digits
[a-zA-Z0-9] # Matches letters or digits

Combined Usage

[a-z0-9]    # Matches lowercase letters or digits
[A-Za-z]    # Matches any letter
[0-9.,]     # Matches digits, comma, or period
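
A short sketch of classes and ranges in use. It assumes JavaScript; the pattern gr[ae]y and the sample strings are chosen only for illustration.

```javascript
// A class matches exactly one character from the set.
console.log(/gr[ae]y/.test("grey"));   // true
console.log(/gr[ae]y/.test("gray"));   // true
console.log(/gr[ae]y/.test("graey"));  // false: the class consumes only one character

// Ranges: collect every hexadecimal digit (either case) in a string.
const hexDigits = "0x1F, zz".match(/[0-9a-fA-F]/g);
console.log(hexDigits);  // ["0", "1", "F"]
```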

2.3 Predefined Character Classes

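To simplify common character classes, regular expressions provide predefined shorthand classes. The bracket equivalents listed below hold in JavaScript; Unicode-aware engines such as Python 3's re module also match non-ASCII digits and letters by default.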

Basic Predefined Character Classes

\d          # Matches digits, equivalent to [0-9]
\w          # Matches word characters, equivalent to [a-zA-Z0-9_]
\s          # Matches whitespace characters (space, tab, newline, etc.)

Corresponding Negation Character Classes

\D          # Matches non-digits, equivalent to [^0-9]
\W          # Matches non-word characters, equivalent to [^a-zA-Z0-9_]
\S          # Matches non-whitespace characters

Practical Application Examples

// JavaScript example
const phonePattern = /\d{3}-\d{3}-\d{4}/;  // Matches 123-456-7890
const wordPattern = /\w+/;                 // Matches one or more word characters
const spacePattern = /\s+/;                // Matches one or more whitespace characters
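
The shorthand classes and their negations are often combined with replace, split, and match. The following is a minimal sketch with made-up input strings; the variable names raw and digitsOnly are assumptions for this example.

```javascript
const raw = "Call (123) 456-7890 now";

// \D matches every non-digit; removing them leaves only the digits.
const digitsOnly = raw.replace(/\D/g, "");
console.log(digitsOnly);  // "1234567890"

// \s+ splits on runs of whitespace; \w+ extracts runs of word characters.
console.log("a  b\tc".split(/\s+/));           // ["a", "b", "c"]
console.log("snake_case, x1!".match(/\w+/g));  // ["snake_case", "x1"]
```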

2.4 Negation of Character Classes [^abc]

Using the caret ^ as the first character in a character class negates the entire character class.

Negation Examples

[^abc]      # Matches any character except 'a', 'b', 'c'
[^0-9]      # Matches non-digit characters, equivalent to \D
[^a-zA-Z]   # Matches non-letter characters
[^aeiou]    # Matches any character that is not a lowercase vowel (including digits and punctuation)
[^\s]       # Matches non-whitespace characters, equivalent to \S

Notes

[^]         # Error in most engines; in JavaScript it matches any character (including newline)
[^\n]       # Matches any character except newline
[^a-z\s]    # Matches any character that is neither a lowercase letter nor whitespace
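
A common point of confusion is that a negated class still has to match one character; it does not mean "nothing here". The sketch below illustrates this in JavaScript with sample strings chosen for the example.

```javascript
// A negated class still matches exactly one character: any one not listed.
console.log(/[^aeiou]/.test("7"));    // true: a digit is "not a vowel"
console.log(/[^aeiou]/.test("aei"));  // false: every character is excluded

// Keep only digits by deleting everything that is not 0-9.
console.log("Tel: 555-0199".replace(/[^0-9]/g, ""));  // "5550199"

// q followed by one character that is not u; "Iraq" has nothing after q.
console.log(/q[^u]/.test("Iraqi"));  // true
console.log(/q[^u]/.test("Iraq"));   // false
```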

2.5 Special Meaning of the Dot (.)

The dot is one of the most commonly used metacharacters in regular expressions.

Basic Usage

.           # Matches any character (usually not including newline)
c.t         # Matches "cat", "cot", "cut", "c@t", etc.
h.llo       # Matches "hello", "hallo", "h3llo", etc.

Practical Examples

// Match a filename with a .txt extension
const filePattern = /.*\.txt$/;  // Matches any string that ends in ".txt"

// Match IP address (simplified version, not strict)
const ipPattern = /\d+\.\d+\.\d+\.\d+/;  // Like 192.168.1.1

Limitations of the Dot

In most regular expression engines, the dot does not match newline characters by default:

.           # Does not match \n (newline)
[\s\S]      # Matches any character (including newline)
[^]         # In JavaScript, matches any character (including newline); an error in most other engines
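
The newline behavior can be checked directly. This sketch assumes a JavaScript engine with ES2018 support; the s (dotAll) flag used in the third line is an addition not mentioned in the text above, and the variable name twoLines is illustrative.

```javascript
const twoLines = "first\nsecond";

console.log(/first.second/.test(twoLines));       // false: . does not match \n by default
console.log(/first[\s\S]second/.test(twoLines));  // true: the class covers every character
console.log(/first.second/s.test(twoLines));      // true: the s (dotAll) flag lets . match \n
console.log(/first[^]second/.test(twoLines));     // true in JavaScript
```

The [\s\S] form is the most portable of the three workarounds, since it relies only on ordinary character-class syntax.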

2.6 Combined Application Examples

Validate Username

// Username can only contain letters, digits, and underscores, length 3-16
const usernamePattern = /^[a-zA-Z0-9_]{3,16}$/;

console.log(usernamePattern.test("user123"));    // true
console.log(usernamePattern.test("user_name"));  // true
console.log(usernamePattern.test("user@name"));  // false

Extract Numbers

import re

text = "Order number: 12345, Amount: ¥199.99"
numbers = re.findall(r'\d+\.?\d*', text)
print(numbers)  # ['12345', '199.99']

Clean Text

// Remove excess whitespace
const text = "  hello   world  ";
const cleaned = text.replace(/\s+/g, ' ').trim();
console.log(cleaned);  // "hello world"

2.7 Advanced Usage of Character Classes

Escaping in Character Classes

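Inside character classes, most metacharacters lose their special meaning. Only the right bracket, the backslash, the caret, and the hyphen need escaping or careful placement: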

[.]         # Matches the dot (dot loses special meaning in character classes)
[\]]        # Matches the right bracket
[\\]        # Matches the backslash
[^-]        # Matches characters except hyphen
[-]         # Matches a hyphen (a hyphen at the beginning or end of a class is literal)
[abc-]      # Matches 'a', 'b', 'c', or '-'
[-abc]      # Matches '-', 'a', 'b', or 'c'
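
The placement rules above can be verified with a few quick checks. This is a small JavaScript sketch with sample inputs chosen for illustration.

```javascript
// A hyphen between two characters forms a range; at the edge it is a literal.
console.log(/[a-c]/.test("b"));  // true: b lies in the range a..c
console.log(/[a-c]/.test("-"));  // false: no literal hyphen in this class
console.log(/[ac-]/.test("-"));  // true: the trailing hyphen is literal
console.log(/[ac-]/.test("b"));  // false: only a, c, or -

// Dot, plus, and star are literals inside a class.
console.log("1+1=2.0*".match(/[.+*]/g));  // ["+", ".", "*"]

// ] and \ must be escaped inside a class.
console.log(/[\]\\]/.test("a]b"));  // true
```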

POSIX Character Classes (supported by some engines)

[:alnum:]   # Letters and digits
[:alpha:]   # Letters
[:digit:]   # Digits
[:lower:]   # Lowercase letters
[:upper:]   # Uppercase letters
[:space:]   # Whitespace characters

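Usage: a POSIX class is valid only inside a bracket expression, which is why it appears with double brackets: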

[[:alnum:]] # Matches letters or digits
[[:alpha:]] # Matches letters
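
JavaScript is one of the engines that does not support POSIX bracket classes, so an equivalent has to be spelled out by hand. The sketch below is one possible approach under that assumption: posixAscii and posixClass are hypothetical names invented for this example, and the table covers ASCII characters only.

```javascript
// Hypothetical lookup table: ASCII-only stand-ins for POSIX bracket classes,
// for engines such as JavaScript that do not support [[:name:]] syntax.
const posixAscii = {
  alnum: "a-zA-Z0-9",
  alpha: "a-zA-Z",
  digit: "0-9",
  lower: "a-z",
  upper: "A-Z",
  space: " \\t\\n\\r\\f\\v",
};

// Build a character class such as [a-zA-Z0-9] from one or more names.
function posixClass(...names) {
  return new RegExp("[" + names.map((n) => posixAscii[n]).join("") + "]");
}

console.log(posixClass("alnum").test("_"));           // false
console.log(posixClass("alpha").test("x"));           // true
console.log(posixClass("digit", "space").test(" "));  // true
```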

2.8 Practice Exercises

Exercise 1: Basic Matching

Write regular expressions to match the following:

  1. Match any three digits
  2. Match words starting with an uppercase letter
  3. Match strings containing the @ symbol
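
// Sample answers
const threeDigits = /\d{3}/;       // three consecutive digits
const capitalWord = /\b[A-Z]\w*/;  // \b (word boundary) keeps the match at the start of a word
const hasAt = /@/;                 // an unanchored pattern already matches anywhere in the string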

Exercise 2: Character Class Application

Write regular expressions for:

  1. Match hexadecimal color codes (like #FF0000)
  2. Match words not containing digits
  3. Match words containing vowel letters
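
// Sample answers
const hexColor = /#[0-9a-fA-F]{6}/;
const noDigitWord = /^[^\d\s]+$/;       // test one word at a time: every character is a non-digit, non-space
const hasVowel = /\w*[aeiouAEIOU]\w*/;  // a word containing at least one vowel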

Summary

Character matching is the foundation of regular expressions. Literal characters, character classes, predefined character classes, and the dot are the building blocks of every more complex pattern. Key points to remember:

  1. Literal characters match directly; special characters need escaping
  2. Character classes such as [abc] match any one of the listed characters
  3. Predefined character classes like \d \w \s simplify common matching
  4. [^abc] is used for negation matching
  5. The dot . matches any character (usually not including newline)

This foundational knowledge will be continuously used and expanded in subsequent chapters.