Chapter 2: Basic Character Matching
Learning Objectives
- Master literal character matching
- Understand and use character classes [abc]
- Learn to use predefined character classes (\d, \w, \s, etc.)
- Master negation operations for character classes [^abc]
- Understand the special meaning of the dot (.)
2.1 Literal Character Matching
Literal character matching is the simplest form of regular expression: each ordinary character in the pattern matches that same character in the text.
Basic Examples
hello # Matches "hello"
123 # Matches "123"
Hello # Matches "Hello" (case-sensitive)
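To check the literal patterns above, here is a minimal JavaScript sketch. The sample strings and variable names are illustrative assumptions, not part of the original examples.

```javascript
// A literal pattern matches its exact character sequence anywhere in the input.
const greeting = /hello/;

const matchesLower = greeting.test("say hello there"); // true: "hello" appears in the text
const matchesUpper = greeting.test("Hello there");     // false: "H" is not "h"

// The i flag switches to case-insensitive matching.
const matchesAnyCase = /hello/i.test("Hello there");   // true

console.log(matchesLower, matchesUpper, matchesAnyCase); // true false true
```

Note that test() succeeds on a partial match; the pattern does not have to cover the whole string.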
Special Characters That Need Escaping
Some characters have special meanings in regular expressions. To match one of them literally, escape it with a backslash:
\. # Matches the dot "."
\* # Matches the asterisk "*"
\+ # Matches the plus "+"
\? # Matches the question mark "?"
\^ # Matches the caret "^"
\$ # Matches the dollar sign "$"
\| # Matches the pipe "|"
\\ # Matches the backslash "\"
\( # Matches the left parenthesis "("
\) # Matches the right parenthesis ")"
\[ # Matches the left bracket "["
\] # Matches the right bracket "]"
\{ # Matches the left brace "{"
\} # Matches the right brace "}"
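The difference between an escaped and an unescaped metacharacter is easiest to see side by side. The sketch below is one possible illustration; escapeForRegex is a hypothetical helper written for this example, not a built-in function.

```javascript
// Unescaped, the dot is a metacharacter and matches any single character.
const loose = /1.5/;
// Escaped, it matches only a literal ".".
const strict = /1\.5/;

const looseHits = [loose.test("1.5"), loose.test("125")];    // [true, true]
const strictHits = [strict.test("1.5"), strict.test("125")]; // [true, false]

// Hypothetical helper: put a backslash before every metacharacter in a string
// so the string can be embedded in a pattern as literal text.
function escapeForRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const price = new RegExp(escapeForRegex("$9.99"));
console.log(price.source);              // \$9\.99
console.log(price.test("cost: $9.99")); // true
console.log(price.test("cost: $9x99")); // false
```

A helper like this is useful when the text to match comes from user input rather than from a pattern you wrote by hand.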
2.2 Character Classes [abc]
Character classes are defined with square brackets and match any single character within the brackets.
Basic Character Classes
[abc] # Matches 'a', 'b', or 'c'
[123] # Matches '1', '2', or '3'
[aeiou] # Matches any lowercase vowel
[,.!?] # Matches comma, period, exclamation mark, or question mark
Character Ranges
Use hyphens to specify character ranges:
[a-z] # Matches any lowercase letter
[A-Z] # Matches any uppercase letter
[0-9] # Matches any digit
[a-zA-Z] # Matches any letter
[0-9a-f] # Matches a lowercase hexadecimal digit
[a-zA-Z0-9] # Matches letters or digits
Combined Usage
[a-z0-9] # Matches lowercase letters or digits
[A-Za-z] # Matches any letter
[0-9.,] # Matches digits, comma, or period
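A character class always stands for exactly one character, however many characters or ranges it lists. The following JavaScript sketch illustrates this on an invented sample sentence (the text and variable names are assumptions made for the example).

```javascript
const sample = "Order A7 shipped to Bay 12, dock c3.";

// [A-Z] matches one uppercase ASCII letter at a time.
const uppercase = sample.match(/[A-Z]/g);           // ["O", "A", "B"]
// [0-9] matches one digit at a time, so "12" yields two matches.
const digits = sample.match(/[0-9]/g);              // ["7", "1", "2", "3"]
// Two classes in a row consume two characters: a letter, then a digit.
const letterDigit = sample.match(/[a-zA-Z][0-9]/g); // ["A7", "c3"]

console.log(uppercase, digits, letterDigit);
```

To match a run of characters from the same class, the class needs a quantifier such as + after it.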
2.3 Predefined Character Classes
To simplify common character classes, regular expressions provide predefined character classes:
Basic Predefined Character Classes
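\d # Matches a digit; equivalent to [0-9] in JavaScript (Unicode-aware engines such as Python 3 also match other decimal digits)
\w # Matches a word character; equivalent to [a-zA-Z0-9_] in JavaScript (Unicode-aware engines also match non-ASCII letters)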
\s # Matches whitespace characters (space, tab, newline, etc.)
Corresponding Negation Character Classes
\D # Matches non-digits, equivalent to [^0-9]
\W # Matches non-word characters, equivalent to [^a-zA-Z0-9_]
\S # Matches non-whitespace characters
Practical Application Examples
// JavaScript example
const phonePattern = /\d{3}-\d{3}-\d{4}/; // Matches 123-456-7890
const wordPattern = /\w+/; // Matches one or more word characters
const spacePattern = /\s+/; // Matches one or more whitespace characters
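The patterns above are only declared, not applied. As a rough sketch of how they might be used, the snippet below repeats the phone pattern and runs all three kinds of class against an invented message string (the message and variable names are assumptions for illustration).

```javascript
const phonePattern = /\d{3}-\d{3}-\d{4}/;
const message = "Call 123-456-7890\tbefore  noon";

// Without the g flag, match() returns the first match (or null if there is none).
const phone = message.match(phonePattern)[0]; // "123-456-7890"
// With the g flag, match() returns every match; "-" is not a word character.
const words = message.match(/\w+/g);          // ["Call", "123", "456", "7890", "before", "noon"]
// Splitting on \s+ treats any run of spaces or tabs as a single separator.
const fields = message.split(/\s+/);          // ["Call", "123-456-7890", "before", "noon"]

console.log(phone, words, fields);
```

Because phonePattern is not anchored, it also finds a phone-shaped substring inside a longer run of digits; validation would need ^ and $ around it.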
2.4 Negation of Character Classes [^abc]
Using the caret ^ as the first character in a character class negates the entire character class.
Negation Examples
[^abc] # Matches any character except 'a', 'b', 'c'
[^0-9] # Matches non-digit characters, equivalent to \D
[^a-zA-Z] # Matches non-letter characters
[^aeiou] # Matches any character that is not a lowercase vowel (including digits, spaces, and punctuation)
[^\s] # Matches non-whitespace characters, equivalent to \S
Notes
[^] # Empty negated class: a syntax error in most engines (JavaScript is an exception and treats it as "any character")
[^\n] # Matches any character except newline
[^a-z\s] # Matches any character that is neither a lowercase letter nor whitespace
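Two properties of negated classes are easy to overlook: they match everything outside the listed set, and they still have to consume one character. A minimal JavaScript sketch, using invented sample strings:

```javascript
// Keep only digits by deleting every character that is NOT 0-9.
const digitsOnly = "Tel: (555) 010-9999".replace(/[^0-9]/g, ""); // "5550109999"

// [^aeiou] does not mean "consonants": digits, spaces, and uppercase letters match too.
const notLowerVowel = "a1 E".match(/[^aeiou]/g); // ["1", " ", "E"]

// A negated class must consume a character, so it cannot match at the end of input.
const atEnd = /q[^u]/.test("Iraq");     // false: nothing follows "q"
const followed = /q[^u]/.test("Iraqi"); // true: "q" is followed by "i"

console.log(digitsOnly, notLowerVowel, atEnd, followed);
```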
2.5 Special Meaning of the Dot (.)
The dot is one of the most commonly used metacharacters in regular expressions.
Basic Usage
. # Matches any character (usually not including newline)
c.t # Matches "cat", "cot", "cut", "c@t", etc.
h.llo # Matches "hello", "hallo", "h3llo", etc.
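The dot matches exactly one character, no more and no fewer. The sketch below anchors the pattern with ^ and $ so that only whole-string matches count; the candidate list is an assumption chosen for illustration.

```javascript
const candidates = ["cat", "c@t", "c t", "ct", "coat"];
const dotPattern = /^c.t$/;

// "ct" fails because the dot needs one character; "coat" fails because it has two.
const results = candidates.map((word) => dotPattern.test(word));

console.log(results); // [true, true, true, false, false]
```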
Practical Examples
// Match filename and extension
const filePattern = /.*\.txt$/; // Matches .txt files
// Match IP address (simplified version, not strict)
const ipPattern = /\d+\.\d+\.\d+\.\d+/; // Like 192.168.1.1
Limitations of the Dot
In most regular expression engines, the dot does not match newline characters by default:
. # Does not match \n (newline)
[\s\S] # Matches any character (including newline)
[^] # In JavaScript, matches any character (including newline); most other engines reject it
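In JavaScript these alternatives can be compared directly. The sketch below also uses the s (dotAll) flag, which is not mentioned above; it was added in ES2018 and makes the dot match newlines as well.

```javascript
const twoLines = "first\nsecond";

const dotDefault = /first.second/.test(twoLines);      // false: "." does not match "\n"
const anyChar = /first[\s\S]second/.test(twoLines);    // true: \s covers the newline
const emptyNegation = /first[^]second/.test(twoLines); // true, but JavaScript-specific
const dotAll = /first.second/s.test(twoLines);         // true: the s flag lets "." match "\n"

console.log(dotDefault, anyChar, emptyNegation, dotAll); // false true true true
```

Of the three workarounds, [\s\S] is the most portable across engines.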
2.6 Combined Application Examples
Validate Username
// Username can only contain letters, digits, and underscores, length 3-16
const usernamePattern = /^[a-zA-Z0-9_]{3,16}$/;
console.log(usernamePattern.test("user123")); // true
console.log(usernamePattern.test("user_name")); // true
console.log(usernamePattern.test("user@name")); // false
Extract Numbers
import re
text = "Order number: 12345, Amount: ¥199.99"
numbers = re.findall(r'\d+\.?\d*', text)
print(numbers) # ['12345', '199.99']
Clean Text
// Remove excess whitespace
const text = "  hello    world  ";
const cleaned = text.replace(/\s+/g, ' ').trim();
console.log(cleaned); // "hello world"
2.7 Advanced Usage of Character Classes
Escaping in Character Classes
Inside a character class, most metacharacters lose their special meaning. Only a few characters (], \, ^, and -) need escaping or careful placement:
[.] # Matches the dot (dot loses special meaning in character classes)
[\]] # Matches the right bracket
[\\] # Matches the backslash
[^-] # Matches characters except hyphen
[-] # Matches hyphen (at beginning or end)
[abc-] # Matches 'a', 'b', 'c', or '-'
[-abc] # Matches '-', 'a', 'b', or 'c'
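These rules can be confirmed with a short JavaScript sketch. The input strings are invented for the example.

```javascript
// Inside a class, ".", "+", and "*" are ordinary characters.
const operators = "3.5+2*4".match(/[.+*]/g); // [".", "+", "*"]

// A hyphen between two characters forms a range; at the end of the class it is literal.
const rangeHit = /[a-c]/.test("b");    // true: "b" lies inside a..c
const literalMiss = /[ac-]/.test("b"); // false: the class holds only "a", "c", and "-"
const literalHit = /[ac-]/.test("-");  // true

// "]" and "\" must be escaped to appear inside a class.
const specials = "a]b\\c".match(/[\]\\]/g); // ["]", "\\"]

console.log(operators, rangeHit, literalMiss, literalHit, specials);
```

Moving a hyphen from the end of a class to the middle silently turns it into a range, which is a common source of bugs when a class is edited later.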
POSIX Character Classes (supported by some engines)
[:alnum:] # Letters and digits
[:alpha:] # Letters
[:digit:] # Digits
[:lower:] # Lowercase letters
[:upper:] # Uppercase letters
[:space:] # Whitespace characters
Usage (the POSIX name must sit inside an outer bracket expression):
[[:alnum:]] # Matches letters or digits
[[:alpha:]] # Matches letters
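JavaScript is one of the engines without POSIX classes, and it does not report an error when it sees one. The sketch below shows what happens instead and two rough stand-ins; treating \p{L} as a replacement for [[:alpha:]] is an approximation, since POSIX classes depend on the locale.

```javascript
// JavaScript parses [[:alpha:]] as the class [[:alph] followed by a literal "]".
const posixLookalike = /[[:alpha:]]/;
const matchesLetter = posixLookalike.test("x");  // false: no "]" follows
const matchesOddity = posixLookalike.test("a]"); // true: "a" is in the class, then "]"

// Rough stand-ins: explicit ASCII ranges, or a Unicode property escape with the u flag.
const asciiAlnum = /^[A-Za-z0-9]+$/;
const unicodeAlpha = /^\p{L}+$/u;

const asciiResult = asciiAlnum.test("abc123");       // true
const asciiAccent = /^[A-Za-z]+$/.test("Zo\u00EB");  // false: "ë" is outside A-Z and a-z
const unicodeAccent = unicodeAlpha.test("Zo\u00EB"); // true: "ë" is a Unicode letter

console.log(matchesLetter, matchesOddity, asciiResult, asciiAccent, unicodeAccent);
```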
2.8 Practice Exercises
Exercise 1: Basic Matching
Write regular expressions to match the following:
- Match any three digits
- Match words starting with an uppercase letter
- Match strings containing the @ symbol
// Sample answers
const threeDigits = /\d{3}/;
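const capitalWord = /\b[A-Z]\w*/; // \b (word boundary) keeps the match at the start of a word
const hasAt = /@/; // test() succeeds if "@" appears anywhere in the string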
Exercise 2: Character Class Application
Write regular expressions for:
- Match hexadecimal color codes (like #FF0000)
- Match words not containing digits
- Match words containing vowel letters
// Sample answers
const hexColor = /#[0-9a-fA-F]{6}/;
const noDigitWord = /\b[a-zA-Z]+\b/; // a whole word made of letters only
const hasVowel = /\w*[aeiouAEIOU]\w*/; // a word containing at least one vowel
Summary
Character matching is the foundation of regular expressions. Literal characters, character classes, predefined character classes, and the dot are the building blocks of every more complex pattern. Key points to remember:
- Literal characters match directly, special characters need escaping
- Character classes use [abc] to match any one character within
- Predefined character classes like \d, \w, \s simplify common matching
- [^abc] is used for negation matching
- The dot . matches any character (usually not including newline)
This foundational knowledge will be continuously used and expanded in subsequent chapters.