A regular expression literal is
a sequence of characters enclosed in backquotes. Several characters
have special meanings within regular expression literals. These characters
keep their special meaning unless preceded by a backslash. The special
characters are ., \, [, ], ?, *, +, ^, and $.
A regular expression literal can contain character escape sequences.
The allowed escape sequences are \t, \r, \n, and \xdd,
which stand for a tab, a carriage return, a newline, and the character
whose ASCII code is the two-digit hexadecimal number dd, respectively.
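As an illustration, the following Python sketch shows how the special characters could be escaped and what the escape sequences stand for. The helper escape_special is a hypothetical name introduced here, and Python's re module serves only as a stand-in engine: it accepts \t, \r, \n, and \xdd with the same meaning, but it is not the engine this section describes.

```python
import re

# The nine special characters listed above.
SPECIAL = set(".\\[]?*+^$")

def escape_special(text: str) -> str:
    """Hypothetical helper: put a backslash before every special character."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)

print(escape_special("a.b*"))   # a\.b\*
print(escape_special("[x]$"))   # \[x\]\$

# Once escaped, the text matches only itself.
assert re.fullmatch(escape_special("a.b*"), "a.b*")
assert not re.fullmatch(escape_special("a.b*"), "axbb")

# Escape sequences, checked with Python's re as a stand-in.
assert re.fullmatch(r"\t", "\t")     # tab
assert re.fullmatch(r"\r", "\r")     # carriage return
assert re.fullmatch(r"\n", "\n")     # newline
assert re.fullmatch(r"\x41", "A")    # 0x41 is the ASCII code of A
```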
Regular expressions have a recursive structure. They
are formed from smaller subexpressions. The building blocks of all
regular expressions are the expressions to match a single character.
These fundamental expressions have the following three forms:
- Period (.) is a regular expression that matches
any single character. For example, `.` matches a, 5,
#, \n, ", and so on.
- Any character other than a special character, or
a special character preceded by a backslash, is a regular expression
that matches that character. For example, `a`, `5`, and `\*` match a, 5,
and *, respectively.
- A set of characters enclosed in square brackets
is a regular expression that matches any one character in the set.
For example, `[abc]` matches any one of a, b, or c.
If the first character in the set is a caret (^), then the regular
expression matches the complement of the given set of characters.
So `[^abc]` matches any single character other than a, b, or c.
As a matter of convenience, a range of characters can be specified
with a dash. The range includes all characters between the lower and
upper bounds, inclusive. For example, `[a-zA-Z0-9]` matches any
single letter or digit.
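The three single-character forms can be tried out with a short Python sketch. Python's re module is again only a stand-in, and the helper name matches is an assumption of this example. One difference worth noting: Python's period does not match \n unless the DOTALL flag is set, so the flag is passed here to mirror the behavior described above.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if pattern matches the whole of text (Python re as stand-in)."""
    return re.fullmatch(pattern, text, re.DOTALL) is not None

# Period matches any single character, including a newline (because of DOTALL).
assert all(matches(".", ch) for ch in ["a", "5", "#", "\n", '"'])
assert not matches(".", "ab")          # exactly one character, not two

# Ordinary characters and escaped special characters match themselves.
assert matches("a", "a") and matches("5", "5") and matches(r"\*", "*")

# A set matches any one of its members; a leading caret complements the set.
assert all(matches("[abc]", ch) for ch in "abc")
assert not matches("[abc]", "d")
assert matches("[^abc]", "d") and not matches("[^abc]", "a")

# A dash specifies an inclusive range.
assert matches("[a-zA-Z0-9]", "q") and matches("[a-zA-Z0-9]", "7")
assert not matches("[a-zA-Z0-9]", "_")
```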
From these building blocks, larger regular
expressions can be formed in the following way:
- If re1 and re2 are
regular expressions, then re1 re2 (concatenation)
is a regular expression that matches all strings of the form s1s2,
where s1 is matched by re1 and s2 is
matched by re2. For example, `[ab][01]`
matches a0, a1, b0, and b1.
- If re is a regular expression, then re? is
a regular expression that matches zero or one occurrence of re.
For example, `ab?` matches a or ab, and `a[01]?` matches a, a0,
or a1.
- If re is a regular expression, then re* is
a regular expression that matches zero or more occurrences of re.
For example, `ab*` matches a, ab, abb, abbb, …,
and `a[01]*` matches a, a0, a1, a00, a01, a10, a11, ….
- If re is a regular expression, then re+ is
a regular expression that matches one or more occurrences of re.
For example, `ab+` matches ab, abb, abbb, …,
and `a[01]+` matches a0, a1, a00, a01, a10, a11, ….
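The example lists for concatenation and the three postfix operators can be reproduced by brute force. The sketch below, with the hypothetical helper matching_strings, enumerates every string over a small alphabet up to a given length and keeps those that a pattern matches in full. Python's re module is used as a stand-in engine, which agrees with the description above for these four operators.

```python
import re
from itertools import product

def matching_strings(pattern: str, alphabet: str, max_len: int) -> list:
    """List every string over alphabet, up to max_len characters,
    that pattern matches in full (shortest strings first)."""
    found = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if re.fullmatch(pattern, s):
                found.append(s)
    return found

print(matching_strings("[ab][01]", "ab01", 2))  # ['a0', 'a1', 'b0', 'b1']
print(matching_strings("a[01]?", "a01", 3))     # ['a', 'a0', 'a1']
print(matching_strings("a[01]*", "a01", 3))     # ['a', 'a0', 'a1', 'a00', 'a01', 'a10', 'a11']
print(matching_strings("a[01]+", "a01", 3))     # ['a0', 'a1', 'a00', 'a01', 'a10', 'a11']
```

The lists for * and + differ only in the bare a, which is the zero-occurrence case.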
Note: The postfix operators ?, *, and + bind more tightly than
concatenation. Therefore, ab* means a(b*) and not (ab)*.
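The effect of this precedence rule can be checked with a small Python sketch. The grouping parentheses are Python's own syntax and are used here only to spell out the two readings; parentheses are not among the special characters listed earlier, so the literals described in this section may not support grouping at all. The helper name language is hypothetical.

```python
import re
from itertools import product

def language(pattern: str, alphabet: str = "ab", max_len: int = 4) -> set:
    """Set of all strings over alphabet, up to max_len characters,
    that pattern matches in full (Python re as stand-in)."""
    result = set()
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if re.fullmatch(pattern, s):
                result.add(s)
    return result

# ab* is read as a(b*): one a followed by any number of b characters.
assert language("ab*") == language("a(b*)")
assert language("ab*") != language("(ab)*")

print(sorted(language("ab*")))     # ['a', 'ab', 'abb', 'abbb']
print(sorted(language("(ab)*")))   # ['', 'ab', 'abab']
```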
A complete regular expression can be anchored
to the beginning or end of a string with ^ and $, respectively. If re is
a regular expression, then ^re is a regular expression that
matches all strings matchable by re, but only if they occur
at the beginning of a string. Similarly, re$ is
a regular expression that matches all strings matchable by re, but
only if they occur at the end of a string. For example, `^[01]+` matches 0 and 0110 but
not a0 or a0110, and `[01]+$` matches 0 and 0110 but
not 0a or 0110a.
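The anchoring examples can be checked with the Python sketch below. It assumes that an unanchored pattern may match anywhere inside a string, which is what Python's re.search does; the text above does not say how the engine described here is invoked, so this is an assumption, and the helper name found is hypothetical.

```python
import re

def found(pattern: str, text: str) -> bool:
    """Return True if pattern matches somewhere in text (Python re as stand-in)."""
    return re.search(pattern, text) is not None

# ^ ties the match to the beginning of the string.
assert found("^[01]+", "0") and found("^[01]+", "0110")
assert not found("^[01]+", "a0") and not found("^[01]+", "a0110")

# $ ties the match to the end of the string.
assert found("[01]+$", "0") and found("[01]+$", "0110")
assert not found("[01]+$", "0a") and not found("[01]+$", "0110a")

# Without anchors, the rejected strings do contain a match.
assert found("[01]+", "a0") and found("[01]+", "0a")
```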