JavaScript: The Definitive GuideJavaScript: The Definitive GuideSearch this book

Chapter 10. Pattern Matching with Regular Expressions

Contents:

Defining Regular Expressions
String Methods for Pattern Matching
The RegExp Object

A regular expression is an object that describes a pattern of characters. The JavaScript RegExp class represents regular expressions, and both String and RegExp define methods that use regular expressions to perform powerful pattern-matching and search-and-replace functions on text.[36]

[36]The term "regular expression" is an obscure one that dates back many years. The syntax used to describe a textual pattern is indeed a type of expression. However, as we'll see, that syntax is far from regular! A regular expression is sometimes called a "regexp" or even an "RE."

JavaScript regular expressions were standardized in ECMAScript v3. JavaScript 1.2 implements a subset of the regular expression features required by ECMAScript v3, and JavaScript 1.5 implements the full standard. JavaScript regular expressions are strongly based on the regular expression facilities of the Perl programming language. Roughly speaking, we can say that JavaScript 1.2 implements Perl 4 regular expressions, and JavaScript 1.5 implements a large subset of Perl 5 regular expressions.

This chapter begins by defining the syntax that regular expressions use to describe textual patterns. Then it moves on to describe the String and RegExp methods that use regular expressions.

10.1. Defining Regular Expressions

In JavaScript, regular expressions are represented by RegExp objects. RegExp objects may be created with the RegExp( ) constructor, of course, but they are more often created using a special literal syntax. Just as string literals are specified as characters within quotation marks, regular expression literals are specified as characters within a pair of slash (/) characters. Thus, your JavaScript code may contain lines like this:

var pattern = /s$/; 

This line creates a new RegExp object and assigns it to the variable pattern. This particular RegExp object matches any string that ends with the letter "s". (We'll talk about the grammar for defining patterns shortly.) This regular expression could have equivalently been defined with the RegExp( ) constructor like this:

var pattern = new RegExp("s$"); 

Creating a RegExp object, either literally or with the RegExp( ) constructor, is the easy part. The more difficult task is describing the desired pattern of characters using regular expression syntax. JavaScript adopts a fairly complete subset of the regular expression syntax used by Perl, so if you are an experienced Perl programmer, you already know how to describe patterns in JavaScript.

Regular expression pattern specifications consist of a series of characters. Most characters, including all alphanumeric characters, simply describe characters to be matched literally. Thus, the regular expression /java/ matches any string that contains the substring "java". Other characters in regular expressions are not matched literally, but have special significance. For example, the regular expression /s$/ contains two characters. The first, "s", matches itself literally. The second, "$", is a special metacharacter that matches the end of a string. Thus, this regular expression matches any string that contains the letter "s" as its last character.

The following sections describe the various characters and metacharacters used in JavaScript regular expressions. Note, however, that a complete tutorial on regular expression grammar is beyond the scope of this book. For complete details of the syntax, consult a book on Perl, such as Programming Perl, by Larry Wall, Tom Christiansen, and Jon Orwant (O'Reilly). Mastering Regular Expressions, by Jeffrey E.F. Friedl (O'Reilly), is another excellent source of information on regular expressions.

10.1.1. Literal Characters

As we've seen, all alphabetic characters and digits match themselves literally in regular expressions. JavaScript regular expression syntax also supports certain nonalphabetic characters through escape sequences that begin with a backslash (\). For example, the sequence \n matches a literal newline character in a string. Table 10-1 lists these characters.

Table 10-1. Regular expression literal characters

Character

Matches

Alphanumeric character

Itself

\0

The NUL character (\u0000)

\t

Tab (\u0009)

\n

Newline (\u000A)

\v

Vertical tab (\u000B)

\f

Form feed (\u000C)

\r

Carriage return (\u000D)

\xnn

The Latin character specified by the hexadecimal number nn; for example, \x0A is the same as \n

\uxxxx

The Unicode character specified by the hexadecimal number xxxx; for example, \u0009 is the same as \t

\cX

The control character ^X; for example, \cJ is equivalent to the newline character \n

A number of punctuation characters have special meanings in regular expressions. They are:

^ $ . * + ? = ! : | \ / ( ) [ ] { }  

We'll learn the meanings of these characters in the sections that follow. Some of these characters have special meaning only within certain contexts of a regular expression and are treated literally in other contexts. As a general rule, however, if you want to include any of these punctuation characters literally in a regular expression, you must precede them with a \. Other punctuation characters, such as quotation marks and @, do not have special meaning and simply match themselves literally in a regular expression.

If you can't remember exactly which punctuation characters need to be escaped with a backslash, you may safely place a backslash before any punctuation character. On the other hand, note that many letters and numbers have special meaning when preceded by a backslash, so any letters or numbers that you want to match literally should not be escaped with a backslash. To include a backslash character literally in a regular expression, you must escape it with a backslash, of course. For example, the following regular expression matches any string that includes a backslash: /\\/.

10.1.2. Character Classes

Individual literal characters can be combined into character classes by placing them within square brackets. A character class matches any one character that is contained within it. Thus, the regular expression /[abc]/ matches any one of the letters a, b, or c. Negated character classes can also be defined -- these match any character except those contained within the brackets. A negated character class is specified by placing a caret (^) as the first character inside the left bracket. The regexp /[^abc]/ matches any one character other than a, b, or c. Character classes can use a hyphen to indicate a range of characters. To match any one lowercase character from the Latin alphabet, use /[a-z]/, and to match any letter or digit from the Latin alphabet, use /[a-zA-Z0-9]/.

Because certain character classes are commonly used, the JavaScript regular expression syntax includes special characters and escape sequences to represent these common classes. For example, \s matches the space character, the tab character, and any other Unicode whitespace character, and \S matches any character that is not Unicode whitespace. Table 10-2 lists these characters and summarizes character class syntax. (Note that several of these character class escape sequences match only ASCII characters and have not been extended to work with Unicode characters. You can explicitly define your own Unicode character classes; for example, /[\u0400-04FF]/ matches any one Cyrillic character.)

Table 10-2. Regular expression character classes

Character

Matches

[...]

Any one character between the brackets.

[^...]

Any one character not between the brackets.

.

Any character except newline or another Unicode line terminator.

\w

Any ASCII word character. Equivalent to [a-zA-Z0-9_].

\W

Any character that is not an ASCII word character. Equivalent to [^a-zA-Z0-9_].

\s

Any Unicode whitespace character.

\S

Any character that is not Unicode whitespace. Note that \w and \S are not the same thing.

\d

Any ASCII digit. Equivalent to [0-9].

\D

Any character other than an ASCII digit. Equivalent to [^0-9].

[\b]

A literal backspace (special case).

Note that the special character class escapes can be used within square brackets. \s matches any whitespace character and \d matches any digit, so /[\s\d]/ matches any one whitespace character or digit. Note that there is one special case. As we'll see later, the \b escape has a special meaning. When used within a character class, however, it represents the backspace character. Thus, to represent a backspace character literally in a regular expression, use the character class with one element: /[\b]/.

10.1.3. Repetition

With the regular expression syntax we have learned so far, we can describe a two-digit number as /\d\d/ and a four-digit number as /\d\d\d\d/. But we don't have any way to describe, for example, a number that can have any number of digits or a string of three letters followed by an optional digit. These more complex patterns use regular expression syntax that specifies how many times an element of a regular expression may be repeated.

The characters that specify repetition always follow the pattern to which they are being applied. Because certain types of repetition are quite commonly used, there are special characters to represent these cases. For example, + matches one or more occurrences of the previous pattern. Table 10-3 summarizes the repetition syntax. The following lines show some examples:

/\d{2,4}/     // Match between two and four digits
/\w{3}\d?/    // Match exactly three word characters and an optional digit
/\s+java\s+/  // Match "java" with one or more spaces before and after
/[^"]*/       // Match zero or more non-quote characters

Table 10-3. Regular expression repetition characters

Character

Meaning

{n,m}

Match the previous item at least n times but no more than m times.

{n,}

Match the previous item n or more times.

{n}

Match exactly n occurrences of the previous item.

?

Match zero or one occurrences of the previous item. That is, the previous item is optional. Equivalent to {0,1}.

+

Match one or more occurrences of the previous item. Equivalent to {1,}.

*

Match zero or more occurrences of the previous item. Equivalent to {0,}.

Be careful when using the * and ? repetition characters. Since these characters may match zero instances of whatever precedes them, they are allowed to match nothing. For example, the regular expression /a*/ actually matches the string "bbbb", because the string contains zero occurrences of the letter a!

10.1.3.1. Non-greedy repetition

The repetition characters listed in Table 10-3 match as many times as possible while still allowing any following parts of the regular expression to match. We say that the repetition is "greedy." It is also possible (in JavaScript 1.5 and later -- this is one of the Perl 5 features not implemented in JavaScript 1.2) to specify that repetition should be done in a non-greedy way. Simply follow the repetition character or characters with a question mark: ??, +?, *?, or even {1,5}?. For example, the regular expression /a+/ matches one or more occurrences of the letter a. When applied to the string "aaa", it matches all three letters. But /a+?/ matches one or more occurrences of the letter a, matching as few characters as necessary. When applied to the same string, this pattern matches only the first letter a.

Using non-greedy repetition may not always produce the results you expect. Consider the pattern /a*b/, which matches zero or more letters a followed by the letter b. When applied to the string "aaab", it matches the entire string. Now let's use the non-greedy version: /a*?b/. This should match the letter b preceded by the fewest number of a's possible. When applied to the same string "aaab", you might expect it to match only the last letter b. In fact, however, this pattern matches the entire string as well, just like the greedy version of the pattern. This is because regular expression pattern matching is done by finding the first position in the string at which a match is possible. The non-greedy version of our pattern does match at the first character of the string, so this match is returned; matches at subsequent characters are never even considered.

10.1.4. Alternation, Grouping, and References

The regular expression grammar includes special characters for specifying alternatives, grouping subexpressions, and referring to previous subexpressions. The | character separates alternatives. For example, /ab|cd|ef/ matches the string "ab" or the string "cd" or the string "ef". And /\d{3}|[a-z]{4}/ matches either three digits or four lowercase letters.

Note that alternatives are considered left to right until a match is found. If the left alternative matches, the right alternative is ignored, even if it would have produced a "better" match. Thus, when the pattern /a|ab/ is applied to the string "ab", it matches only the first letter.

Parentheses have several purposes in regular expressions. One purpose is to group separate items into a single subexpression, so that the items can be treated as a single unit by |, *, +, ?, and so on. For example, /java(script)?/ matches "java" followed by the optional "script". And /(ab|cd)+|ef)/ matches either the string "ef" or one or more repetitions of either of the strings "ab" or "cd".

Another purpose of parentheses in regular expressions is to define subpatterns within the complete pattern. When a regular expression is successfully matched against a target string, it is possible to extract the portions of the target string that matched any particular parenthesized subpattern. (We'll see how these matching substrings are obtained later in the chapter.) For example, suppose we are looking for one or more lowercase letters followed by one or more digits. We might use the pattern /[a-z]+\d+/. But suppose we only really care about the digits at the end of each match. If we put that part of the pattern in parentheses (/[a-z]+(\d+)/), we can extract the digits from any matches we find, as explained later.

A related use of parenthesized subexpressions is to allow us to refer back to a subexpression later in the same regular expression. This is done by following a \ character by a digit or digits. The digits refer to the position of the parenthesized subexpression within the regular expression. For example, \1 refers back to the first subexpression and \3 refers to the third. Note that, because subexpressions can be nested within others, it is the position of the left parenthesis that is counted. In the following regular expression, for example, the nested subexpression ([Ss]cript) is referred to as \2:

/([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/ 

A reference to a previous subexpression of a regular expression does not refer to the pattern for that subexpression, but rather to the text that matched the pattern. Thus, references can be used to enforce a constraint that separate portions of a string contain exactly the same characters. For example, the following regular expression matches zero or more characters within single or double quotes. However, it does not require the opening and closing quotes to match (i.e., both single quotes or both double quotes):

/['"][^'"]*['"]/ 

To require the quotes to match, we can use a reference:

/(['"])[^'"]*\1/ 

The \1 matches whatever the first parenthesized subexpression matched. In this example, it enforces the constraint that the closing quote match the opening quote. This regular expression does not allow single quotes within double-quoted strings or vice versa. It is not legal to use a reference within a character class, so we cannot write:

/(['"])[^\1]*\1/ 

Later in this chapter, we'll see that this kind of reference to a parenthesized sub-expression is a powerful feature of regular expression search-and-replace operations.

In JavaScript 1.5 (but not JavaScript 1.2), it is possible to group items in a regular expression without creating a numbered reference to those items. Instead of simply grouping the items within ( and ), begin the group with (?: and end it with ). Consider the following pattern, for example:

/([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)/ 

Here, the subexpression (?:[Ss]cript) is used simply for grouping, so the ? repetition character can be applied to the group. These modified parentheses do not produce a reference, so in this regular expression, \2 refers to the text matched by (fun\w*).

Table 10-4 summarizes the regular expression alternation, grouping, and referencing operators.

Table 10-4. Regular expression alternation, grouping, and reference characters

Character

Meaning

|

Alternation. Match either the subexpressions to the left or the subexpression to the right.

(...)

Grouping. Group items into a single unit that can be used with *, +, ?, |, and so on. Also remember the characters that match this group for use with later references.

(?:...)

Grouping only. Group items into a single unit, but do not remember the characters that match this group.

\n

Match the same characters that were matched when group number n was first matched. Groups are subexpressions within (possibly nested) parentheses. Group numbers are assigned by counting left parentheses from left to right. Groups formed with (?: are not numbered.

10.1.5. Specifying Match Position

We've seen that many elements of a regular expression match a single character in a string. For example, \s matches a single character of whitespace. Other regular expression elements match the positions between characters, instead of actual characters. \b , for example, matches a word boundary -- the boundary between a \w (ASCII word character) and a \W (non-word character), or the boundary between an ASCII word character and the beginning or end of a string.[37] Elements like \b do not specify any characters to be used in a matched string; what they do specify, however, is legal positions at which a match can occur. Sometimes these elements are called regular expression anchors, because they anchor the pattern to a specific position in the search string. The most commonly used anchor elements are ^, which ties the pattern to the beginning of the string, and $, which anchors the pattern to the end of the string.

[37]Except within a character class (square brackets), where \b matches the backspace character.

For example, to match the word "JavaScript" on a line by itself, we could use the regular expression /^JavaScript$/. If we wanted to search for "Java" used as a word by itself (not as a prefix, as it is in "JavaScript"), we might try the pattern /\sJava\s/, which requires a space before and after the word. But there are two problems with this solution. First, it does not match "Java" if that word appears at the beginning or the end of a string, but only if it appears with space on either side. Second, when this pattern does find a match, the matched string it returns has leading and trailing spaces, which is not quite what we want. So instead of matching actual space characters with \s, we instead match (or anchor to) word boundaries with \b. The resulting expression is /\bJava\b/. The element \B anchors the match to a location that is not a word boundary. Thus, the pattern /\B[Ss]cript/ matches "JavaScript" and "postscript", but not "script" or "Scripting".

In JavaScript 1.5 (but not JavaScript 1.2), you can also use arbitrary regular expressions as anchor conditions. If you include an expression within (?= and ) characters, it is a look-ahead assertion, and it specifies that the following characters must match, without actually matching them. For example, to match the name of a common programming language, but only if it is followed by a colon, you could use /[Jj]ava([Ss]cript)?(?=\:)/. This pattern matches the word "JavaScript" in "JavaScript: The Definitive Guide", but it does not match "Java" in "Java in a Nutshell" because it is not followed by a colon.

If you instead introduce an assertion with (?! , it is a negative look-ahead assertion, which specifies that the following characters must not match. For example, /Java(?!Script)([A-Z]\w*)/ matches "Java" followed by a capital letter and any number of additional ASCII word characters, as long as "Java" is not followed by "Script". It matches "JavaBeans" but not "Javanese", and it matches "JavaScrip" but not "JavaScript" or "JavaScripter".

Table 10-5 summarizes regular expression anchors.

Table 10-5. Regular expression anchor characters

Character

Meaning

^

Match the beginning of the string and, in multiline searches, the beginning of a line.

$

Match the end of the string and, in multiline searches, the end of a line.

\b

Match a word boundary. That is, match the position between a \w character and a \W character or between a \w character and the beginning or end of a string. (Note, however, that [\b] matches backspace.)

\B

Match a position that is not a word boundary.

(?=p)

A positive look-ahead assertion. Require that the following characters match the pattern p, but do not include those characters in the match.

(?!p)

A negative look-ahead assertion. Require that the following characters do not match the pattern p.

10.1.6. Flags

There is one final element of regular expression grammar. Regular expression flags specify high-level pattern-matching rules. Unlike the rest of regular expression syntax, flags are specified outside of the / characters; instead of appearing within the slashes, they appear following the second slash. JavaScript 1.2 supports two flags. The i flag specifies that pattern matching should be case-insensitive. The g flag specifies that pattern matching should be global -- that is, all matches within the searched string should be found. Both flags may be combined to perform a global case-insensitive match.

For example, to do a case-insensitive search for the first occurrence of the word "java" (or "Java", "JAVA", etc.), we could use the case-insensitive regular expression /\bjava\b/i. And to find all occurrences of the word in a string, we would add the g flag: /\bjava\b/gi.

JavaScript 1.5 supports an additional flag: m. The m flag performs pattern matching in multiline mode. In this mode, if the string to be searched contains newlines, the ^ and $ anchors match the beginning and end of a line in addition to matching the beginning and end of a string. For example, the pattern /Java$/im matches "java" as well as "Java\nis fun".

Table 10-6 summarizes these regular expression flags. Note that we'll see more about the g flag later in this chapter, when we consider the String and RegExp methods used to actually perform matches.

Table 10-6. Regular expression flags

Character

Meaning

i

Perform case-insensitive matching.

g

Perform a global match. That is, find all matches rather than stopping after the first match.

m

Multiline mode. ^ matches beginning of line or beginning of string, and $ matches end of line or end of string.

10.1.7. Perl RegExp Features Not Supported in JavaScript

We've said that ECMAScript v3 specifies a relatively complete subset of the regular expression facilities from Perl 5. Advanced Perl features that are not supported by ECMAScript include the following:



Library Navigation Links

Copyright © 2003 O'Reilly & Associates. All rights reserved.