Unix Power Tools

32.17. Just What Does a Regular Expression Match?

One of the toughest things to learn about regular expressions is just what they do match. The problem is that a regular expression tends to find the longest possible match -- which can be more than you want.
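As a small illustration of the longest-match behavior, the following sketch runs two sed substitutions over the same line. The sample text and the variable names (line, greedy, tight) are made up for this example:

```shell
#!/bin/sh
# Illustrative input line (made up for this sketch).
line='say "hi" and then "bye" now'

# ".*" is greedy: it runs from the first quote to the LAST quote on the line.
greedy=$(printf '%s\n' "$line" | sed 's/".*"/<>/')

# "[^"]*" cannot cross a quote, so it stops at the first closing quote.
tight=$(printf '%s\n' "$line" | sed 's/"[^"]*"/<>/')

printf '%s\n%s\n' "$greedy" "$tight"
```

The first substitution swallows everything between the outermost quotes; the second replaces only the first quoted string.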

Go to http://examples.oreilly.com/upt3 for more information on: showmatch

Here's a simple script called showmatch that is useful for testing regular expressions, when writing sed scripts, etc. Given a regular expression and a filename, it finds lines in the file matching that expression, just like grep, but it uses a row of carets (^^^^) to highlight the portion of the line that was actually matched. Depending on your system, you may need to call nawk instead of awk; most modern systems have an awk that supports the syntax introduced by nawk, however.

#! /bin/sh
# showmatch - mark string that matches pattern
pattern=$1; shift
awk 'match($0,pattern) > 0 {
    s = substr($0,1,RSTART-1)        # text before the match
    m = substr($0,RSTART,RLENGTH)    # the matched text
    gsub (/[^\b- ]/, " ", s)         # blank out s, but keep tabs and spaces
    gsub (/./,       "^", m)         # one caret per matched character
    printf "%s\n%s%s\n", $0, s, m
}' pattern="$pattern" "$@"

For example:

% showmatch 'CD-...' mbox
and CD-ROM publishing. We have recognized
    ^^^^^^
that documentation will be shipped on CD-ROM; however,
                                      ^^^^^^
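The same marking technique makes an overly long match easy to spot. The sketch below is a hypothetical stdin-only variant of showmatch (the name mark_match and the sample line are assumptions for this example, not part of the book's script), run with a pattern that contains .*:

```shell
#!/bin/sh
# mark_match: hypothetical stdin-only variant of showmatch, for experiments.
mark_match() {
    awk -v pattern="$1" 'match($0, pattern) > 0 {
        s = substr($0, 1, RSTART - 1)       # text before the match
        m = substr($0, RSTART, RLENGTH)     # the matched text itself
        gsub(/[^\t]/, " ", s)               # pad with spaces, keep tabs
        gsub(/./, "^", m)                   # one caret per matched character
        printf "%s\n%s%s\n", $0, s, m
    }'
}

# "c.*t" stretches to the last "t" on the line, not the nearest one.
printf '%s\n' 'the cat sat' | mark_match 'c.*t'
```

If you expected only "cat" to be marked, the row of carets under "cat sat" shows at a glance that the expression matched more than you wanted.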

Go to http://examples.oreilly.com/upt3 for more information on: xgrep

xgrep is a related script that simply retrieves only the matched text. This allows you to extract patterned data from a file. For example, you could extract only the numbers from a table containing both text and numbers. It's also great for counting the number of occurrences of some pattern in your file, as shown below. Just be sure that your expression matches only what you want. If you aren't sure, leave off the wc command and glance at the output. For example, the regular expression [0-9]* will match numbers like 3.2 twice: once for the 3 and again for the 2! You want to include a dot (.) and/or comma (,), depending on how your numbers are written. For example: [0-9][.0-9]* matches a leading digit, possibly followed by more dots and digits.

NOTE: Remember that an expression like [0-9]* also matches zero digits, that is, the empty string (because * means "zero or more of the preceding character"). That expression can make xgrep run for a very long time! The following expression, which matches one or more digits, is probably what you want instead:

xgrep "[0-9][0-9]*" files | wc -l
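To see how much the choice of expression changes the count, here is a rough sketch that uses grep -o as a stand-in for xgrep. That substitution is an assumption: the -o option is provided by GNU and BSD grep but is not required by POSIX. The sample line and variable names are made up:

```shell
#!/bin/sh
# Made-up sample line mixing text and numbers.
sample='width 3.2 cm, height 10 cm'

# Digits only: 3.2 is split into two matches, "3" and "2".
split_count=$(printf '%s\n' "$sample" | grep -o '[0-9][0-9]*' | wc -l | tr -d ' ')

# Leading digit, then any run of dots and digits: 3.2 stays in one piece.
whole_count=$(printf '%s\n' "$sample" | grep -o '[0-9][.0-9]*' | wc -l | tr -d ' ')

echo "digits only: $split_count matches; with dot: $whole_count matches"
```

Dropping the wc -l stage and looking at the matches themselves is the quickest way to confirm which expression fits your data.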

The xgrep shell script runs the sed commands below, replacing $re with the regular expression from the command line and $x with a CTRL-b character (which is used as a delimiter). We've shown the sed commands numbered, like 5>; these are only for reference and aren't part of the script:

1> \$x$re$x!d
2> s//$x&$x/g
3> s/[^$x]*$x//
4> s/$x[^$x]*$x/\
   /g
5> s/$x.*//

Command 1 deletes all input lines that don't contain a match. On the remaining lines (which do match), command 2 surrounds the matching text with CTRL-b delimiter characters. Command 3 removes all characters (including the first delimiter) before the first match on a line. When there's more than one match on a line, command 4 breaks the multiple matches onto separate lines. Command 5 removes the last delimiter, and any text after it, from every output line.
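The five commands can be tried out with a small wrapper along the following lines. This is only a sketch under stated assumptions, not the book's actual script: the function name xgrep_sketch, the use of printf to produce CTRL-b, and the sample input are all hypothetical. Note that in a real script the /g that finishes command 4 has to start at the beginning of its line; otherwise the leading spaces would become part of the output.

```shell
#!/bin/sh
# xgrep_sketch: hypothetical wrapper around the five sed commands above.
xgrep_sketch() {
    re=$1; shift
    x=$(printf '\002')          # CTRL-b delimiter, assumed absent from the input
    # Inside double quotes, \\ yields the single backslash that sed needs.
    sed "\\$x$re$x!d
s//$x&$x/g
s/[^$x]*$x//
s/$x[^$x]*$x/\\
/g
s/$x.*//" "$@"
}

# Two matches on the first line; the second line has none and is deleted.
printf '%s\n' 'a 12 b 345 c' 'no digits here' | xgrep_sketch '[0-9][0-9]*'
```

Each match comes out on its own line, so piping the result to wc -l counts occurrences rather than matching lines.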

Greg Ubben revised showmatch and wrote xgrep.

--JP, DD, and TOR




Copyright © 2003 O'Reilly & Associates. All rights reserved.