A regular expression is a string that specifies a pattern describing a set of matching subject strings. Regular expressions are constructed inductively as follows. Ordinary (non-special) characters match themselves. A concatenation of regular expressions matches the concatenation of corresponding matching subject strings. Regular expressions separated by the character | match any string matched by one of them. Parentheses can be used for grouping, and the results can report which substring of the subject string matched each parenthesized subexpression of the regular expression.
The special characters are those appearing in the following constructions. The special character \ may be confusing, as inside a string delimited by quotation marks ("..."), you type two of them to get one, whereas inside a string delimited by triple slashes (///...///), you type one to get one. Thus regular expressions delimited by triple slashes are more readable.
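As a small sketch of the two quoting styles, the following writes one and the same pattern both ways. The variable names p1 and p2 are ours, chosen only for illustration.

```macaulay2
-- Both strings denote the regular expression ([[:alpha:]]+) *\1
p1 = "([[:alpha:]]+) *\\1";      -- inside "...", type two backslashes to get one
p2 = ///([[:alpha:]]+) *\1///;   -- inside ///...///, type one backslash to get one
p1 === p2                        -- true: the two strings are identical
regex(p2, "Dog cat cat.")        -- {(4, 7), (4, 3)}, the same as with p1
```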
- . -- match any character except newline
- ^ -- match the beginning of the string or the beginning of a line
- $ -- match the end of the string or the end of a line
- * -- match previous expression 0 or more times
- + -- match previous expression 1 or more times
- ? -- match previous expression 1 or 0 times
- (...) -- subpattern grouping
- | -- match expression to left or expression to right
- {m,n} -- match previous expression at least m and at most n times
- {,n} -- match previous expression at most n times
- {m,} -- match previous expression at least m times
- \i -- match the same string that the i-th parenthesized subpattern matched
- [...] -- match listed characters, ranges, or classes
- [^...] -- match non-listed characters, ranges, or classes
- \b -- match word boundary
- \B -- match within word
- \< -- match beginning of word
- \> -- match end of word
- \w -- match word-constituent character
- \W -- match non-word-constituent character
- \` -- match beginning of string
- \' -- match end of string
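A few of the constructions above can be tried as follows. This is only a sketch: the subject string is the one used in the examples further down, and the results noted in the comments are what the descriptions above lead us to expect.

```macaulay2
s = "1abcddddeF2";
regex("^1abc", s)      -- {(0, 4)}: ^ anchors the match at the beginning
regex("d{2,3}", s)     -- {(4, 3)}: at most three of the four d's are taken
regex("[^a-z]+$", s)   -- {(9, 2)}: the final run of non-lowercase characters, F2
regex("a|F", s)        -- {(1, 1)}: the leftmost match of either alternative
```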
There are the following character classes, which are used inside a bracket expression, as in [[:alpha:]].
- [:alnum:] -- letters and digits
- [:alpha:] -- letters
- [:blank:] -- a space or tab
- [:cntrl:] -- control characters
- [:digit:] -- digits
- [:graph:] -- same as [:print:] except omits space
- [:lower:] -- lowercase letters
- [:print:] -- printable characters
- [:punct:] -- neither control nor alphanumeric characters
- [:space:] -- space, tab, carriage return, newline, vertical tab, and form feed
- [:upper:] -- uppercase letters
- [:xdigit:] -- hexadecimal digits
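The sketch below shows character classes placed inside a bracket expression, including a negated one. The subject string is our own choice, and the results in the comments follow from the class descriptions above.

```macaulay2
s = "1abcddddeF2";
regex("[[:digit:]]+", s)    -- {(0, 1)}: the leading digit 1
regex("[[:upper:]]", s)     -- {(9, 1)}: the uppercase letter F
regex("[[:lower:]]+", s)    -- {(1, 8)}: the run abcdddde
regex("[^[:alpha:]]+", s)   -- {(0, 1)}: the first run of non-letters
```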
In order to match one of the special characters itself, precede it with a backslash.
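For instance, to find a literal period rather than an arbitrary character, one might write the following. This is a sketch with a subject string of our choosing; note how the backslash is doubled inside quotation marks but not inside triple slashes.

```macaulay2
s = "Dog cat cat.";
regex(".", s)         -- {(0, 1)}: an unescaped . matches the first character
regex("\\.", s)       -- {(11, 1)}: an escaped . matches only the period
regex(///\.///, s)    -- {(11, 1)}: the same pattern written with triple slashes
```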
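We illustrate the use of regular expressions with regex(String,String). When a match is found, the result is a list of pairs (offset, length): the first pair locates the leftmost match in the subject string, with offsets counted from 0, and each later pair locates the substring matched by the corresponding parenthesized subexpression. When there is no match, the value is null and no output is displayed, as in i4 below.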
i1 : regex("d", "1abcddddeF2")
o1 = {(4, 1)}
o1 : List

i2 : regex("d*", "1abcddddeF2")
o2 = {(0, 0)}
o2 : List

i3 : regex("d+", "1abcddddeF2")
o3 = {(4, 4)}
o3 : List

i4 : regex("d+", "1abceF2")

i5 : regex("cdd+e", "1abcddddeF2")
o5 = {(3, 6)}
o5 : List

i6 : regex("cd(d+)e", "1abcddddeF2")
o6 = {(3, 6), (5, 3)}
o6 : List

i7 : regex("[a-z]+", "1abcddddeF2")
o7 = {(1, 8)}
o7 : List

i8 : regex("[[:alpha:]]+", "Dog cat cat.")
o8 = {(0, 3)}
o8 : List

i9 : regex("([[:alpha:]]+) *\\1","Dog cat cat.")
o9 = {(4, 7), (4, 3)}
o9 : List
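The pairs returned by regex give an offset and a length, so they can be handed to substring to recover the matched text. The following sketch does this for the last example and also checks for the case of no match. The names s and m are ours.

```macaulay2
s = "Dog cat cat.";
m = regex("([[:alpha:]]+) *\\1", s);       -- {(4, 7), (4, 3)}
apply(m, pair -> substring(pair, s))       -- {"cat cat", "cat"}
regex("d+", "1abceF2") === null            -- true: no match was found
```

Testing the result against null before passing it to substring avoids an error when the pattern does not occur in the subject string.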
For complete documentation on regular expressions, see the entry for regex in section 7 of the Unix man pages, or read the GNU regex manual.