The regular expressions (or regex for short) used in searches and segmentation rules are those supported by Java. If you need more specific information, please consult http://java.sun.com/j2se/1.5/docs/api/java/util/regex/Pattern.html. See additional references and examples below.
The construct... | ...matches the following:
Flags
(?i) | Enables case-insensitive matching (by default, the pattern is case-sensitive).
Characters
x | The character x, except the following...
\uhhhh | The character with hexadecimal value 0xhhhh
\t | The tab character ('\u0009')
\n | The newline (line feed) character ('\u000A')
\r | The carriage-return character ('\u000D')
\f | The form-feed character ('\u000C')
\a | The alert (bell) character ('\u0007')
\e | The escape character ('\u001B')
\cx | The control character corresponding to x
\0n | The character with octal value 0n (0 <= n <= 7)
\0nn | The character with octal value 0nn (0 <= n <= 7)
\0mnn | The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh | The character with hexadecimal value 0xhh
Quotation
\ | Nothing, but quotes the following character. This is required if you want to match one of the metacharacters (such as . * + ? ( ) [ ] { } | ^ $ \) literally.
\\ | For example, this is the backslash character
\Q | Nothing, but quotes all characters until \E
\E | Nothing, but ends quoting started by \Q
Classes for Unicode blocks and categories
\p{InGreek} | A character in the Greek block (simple block)
\p{Lu} | An uppercase letter (simple category)
\p{Sc} | A currency symbol
\P{InGreek} | Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]] | Any letter except an uppercase letter (subtraction)
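Character classes
[abc] | a, b, or c (simple class)
[^abc] | Any character except a, b, or c (negation)
[a-zA-Z] | a through z or A through Z, inclusive (range)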
Predefined character classes
. | Any character (except for line terminators)
\d | A digit: [0-9]
\D | A non-digit: [^0-9]
\s | A whitespace character: [ \t\n\x0B\f\r]
\S | A non-whitespace character: [^\s]
\w | A word character: [a-zA-Z_0-9]
\W | A non-word character: [^\w]
Boundary matchers
^ | The beginning of a line
$ | The end of a line
\b | A word boundary
\B | A non-word boundary
Greedy quantifiers
These will match as much as they can. For example, a+ will match the aaa in aaabbb.
X? | X, once or not at all
X* | X, zero or more times
X+ | X, one or more times
Reluctant (non-greedy) quantifiers
These will match as little as they can. For example, a+? will match the first a in aaabbb.
X?? | X, once or not at all
X*? | X, zero or more times
X+? | X, one or more times
Logical operators
XY | X followed by Y
X|Y | Either X or Y
(XY) | XY as a single group
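To see how some of these constructs behave in practice, the following is a minimal sketch using Java's standard java.util.regex package. The class name RegexConstructsDemo, the helper firstMatch, and the sample strings are illustrative assumptions, not part of the application.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; its name and the sample strings are illustrative only.
class RegexConstructsDemo {

    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy quantifier: takes as many characters as it can.
        System.out.println(firstMatch("a+", "aaabbb"));               // aaa

        // Reluctant quantifier: stops as early as it can.
        System.out.println(firstMatch("a+?", "aaabbb"));              // a

        // (?i) flag: case-insensitive matching.
        System.out.println(firstMatch("(?i)regex", "Java REGEX rules")); // REGEX

        // Unquoted, + is a quantifier, so "1+1" means "one or more 1s, then a 1".
        System.out.println(firstMatch("1+1", "1+1=2"));               // null

        // \Q...\E quotes everything in between, so + is matched literally.
        System.out.println(firstMatch("\\Q1+1\\E", "1+1=2"));         // 1+1

        // Unicode category: the first uppercase letter.
        System.out.println(firstMatch("\\p{Lu}", "abcDef"));          // D
    }
}
```

Note that backslashes are doubled only because the patterns appear inside Java string literals; in a search field the pattern is typed with single backslashes, as shown in the table.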
Regular expression | Finds the following: |
(\b\w+\b)\s\1\b | double words
[\.,]\s*[\.,]+ | commas and periods mix-up
\. \s$ | extra blanks following the period at the end of the line
\s+a\s+[aeiou] | English: words starting with a vowel should be preceded by "an", not "a"
\s+an\s+[^aeiou] | English: the same check as above, but for consonants ("a", not "an")
\s\s+ | more than one space |
\.[A-Z] | space missing between a period and the start of a new sentence |
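The search expressions listed above can also be tried out in a few lines of Java. This is only a sketch: the class name RegexSearchExamples, the helper findAll, and the sample sentences are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; its name and the sample sentences are illustrative only.
class RegexSearchExamples {

    // Collects every non-overlapping match of regex in text, in order of appearance.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Double words: group 1 captures a word, \1 requires the same word again.
        System.out.println(findAll("(\\b\\w+\\b)\\s\\1\\b", "This is is a test")); // [is is]

        // More than one space in a row.
        System.out.println(findAll("\\s\\s+", "too  many   spaces").size());      // 2

        // Space missing between a period and the start of a new sentence.
        System.out.println(findAll("\\.[A-Z]", "First sentence.Second one."));    // [.S]

        // "a" before a vowel, and "an" before a non-vowel.
        System.out.println(findAll("\\s+a\\s+[aeiou]", "She ate a apple"));       // [ a a]
        System.out.println(findAll("\\s+an\\s+[^aeiou]", "He saw an dog"));       // [ an d]
    }
}
```

The a/an checks are spelling-based heuristics, so they will also flag correct phrases such as "a university"; hits still need to be reviewed by eye.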