org.apache.oro.text.perl
public final class Perl5Util extends Object implements MatchResult
The objective of the class is to minimize the amount of code a Java programmer using Jakarta-ORO has to write to achieve the same results as Perl by transparently handling regular expression compilation, caching, and matching. A second objective is to use the same Perl pattern matching syntax to ease the task of Perl programmers transitioning to Java (this also reduces the number of parameters to a method). All the state-affecting methods are synchronized to avoid the maintenance of explicit locks in multithreaded programs. This philosophy differs from the org.apache.oro.text.regex package, where you are expected to either maintain explicit locks or, preferably, create separate compiler and matcher instances for each thread.
To use this class, first create an instance using the default constructor or initialize the instance with a PatternCache of your choosing using the alternate constructor. The default cache used by Perl5Util is a PatternCacheLRU of capacity GenericPatternCache.DEFAULT_CAPACITY. You may want to create a cache with a different capacity, a different cache replacement policy, or even devise your own PatternCache implementation. The PatternCacheLRU is probably the best general purpose pattern cache, but your specific application may be better served by a different cache replacement policy. You should remember that you can front-load a cache with all the patterns you will be using before initializing a Perl5Util instance, or you can just let Perl5Util fill the cache as you use it.
You might use the class as follows:
    Perl5Util util = new Perl5Util();
    String line;
    DataInputStream input;
    PrintStream output;

    // Initialization of input and output omitted
    while((line = input.readLine()) != null) {
        // First find the line with the string we want to substitute because
        // it is cheaper than blindly substituting each line.
        if(util.match("/HREF=\"description1\\.html\"/", line)) {
            line = util.substitute("s/description1\\.html/about1.html/", line);
        }
        output.println(line);
    }
A couple of things to remember when using this class are that the match() methods have the same meaning as Perl5Matcher.contains() and =~ m/pattern/ in Perl. The methods are named match to more closely associate them with Perl and to differentiate them from Perl5Matcher.matches().
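The difference between searching anywhere in the input and matching the whole input can be sketched with the JDK's java.util.regex package. It is used here only as a stand-in so the example runs without the Jakarta-ORO jar. The assumption is that Matcher.find() plays the role of Perl5Matcher.contains() (and therefore of match()), while Matcher.matches() plays the role of Perl5Matcher.matches(). The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Stand-in illustration using java.util.regex, NOT the Jakarta-ORO API.
class ContainsVersusMatches {
    // Like Perl's =~ m/pattern/: true if the pattern occurs anywhere in input.
    static boolean contains(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // True only if the pattern accounts for the entire input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String line = "<A HREF=\"description1.html\">";
        System.out.println(contains("description1\\.html", line));     // true
        System.out.println(matchesWhole("description1\\.html", line)); // false
    }
}
```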
A further thing to keep in mind is that the
MalformedPerl5PatternException class is derived from
RuntimeException which means you DON'T have to catch it. The reasoning
behind this is that you will detect your regular expression mistakes
as you write and debug your program when a MalformedPerl5PatternException
is thrown during a test run. However, we STRONGLY recommend that you
ALWAYS catch MalformedPerl5PatternException whenever you deal with a
DYNAMICALLY created pattern. Relying on a fatal
MalformedPerl5PatternException being thrown to detect errors while
debugging is only useful for dealing with static patterns, that is, actual
pregenerated strings present in your program. Patterns created from user
input or some other dynamic method CANNOT be relied upon to be correct
and MUST be handled by catching MalformedPerl5PatternException for your
programs to be robust.
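The advice about dynamically created patterns can be sketched as follows. The example uses the JDK's java.util.regex.PatternSyntaxException, which is also an unchecked exception, as a stand-in for MalformedPerl5PatternException so that it runs without the Jakarta-ORO jar. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Stand-in illustration using java.util.regex, NOT the Jakarta-ORO API.
class DynamicPatternGuard {
    // Returns true if the user-supplied regex occurs in input. A malformed
    // regex is reported and treated as "no match" instead of ending the program.
    static boolean safeContains(String userRegex, String input) {
        try {
            return Pattern.compile(userRegex).matcher(input).find();
        } catch (PatternSyntaxException e) {
            // The compiler does not force this catch; it is added deliberately
            // because the pattern comes from outside the program.
            System.err.println("Rejected pattern: " + e.getDescription());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(safeContains("t.xt", "some text"));      // true
        System.out.println(safeContains("(unclosed", "some text")); // false
    }
}
```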
Finally, as a convenience Perl5Util implements the MatchResult interface. The methods are merely wrappers which call the corresponding method of the last MatchResult found (which can be accessed with getMatch) by a match or substitution (or even a split, but this isn't particularly useful).
At the moment, the MatchResult returned by getMatch is not stored in a thread-local variable. Therefore concurrent calls to getMatch will produce unpredictable results. So if your concurrent program requires the match results, you must protect the matching and the result retrieval in a critical section. If you do not need match results, you don't need to do anything special. If you feel the J2SE implementation of getMatch should use a thread-local variable and obviate the need for a critical section, please express your views on the oro-dev mailing list.
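The critical section described above can be sketched with a small stand-in class. SharedMatchState is hypothetical: it mimics an object whose synchronized methods share one "last match" field, and it is built on java.util.regex rather than Jakarta-ORO. The sketch assumes the synchronized methods lock on the instance itself, so an outer synchronized block on the same instance keeps other threads out between the match and the retrieval.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for an object with a shared "last match" field.
class SharedMatchState {
    private String lastMatch;

    synchronized boolean match(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        boolean found = m.find();
        lastMatch = found ? m.group() : null;
        return found;
    }

    synchronized String getMatch() {
        return lastMatch;
    }

    // Critical section: the lock is held across both calls, so no other
    // thread can overwrite lastMatch between match() and getMatch().
    static String matchAndGet(SharedMatchState util, String regex, String input) {
        synchronized (util) {
            return util.match(regex, input) ? util.getMatch() : null;
        }
    }

    // Two threads share one instance; each must always see its own result.
    static boolean stressTest() {
        final SharedMatchState util = new SharedMatchState();
        final AtomicBoolean ok = new AtomicBoolean(true);
        Thread digits = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                if (!"123".equals(matchAndGet(util, "\\d+", "abc 123"))) ok.set(false);
            }
        });
        Thread letters = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                if (!"xyz".equals(matchAndGet(util, "[a-z]+", "456 xyz"))) ok.set(false);
            }
        });
        digits.start();
        letters.start();
        try {
            digits.join();
            letters.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return ok.get();
    }

    public static void main(String[] args) {
        System.out.println(stressTest());
    }
}
```

Without the outer synchronized block, each call would still be individually synchronized, but another thread could run match() between the two calls and replace the stored result.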
Since: 1.0
Version: 2.0.8
See Also: MalformedPerl5PatternException PatternCache PatternCacheLRU MatchResult
Field Summary | |
---|---|
static int | SPLIT_ALL: A constant passed to the split() methods indicating that all occurrences of a pattern should be used to split a string. |
Constructor Summary | |
---|---|
Perl5Util(PatternCache cache) | A secondary constructor for Perl5Util. |
Perl5Util() | Default constructor for Perl5Util. |
Method Summary | |
---|---|
int | begin(int group): Returns the begin offset of the subgroup of the last match found relative to the beginning of the match. |
int | beginOffset(int group): Returns an offset marking the beginning of the last pattern match found relative to the beginning of the input from which the match was extracted. |
int | end(int group): Returns the end offset of the subgroup of the last match found relative to the beginning of the match. |
int | endOffset(int group): Returns an offset marking the end of the last pattern match found relative to the beginning of the input from which the match was extracted. |
MatchResult | getMatch(): Returns the last match found by a call to a match(), substitute(), or split() method. |
String | group(int group): Returns the contents of the parenthesized subgroups of the last match found according to the behavior dictated by the MatchResult interface. |
int | groups() |
int | length(): Returns the length of the last match found. |
boolean | match(String pattern, char[] input): Searches for the first pattern match somewhere in a character array, taking a pattern specified in Perl5 native format: [m]/pattern/[i][m][s][x] |
boolean | match(String pattern, String input): Searches for the first pattern match in a String, taking a pattern specified in Perl5 native format: [m]/pattern/[i][m][s][x] |
boolean | match(String pattern, PatternMatcherInput input): Searches for the next pattern match somewhere in an org.apache.oro.text.regex.PatternMatcherInput instance, taking a pattern specified in Perl5 native format: [m]/pattern/[i][m][s][x] |
String | postMatch(): Returns the part of the input following the last match found. |
char[] | postMatchCharArray(): Returns the part of the input following the last match found as a char array. |
String | preMatch(): Returns the part of the input preceding the last match found. |
char[] | preMatchCharArray(): Returns the part of the input preceding the last match found as a char array. |
void | split(Collection results, String pattern, String input, int limit): Splits a String into strings that are appended to a Collection, but no more than a specified limit. |
void | split(Collection results, String pattern, String input): This method is identical to calling: split(results, pattern, input, SPLIT_ALL); |
void | split(Collection results, String input): Splits input in the default Perl manner, splitting on all whitespace. |
Vector | split(String pattern, String input, int limit): Splits a String into strings contained in a Vector of size no greater than a specified limit. |
Vector | split(String pattern, String input): This method is identical to calling: split(pattern, input, SPLIT_ALL); |
Vector | split(String input): Splits input in the default Perl manner, splitting on all whitespace. |
int | substitute(StringBuffer result, String expression, String input): Substitutes a pattern in a given input with a replacement string. |
String | substitute(String expression, String input): Substitutes a pattern in a given input with a replacement string. |
String | toString(): Returns the same as group(0). |
A constant passed to the split() methods indicating that all occurrences of a pattern should be used to split a string.

    // We know we're going to use close to 50 expressions a whole lot, so
    // we create a cache of the proper size.
    util = new Perl5Util(new PatternCacheLRU(50));

or

    // We're only going to use a few expressions and know that second-chance
    // fifo is best suited to the order in which we are using the patterns.
    util = new Perl5Util(new PatternCacheFIFO2(10));
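To make the idea of a capacity-bounded pattern cache with a least-recently-used replacement policy concrete, here is a minimal sketch built only on the JDK. LruPatternCache is a hypothetical class and is not the PatternCacheLRU implementation; it caches java.util.regex.Pattern objects and relies on LinkedHashMap's access ordering for eviction.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical LRU cache of compiled patterns (JDK only, not Jakarta-ORO).
class LruPatternCache {
    private final Map<String, Pattern> cache;

    LruPatternCache(final int capacity) {
        // accessOrder = true keeps entries ordered least recently used first.
        this.cache = new LinkedHashMap<String, Pattern>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Pattern> eldest) {
                return size() > capacity; // evict once capacity is exceeded
            }
        };
    }

    // Returns the cached compiled pattern, compiling and storing it on a miss.
    synchronized Pattern getPattern(String regex) {
        Pattern p = cache.get(regex);
        if (p == null) {
            p = Pattern.compile(regex);
            cache.put(regex, p);
        }
        return p;
    }

    // containsKey does not count as an access, so it leaves the order alone.
    synchronized boolean isCached(String regex) {
        return cache.containsKey(regex);
    }

    public static void main(String[] args) {
        LruPatternCache cache = new LruPatternCache(2);
        cache.getPattern("a+");
        cache.getPattern("b+");
        cache.getPattern("a+"); // "a+" becomes the most recently used entry
        cache.getPattern("c+"); // exceeds capacity, so "b+" is evicted
        System.out.println(cache.isCached("a+") + " " + cache.isCached("b+"));
        // prints: true false
    }
}
```

Front-loading such a cache amounts to calling getPattern for each expected expression before the cache is handed to the code that uses it.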
Parameters: group The pattern subgroup.
Returns: The offset into group 0 of the first token in the indicated pattern subgroup. If a group was never matched or does not exist, returns -1. Be aware that a group that matches the null string at the end of a match will have an offset equal to the length of the string, so you shouldn't blindly use the offset to index an array or String.
Parameters: group The pattern subgroup.
Returns: The offset of the first token in the indicated pattern subgroup. If a group was never matched or does not exist, returns -1.
Parameters: group The pattern subgroup.
Returns: One plus the offset into group 0 of the last token in the indicated pattern subgroup. If a group was never matched or does not exist, returns -1. A group matching the null string will return its start offset.
Parameters: group The pattern subgroup.
Returns: One plus the offset of the last token in the indicated pattern subgroup. If a group was never matched or does not exist, returns -1. A group matching the null string will return its start offset.
Returns: The org.apache.oro.text.regex.MatchResult instance containing the last match found.
Parameters: group The pattern subgroup to return.
Returns: A string containing the indicated pattern subgroup. Group 0 always refers to the entire match. If a group was never matched, it returns null. This is not to be confused with a group matching the null string, which will return a String of length 0.
Returns: The number of groups contained in the last match found. This number includes the 0th group. In other words, the result refers to the number of parenthesized subgroups plus the entire match itself.
Returns: The length of the last match found.
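The pattern is specified in Perl5 native format:
    [m]/pattern/[i][m][s][x]
The m prefix is optional and the meanings of the optional trailing options are: i (case-insensitive match), m (treat the input as consisting of multiple lines), s (treat the input as consisting of a single line), and x (enable extended expression syntax incorporating whitespace and comments).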
If the input contains the pattern, the org.apache.oro.text.regex.MatchResult can be obtained by calling getMatch. However, Perl5Util implements the MatchResult interface as a wrapper around the last MatchResult found, so you can call its methods to access match information.
Parameters: pattern The pattern to search for. input The char[] input to search.
Returns: True if the input contains the pattern, false otherwise.
Throws: MalformedPerl5PatternException If there is an error in the pattern. You are not forced to catch this exception because it is derived from RuntimeException.
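The pattern is specified in Perl5 native format:
    [m]/pattern/[i][m][s][x]
The m prefix is optional and the meanings of the optional trailing options are: i (case-insensitive match), m (treat the input as consisting of multiple lines), s (treat the input as consisting of a single line), and x (enable extended expression syntax incorporating whitespace and comments).
If the input contains the pattern, the MatchResult can be obtained by calling getMatch. However, Perl5Util implements the MatchResult interface as a wrapper around the last MatchResult found, so you can call its methods to access match information.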
Parameters: pattern The pattern to search for. input The String input to search.
Returns: True if the input contains the pattern, false otherwise.
Throws: MalformedPerl5PatternException If there is an error in the pattern. You are not forced to catch this exception because it is derived from RuntimeException.
The pattern is specified in Perl5 native format:
    [m]/pattern/[i][m][s][x]
The m prefix is optional and the meanings of the optional trailing options are: i (case-insensitive match), m (treat the input as consisting of multiple lines), s (treat the input as consisting of a single line), and x (enable extended expression syntax incorporating whitespace and comments).
If the input contains the pattern, the MatchResult can be obtained by calling getMatch. However, Perl5Util implements the MatchResult interface as a wrapper around the last MatchResult found, so you can call its methods to access match information.
After the call to this method, the PatternMatcherInput current offset is advanced to the end of the match, so you can use it to repeatedly search for expressions in the entire input using a while loop as explained in the PatternMatcherInput documentation.
Parameters: pattern The pattern to search for. input The PatternMatcherInput to search.
Returns: True if the input contains the pattern, false otherwise.
Throws: MalformedPerl5PatternException If there is an error in the pattern. You are not forced to catch this exception because it is derived from RuntimeException.
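The repeated-search loop mentioned for the PatternMatcherInput variant has a close analogue in the JDK, shown here as a hedged sketch. java.util.regex.Matcher is used as a stand-in whose find() call resumes after the previous match, much as the PatternMatcherInput offset advances. The class and method names are hypothetical. The reported end offset is one past the last matched character, in line with the end offsets described earlier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in illustration using java.util.regex, NOT the Jakarta-ORO API.
class RepeatedSearch {
    // Collects "begin-end:text" for every match found in input.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) { // each call resumes where the previous match ended
            found.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d+", "a12b345"));
        // prints: [1-3:12, 4-7:345]
    }
}
```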
Returns: The part of the input following the last match found.
Returns: The part of the input following the last match found as a char[]. If the result is of zero length, returns null instead of a zero length array.
Returns: The part of the input preceding the last match found.
Returns: The part of the input preceding the last match found as a char[]. If the result is of zero length, returns null instead of a zero length array.
The regular expression is a pattern specified in Perl5 native format:
    [m]/pattern/[i][m][s][x]
The m prefix is optional and the meanings of the optional trailing options are: i (case-insensitive match), m (treat the input as consisting of multiple lines), s (treat the input as consisting of a single line), and x (enable extended expression syntax incorporating whitespace and comments).
The limit parameter causes the string to be split on at most the first limit - 1 occurrences of the pattern.
Of special note is that this split method performs EXACTLY the same as the Perl split() function. In other words, if the split pattern contains parentheses, additional elements are created from each of the matching subgroups in the pattern. Using an example similar to the one from the Camel book:
    split(list, "/([,-])/", "8-12,15,18")
produces a list containing:
    { "8", "-", "12", ",", "15", ",", "18" }
Furthermore, the following Perl behavior is observed: "leading empty fields are preserved, and empty trailing ones are deleted." This has the effect that a split on a zero length string returns an empty list. The Util.split() method does NOT implement these behaviors because it is intended to be a general self-consistent and predictable split function usable with Pattern instances other than Perl5Pattern.
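The split behaviors described above (subgroups of the delimiter becoming extra elements, leading empty fields kept, trailing empty fields deleted) can be sketched with the JDK alone. This is an illustrative reimplementation on top of java.util.regex, not the Perl5Util code. PerlStyleSplit is a hypothetical name, the regex is passed without the /.../ delimiters, and the sketch assumes the delimiter pattern never matches the empty string. As a simplification it deletes trailing empty fields for every limit value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch using java.util.regex, NOT the Jakarta-ORO API.
class PerlStyleSplit {
    // limit <= 0 means "split on every occurrence" (the role of SPLIT_ALL).
    static List<String> split(String regex, String input, int limit) {
        List<String> fields = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        int start = 0;  // where the current field begins
        int splits = 0; // delimiter occurrences consumed so far
        while ((limit <= 0 || splits < limit - 1) && m.find()) {
            fields.add(input.substring(start, m.start()));
            // Parenthesized subgroups of the delimiter become extra elements.
            for (int g = 1; g <= m.groupCount(); g++) {
                if (m.group(g) != null) {
                    fields.add(m.group(g));
                }
            }
            start = m.end();
            splits++;
        }
        fields.add(input.substring(start));
        // Leading empty fields stay; trailing empty fields are deleted.
        while (!fields.isEmpty() && fields.get(fields.size() - 1).isEmpty()) {
            fields.remove(fields.size() - 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("([,-])", "8-12,15,18", 0));
        // prints: [8, -, 12, ,, 15, ,, 18]
    }
}
```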
Parameters: results A Collection to which the substrings of the input that occur between the regular expression delimiter occurrences are appended. The input will not be split into any more substrings than the specified limit. A way of thinking of this is that only the first limit - 1 matches of the delimiting regular expression will be used to split the input. The Collection must support the addAll(Collection) operation. pattern The regular expression to use as a split delimiter. input The String to split. limit The limit on the number of substrings produced. Values <= 0 produce the same behavior as the SPLIT_ALL constant, which causes the limit to be ignored and splits to be performed on all occurrences of the pattern. You should use the SPLIT_ALL constant to achieve this behavior instead of relying on the default behavior associated with non-positive limit values.
Throws: MalformedPerl5PatternException If there is an error in the expression. You are not forced to catch this exception because it is derived from RuntimeException.
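This method is identical to calling: split(results, pattern, input, SPLIT_ALL);
Splits input in the default Perl manner, splitting on all whitespace. This method is identical to calling: split(results, "/\\s+/", input);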
Deprecated: Use split(Collection results, String pattern, String input, int limit) instead.
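Splits a String into strings contained in a Vector of size no greater than a specified limit. The String is split using a regular expression as the delimiter. The regular expression is a pattern specified in Perl5 native format:
    [m]/pattern/[i][m][s][x]
The m prefix is optional and the meanings of the optional trailing options are: i (case-insensitive match), m (treat the input as consisting of multiple lines), s (treat the input as consisting of a single line), and x (enable extended expression syntax incorporating whitespace and comments).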
The limit parameter causes the string to be split on at most the first limit - 1 occurrences of the pattern.
Of special note is that this split method performs EXACTLY the same as the Perl split() function. In other words, if the split pattern contains parentheses, additional Vector elements are created from each of the matching subgroups in the pattern. Using an example similar to the one from the Camel book:
    split("/([,-])/", "8-12,15,18")
produces the Vector containing:
    { "8", "-", "12", ",", "15", ",", "18" }
The Util.split() method does NOT implement this particular behavior because it is intended to be usable with Pattern instances other than Perl5Pattern.
Parameters: pattern The regular expression to use as a split delimiter. input The String to split. limit The limit on the size of the returned Vector. Values <= 0 produce the same behavior as the SPLIT_ALL constant, which causes the limit to be ignored and splits to be performed on all occurrences of the pattern. You should use the SPLIT_ALL constant to achieve this behavior instead of relying on the default behavior associated with non-positive limit values.
Returns: A Vector containing the substrings of the input that occur between the regular expression delimiter occurrences. The input will not be split into any more substrings than the specified limit. A way of thinking of this is that only the first limit - 1 matches of the delimiting regular expression will be used to split the input.
Throws: MalformedPerl5PatternException If there is an error in the expression. You are not forced to catch this exception because it is derived from RuntimeException.
Deprecated: Use split(Collection results, String pattern, String input) instead.
This method is identical to calling: split(pattern, input, SPLIT_ALL);
Deprecated: Use split(Collection results, String input) instead.
Splits input in the default Perl manner, splitting on all whitespace. This method is identical to calling: split("/\\s+/", input);
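The substitution expression is specified in Perl5 native format:
    s/pattern/replacement/[g][i][m][o][s][x]
The s prefix is mandatory and the meanings of the optional trailing options are:
    g  Substitute all occurrences of the pattern with the replacement. The default is to replace only the first occurrence.
    i  Perform a case-insensitive match.
    m  Treat the input as consisting of multiple lines.
    o  If variable interpolation is used, evaluate the interpolation only once (the first time), a behavior also available through Util.substitute(). The default is to compute each interpolation independently. See Util.substitute() and Perl5Substitution for more details on variable interpolation in substitutions.
    s  Treat the input as consisting of a single line.
    x  Enable extended expression syntax incorporating whitespace and comments.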
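A delimiter other than the slash can be used to avoid backslashing. For example, using slashes you would have to write:
    numSubs = util.substitute(result, "s/foo\\/bar/goo\\/\\/baz/", input);
when you could more easily write:
    numSubs = util.substitute(result, "s#foo/bar#goo//baz#", input);
where the hashmarks are used instead of slashes.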
There is a special case of backslashing that you need to pay attention to. As demonstrated above, to denote a delimiter in the substituted string it must be backslashed. However, this can be a problem when you want to denote a backslash at the end of the substituted string. As of PerlTools 1.3, a new means of handling this situation has been implemented. In previous versions, the behavior was that
"... a double backslash (quadrupled in the Java String) always represents two backslashes unless the second backslash is followed by the delimiter, in which case it represents a single backslash."
The new behavior is that a backslash is always a backslash in the substitution portion of the expression unless it is used to escape a delimiter. A backslash is considered to escape a delimiter if an even number of contiguous backslashes precede the backslash and the delimiter following the backslash is not the FINAL delimiter in the expression. Therefore, backslashes preceding final delimiters are never considered to escape the delimiter. The following, which used to be an invalid expression and required a special-case extra backslash, will now replace all instances of / with \:
numSubs = util.substitute(result, "s#/#\\#g", input);
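Because two layers of escaping are involved (the Java string literal and the substitution expression), it can help to look at the text that results once the Java compiler has processed the literal. The sketch below uses only the JDK. BackslashLayers is a hypothetical class, and String.replace is used merely to show the intended effect of the expression, not how Perl5Util implements it.

```java
// Illustration of the escaping layers (JDK only, NOT the Jakarta-ORO API).
class BackslashLayers {
    // The Java literal "s#/#\\#g" holds the 7 characters  s # / # \ # g
    static final String EXPRESSION = "s#/#\\#g";

    // Plain-JDK equivalent of the intended effect: every '/' becomes '\'.
    static String slashesToBackslashes(String input) {
        return input.replace('/', '\\');
    }

    public static void main(String[] args) {
        System.out.println(EXPRESSION);                    // s#/#\#g
        System.out.println(EXPRESSION.length());           // 7
        System.out.println(slashesToBackslashes("a/b/c")); // a\b\c
    }
}
```

The single backslash in the expression text sits directly before the final delimiter, which is the case the rule above singles out as never being an escape.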
Parameters: result The StringBuffer in which to store the result of the substitutions. The buffer is only appended to. expression The Perl5 substitution regular expression. input The input on which to perform substitutions.
Returns: The number of substitutions made.
Throws: MalformedPerl5PatternException If there is an error in the expression. You are not forced to catch this exception because it is derived from RuntimeException.
Since: 2.0.6
This method is equivalent to:
    String result;
    StringBuffer buffer = new StringBuffer();
    perl.substitute(buffer, expression, input);
    result = buffer.toString();
Parameters: expression The Perl5 substitution regular expression. input The input on which to perform substitutions.
Returns: The input as a String after substitutions have been performed.
Throws: MalformedPerl5PatternException If there is an error in the expression. You are not forced to catch this exception because it is derived from RuntimeException.
Since: 1.0
See Also: Perl5Util
Returns: A string containing the entire match.