com.ibm.icu.text

Class StringSearch

public final class StringSearch extends SearchIterator

StringSearch is the concrete subclass of SearchIterator that provides language-sensitive text searching based on the comparison rules defined in a RuleBasedCollator object.

StringSearch uses a version of the fast Boyer-Moore search algorithm that has been adapted to work with the large character set of Unicode. Refer to "Efficient Text Searching in Java", published in the Java Report in February 1999, for further information on the algorithm.
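To give a feel for what a Boyer-Moore style shift table does, here is a minimal sketch of the simpler Horspool variant operating on plain Java chars. It is an illustration only and not the ICU implementation: StringSearch compares according to collator rules rather than raw chars, and the class and method names below are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified Boyer-Moore-Horspool search over raw chars (NOT the ICU algorithm).
class HorspoolSketch {
    /** For each pattern char except the last, how far the window may shift. */
    static Map<Character, Integer> buildShiftTable(String pattern) {
        Map<Character, Integer> shift = new HashMap<>();
        int m = pattern.length();
        for (int i = 0; i < m - 1; i++) {
            shift.put(pattern.charAt(i), m - 1 - i);
        }
        return shift;
    }

    /** Returns the index of the first occurrence of pattern in text, or -1. */
    static int indexOf(String text, String pattern) {
        int m = pattern.length();
        int n = text.length();
        if (m == 0) {
            return 0;
        }
        Map<Character, Integer> shift = buildShiftTable(pattern);
        int pos = 0;
        while (pos <= n - m) {
            // Compare the window right to left.
            int j = m - 1;
            while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) {
                j--;
            }
            if (j < 0) {
                return pos;
            }
            // Shift by the table entry for the char under the window's last slot;
            // chars that do not occur in the pattern allow a full-length shift.
            pos += shift.getOrDefault(text.charAt(pos + m - 1), m);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("The quick brown fox", "fox"));
    }
}
```

The shift table depends only on the pattern, which is why a table of this kind has to be rebuilt whenever the pattern changes.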

Users are also strongly encouraged to read the section on String Search and Collation in the user guide before attempting to use this class.

String searching gets a little complicated when accents are encountered at match boundaries. If a match is found and it has preceding or trailing accents that are not part of the match, the result returned will include the preceding accents up to the first base character if the pattern searched for starts with an accent. Likewise, if the pattern ends with an accent, all trailing accents up to the first base character will be included in the result.

For example, if a match is found in target text "a\u0325\u0300" for the pattern "a\u0325", the result returned by StringSearch will be the index 0 and length 3 <0, 3>. If a match is found in the target "a\u0325\u0300" for the pattern "\u0300", then the result will be index 1 and length 2 <1, 2>.

In the case where the decomposition mode is on for the RuleBasedCollator, all matches that start or end with an accent will have their results include preceding or following accents respectively. For example, if pattern "a" is looked for in the target text "á\u0325", the result will be index 0 and length 2 <0, 2>.

The StringSearch class provides two options to handle accent matching described below:

Let S' be the sub-string of a text string S between the offsets start and end <start, end>.
A pattern string P matches a text string S at the offsets <start, length>
if

 
 option 1. P matches some canonical equivalent string of S'. Suppose the 
           RuleBasedCollator used for searching has a collation strength of 
           TERTIARY, so that all accents are non-ignorable. If the pattern 
           "a\u0300" is searched for in the target text "a\u0325\u0300", 
           a match will be found, since the target text is canonically 
           equivalent to "a\u0300\u0325".
 option 2. P matches S' and, if P starts or ends with a combining mark, 
           there exists no non-ignorable combining mark before or after S' 
           in S respectively. Following the example above, the pattern 
           "a\u0300" will not find a match in "a\u0325\u0300", since 
           there exists a non-ignorable accent '\u0325' between 
           'a' and '\u0300'. Even with a target text of 
           "a\u0300\u0325" a match will not be found because of the 
           non-ignorable trailing accent \u0325.
 
Option 2 is the default mode for dealing with boundary accents unless option 1 is selected via the API setCanonical(boolean). One restriction should be noted for option 1. Currently there are no composite characters that consist of a character with combining class > 0 before a character with combining class == 0. However, if such a character exists in the future, StringSearch may not work correctly with option 1 when such characters are encountered.
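The canonical equivalence that option 1 relies on can be checked independently of StringSearch. The following sketch uses the JDK class java.text.Normalizer (not part of ICU4J) to confirm that the two accent orderings from the example have the same canonical decomposition; the class and method names are hypothetical.

```java
import java.text.Normalizer;

class CanonicalEquivalenceCheck {
    /** Two strings are canonically equivalent if their NFD forms are identical. */
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    public static void main(String[] args) {
        String target = "a\u0325\u0300";     // a + ring below + grave
        String reordered = "a\u0300\u0325";  // a + grave + ring below

        // The char sequences differ...
        System.out.println(target.equals(reordered));                // false
        // ...but canonical reordering of the combining marks makes them equal.
        System.out.println(canonicallyEquivalent(target, reordered)); // true
    }
}
```

U+0325 has a lower canonical combining class than U+0300, so normalization places it first in both strings.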

SearchIterator provides APIs to specify the starting position within the text string to be searched, e.g. setIndex, preceding and following. Since the starting position is set exactly as specified, please note that there are some dangerous positions at which the search may return incorrect results:

Though collator attributes will be taken into consideration while performing matches, there are no APIs provided in StringSearch for setting and getting the attributes. These attributes can be set by getting the collator from getCollator and using the APIs in com.ibm.icu.text.Collator. To update StringSearch to the new collator attributes, reset() or setCollator(RuleBasedCollator) has to be called.

Consult the String Search user guide and the SearchIterator documentation for more information and examples of use.
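As a starting point, a minimal usage sketch follows. It assumes the iteration methods first(), next() and getMatchLength() and the constant DONE inherited from SearchIterator; the class name, helper method and sample text are hypothetical.

```java
import com.ibm.icu.text.SearchIterator;
import com.ibm.icu.text.StringSearch;
import java.text.StringCharacterIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class StringSearchBasics {
    /** Collects {start, length} for every match of pattern in text. */
    static List<int[]> findAll(String pattern, String text, Locale locale) {
        StringSearch search =
                new StringSearch(pattern, new StringCharacterIterator(text), locale);
        List<int[]> matches = new ArrayList<>();
        // first() positions on the first match; next() advances until DONE.
        for (int start = search.first();
                start != SearchIterator.DONE;
                start = search.next()) {
            matches.add(new int[] {start, search.getMatchLength()});
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy fox";
        for (int[] m : findAll("fox", text, Locale.ENGLISH)) {
            System.out.println("match at " + m[0] + ", length " + m[1]);
        }
    }
}
```

The match length is read from the iterator rather than taken from the pattern, because a language-sensitive match need not have the same length as the pattern.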

This class is not subclassable.

Author: Laura Werner, synwee

See Also: SearchIterator, RuleBasedCollator

UNKNOWN: ICU 2.0

Constructor Summary
StringSearch(String pattern, CharacterIterator target, RuleBasedCollator collator, BreakIterator breakiter)
Initializes the iterator to use the language-specific rules defined in the argument collator to search for argument pattern in the argument target text.
StringSearch(String pattern, CharacterIterator target, RuleBasedCollator collator)
Initializes the iterator to use the language-specific rules defined in the argument collator to search for argument pattern in the argument target text.
StringSearch(String pattern, CharacterIterator target, Locale locale)
Initializes the iterator to use the language-specific rules and break iterator rules defined in the argument locale to search for argument pattern in the argument target text.
StringSearch(String pattern, CharacterIterator target, ULocale locale)
Initializes the iterator to use the language-specific rules and break iterator rules defined in the argument locale to search for argument pattern in the argument target text.
StringSearch(String pattern, String target)
Initializes the iterator to use the language-specific rules and break iterator rules defined in the default locale to search for argument pattern in the argument target text.
Method Summary
RuleBasedCollator getCollator()
Gets the RuleBasedCollator used for the language rules.
int getIndex()
Returns the index in the target text at which the iterator is currently positioned.
String getPattern()
Returns the pattern for which StringSearch is searching.
protected int handleNext(int start)
Concrete method to provide the mechanism for finding the next forwards match in the target text.
protected int handlePrevious(int start)
Concrete method to provide the mechanism for finding the next backwards match in the target text.
boolean isCanonical()
Determines whether canonical matches (option 1, as described in the class documentation) are enabled.
void reset()
Resets the search iteration.
void setCanonical(boolean allowCanonical)
Sets the canonical match mode.
void setCollator(RuleBasedCollator collator)
Sets the RuleBasedCollator to be used for language-specific searching.
void setIndex(int position)
Sets the position in the target text from which the next search will start.
void setPattern(String pattern)
Sets the pattern to search for.
void setTarget(CharacterIterator text)
Sets the target text to be searched.

Constructor Detail

StringSearch

public StringSearch(String pattern, CharacterIterator target, RuleBasedCollator collator, BreakIterator breakiter)
Initializes the iterator to use the language-specific rules defined in the argument collator to search for argument pattern in the argument target text. The argument breakiter is used to define logical matches. See super class documentation for more details on the use of the target text and BreakIterator.

Parameters:
pattern - text to look for.
target - target text to search for pattern.
collator - RuleBasedCollator that defines the language rules.
breakiter - a BreakIterator that is used to determine the boundaries of a logical match. This argument can be null.

Throws:
IllegalArgumentException - thrown when argument target is null, or of length 0.

See Also: BreakIterator, RuleBasedCollator, SearchIterator

UNKNOWN: ICU 2.0

StringSearch

public StringSearch(String pattern, CharacterIterator target, RuleBasedCollator collator)
Initializes the iterator to use the language-specific rules defined in the argument collator to search for argument pattern in the argument target text. No BreakIterators are set to test for logical matches.

Parameters:
pattern - text to look for.
target - target text to search for pattern.
collator - RuleBasedCollator that defines the language rules.

Throws:
IllegalArgumentException - thrown when argument target is null, or of length 0.

See Also: RuleBasedCollator, SearchIterator

UNKNOWN: ICU 2.0

StringSearch

public StringSearch(String pattern, CharacterIterator target, Locale locale)
Initializes the iterator to use the language-specific rules and break iterator rules defined in the argument locale to search for argument pattern in the argument target text. See super class documentation for more details on the use of the target text and BreakIterator.

Parameters:
pattern - text to look for.
target - target text to search for pattern.
locale - locale to use for language and break iterator rules.

Throws:
IllegalArgumentException - thrown when argument target is null, or of length 0.
ClassCastException - thrown if the collator for the specified locale is not a RuleBasedCollator.

See Also: BreakIterator, RuleBasedCollator, SearchIterator

UNKNOWN: ICU 2.0

StringSearch

public StringSearch(String pattern, CharacterIterator target, ULocale locale)
Initializes the iterator to use the language-specific rules and break iterator rules defined in the argument locale to search for argument pattern in the argument target text. See super class documentation for more details on the use of the target text and BreakIterator.

Parameters:
pattern - text to look for.
target - target text to search for pattern.
locale - ULocale to use for language and break iterator rules.

Throws:
IllegalArgumentException - thrown when argument target is null, or of length 0.
ClassCastException - thrown if the collator for the specified locale is not a RuleBasedCollator.

See Also: BreakIterator, RuleBasedCollator, SearchIterator

UNKNOWN: ICU 3.2 This API might change or be removed in a future release.

StringSearch

public StringSearch(String pattern, String target)
Initializes the iterator to use the language-specific rules and break iterator rules defined in the default locale to search for argument pattern in the argument target text. See super class documentation for more details on the use of the target text and BreakIterator.

Parameters:
pattern - text to look for.
target - target text to search for pattern.

Throws:
IllegalArgumentException - thrown when argument target is null, or of length 0.
ClassCastException - thrown if the collator for the default locale is not a RuleBasedCollator.

See Also: BreakIterator, RuleBasedCollator, SearchIterator

UNKNOWN: ICU 2.0

Method Detail

getCollator

public RuleBasedCollator getCollator()

Gets the RuleBasedCollator used for the language rules.

Since StringSearch depends on the returned RuleBasedCollator, any change to the returned RuleBasedCollator should be followed by a call to either StringSearch.reset() or StringSearch.setCollator(RuleBasedCollator) to ensure correct search behaviour.

Returns: RuleBasedCollator used by this StringSearch

See Also: RuleBasedCollator StringSearch

UNKNOWN: ICU 2.0
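The sequence described above (change an attribute on the returned collator, then call reset()) might look like the following sketch. It assumes Collator.setStrength(int) and the strength constants from com.ibm.icu.text.Collator, plus first() inherited from SearchIterator; the class and method names are hypothetical.

```java
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import com.ibm.icu.text.StringSearch;
import java.text.StringCharacterIterator;
import java.util.Locale;

class CollatorAttributeUpdate {
    /** Returns the index of the first match when searching at the given strength. */
    static int firstMatchAtStrength(String pattern, String text, int strength) {
        StringSearch search =
                new StringSearch(pattern, new StringCharacterIterator(text), Locale.ENGLISH);

        // Change the attribute on the collator that the search is using...
        RuleBasedCollator collator = search.getCollator();
        collator.setStrength(strength);

        // ...then tell StringSearch to pick up the new attribute values.
        search.reset();
        return search.first();
    }

    public static void main(String[] args) {
        String text = "A r\u00e9sum\u00e9";
        // PRIMARY strength ignores accent differences; TERTIARY does not.
        System.out.println(firstMatchAtStrength("resume", text, Collator.PRIMARY));
        System.out.println(firstMatchAtStrength("resume", text, Collator.TERTIARY));
    }
}
```

Calling setCollator(RuleBasedCollator) with the modified collator instead of reset() is the other route named above; unlike reset(), it is documented to leave the iterator's position unchanged.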

getIndex

public int getIndex()
Returns the index in the target text at which the iterator is currently positioned. If the iteration has gone past the end of the target text, or past the beginning for a backwards search, DONE is returned.

Returns: index in the target text at which the iterator is currently positioned

UNKNOWN: ICU 2.8

getPattern

public String getPattern()
Returns the pattern for which StringSearch is searching.

Returns: the pattern searched for

UNKNOWN: ICU 2.0

handleNext

protected int handleNext(int start)

Concrete method to provide the mechanism for finding the next forwards match in the target text. See super class documentation for its use.

Parameters:
start - index in the target text at which the forwards search should begin.

Returns: the starting index of the next forwards match if found, DONE otherwise

See Also: StringSearch

UNKNOWN: ICU 2.8

handlePrevious

protected int handlePrevious(int start)

Concrete method to provide the mechanism for finding the next backwards match in the target text. See super class documentation for its use.

Parameters:
start - index in the target text at which the backwards search should begin.

Returns: the starting index of the next backwards match if found, DONE otherwise

See Also: StringSearch

UNKNOWN: ICU 2.8

isCanonical

public boolean isCanonical()
Determines whether canonical matches (option 1, as described in the class documentation) are enabled. See setCanonical(boolean) for more information.

Returns: true if canonical matching is enabled, false otherwise

See Also: StringSearch

UNKNOWN: ICU 2.8

reset

public void reset()

Resets the search iteration. All properties will be reset to their default values.

If a forwards iteration is initiated before a backwards iteration, the search will begin at the start of the target text. Otherwise, if a backwards iteration is initiated first, the search will begin at the end of the target text.

The canonical match option will be reset to false, i.e. an exact match.

UNKNOWN: ICU 2.8

setCanonical

public void setCanonical(boolean allowCanonical)

Sets the canonical match mode. See the class documentation for details. The default setting for this property is false.

Parameters:
allowCanonical - flag indicating whether canonical matches are allowed

See Also: StringSearch

UNKNOWN: ICU 2.8
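The interplay of setCanonical(boolean), isCanonical() and reset() can be sketched as below. The sketch only tracks the flag; the search results printed in main depend on the collator and on the ICU4J version in use, so no expected output is given. The class and method names are hypothetical, and first() is assumed from SearchIterator.

```java
import com.ibm.icu.text.StringSearch;
import java.text.StringCharacterIterator;
import java.util.Locale;

class CanonicalModeToggle {
    /** Returns the flag value by default, after enabling it, and after reset(). */
    static boolean[] flagStates() {
        StringSearch search = new StringSearch(
                "a\u0300", new StringCharacterIterator("a\u0325\u0300"), Locale.ENGLISH);
        boolean byDefault = search.isCanonical();   // option 2 is the default
        search.setCanonical(true);                  // switch to option 1
        boolean afterSet = search.isCanonical();
        search.reset();                             // reset() restores the default
        boolean afterReset = search.isCanonical();
        return new boolean[] {byDefault, afterSet, afterReset};
    }

    public static void main(String[] args) {
        StringSearch search = new StringSearch(
                "a\u0300", new StringCharacterIterator("a\u0325\u0300"), Locale.ENGLISH);
        System.out.println("exact mode, first match: " + search.first());
        search.setCanonical(true);
        System.out.println("canonical mode, first match: " + search.first());
    }
}
```

Because reset() clears the flag, code that resets the iterator after changing collator attributes needs to call setCanonical(true) again if canonical matching is wanted.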

setCollator

public void setCollator(RuleBasedCollator collator)

Sets the RuleBasedCollator to be used for language-specific searching.

This method causes internal data such as Boyer-Moore shift tables to be recalculated, but the iterator's position is unchanged.

Parameters:
collator - collator to use for this StringSearch

Throws: IllegalArgumentException thrown when collator is null

See Also: StringSearch

UNKNOWN: ICU 2.0

setIndex

public void setIndex(int position)

Sets the position in the target text from which the next search will start. This method clears all previous states.

This method takes the argument position and sets the position in the target text accordingly, without checking if position is pointing to a valid starting point to begin searching.

Search positions that may render incorrect results are highlighted in the class documentation.

Parameters:
position - index to start next search from.

Throws: IndexOutOfBoundsException thrown if argument position is out of the target text range.

See Also: StringSearch

UNKNOWN: ICU 2.8
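A short sketch of choosing the start position follows. It assumes next() and DONE inherited from SearchIterator; the class name, method name and sample text are hypothetical. The sample text is plain ASCII, so none of the problematic start positions mentioned in the class documentation arise.

```java
import com.ibm.icu.text.StringSearch;
import java.text.StringCharacterIterator;
import java.util.Locale;

class StartPositionDemo {
    /** Returns the start of the first match found at or after position. */
    static int nextMatchFrom(String pattern, String text, int position) {
        StringSearch search =
                new StringSearch(pattern, new StringCharacterIterator(text), Locale.ENGLISH);
        // The position is taken as given; StringSearch does not validate that
        // it is a sensible place to start matching.
        search.setIndex(position);
        return search.next();
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy fox";
        System.out.println(nextMatchFrom("fox", text, 0));
        // Starting inside the first "fox" skips it and finds the second one.
        System.out.println(nextMatchFrom("fox", text, 17));
    }
}
```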

setPattern

public void setPattern(String pattern)

Set the pattern to search for.

This method causes internal data such as Boyer-Moore shift tables to be recalculated, but the iterator's position is unchanged.

Parameters:
pattern - pattern for searching

Throws: IllegalArgumentException thrown if pattern is null or of length 0

See Also: StringSearch

UNKNOWN: ICU 2.0

setTarget

public void setTarget(CharacterIterator text)
Sets the target text to be searched. Text iteration will then begin at the start of the text string. This method is useful if you want to reuse an iterator to search within a different body of text.

Parameters:
text - new text iterator in which to look for matches

Throws: IllegalArgumentException thrown when text is null or has 0 length

See Also: StringSearch

UNKNOWN: ICU 2.8
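Reusing one StringSearch for a new pattern and then a new target might look like the following sketch. It assumes Collator.getInstance(Locale) from com.ibm.icu.text.Collator and first() inherited from SearchIterator; the class name, method name and sample strings are hypothetical.

```java
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import com.ibm.icu.text.StringSearch;
import java.text.StringCharacterIterator;
import java.util.Locale;

class ReuseSearch {
    /** Returns the first match index for three successive configurations. */
    static int[] firstMatches() {
        // The cast can fail with ClassCastException for locales whose collator
        // is not rule based; English is assumed to be safe here.
        RuleBasedCollator collator =
                (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
        StringSearch search = new StringSearch(
                "cat", new StringCharacterIterator("cat hat"), collator);
        int catInFirstText = search.first();

        // New pattern, same target. setPattern leaves the position unchanged,
        // so first() is used to search again from the start.
        search.setPattern("hat");
        int hatInFirstText = search.first();

        // New target, same pattern.
        search.setTarget(new StringCharacterIterator("that hat"));
        int hatInSecondText = search.first();

        return new int[] {catInFirstText, hatInFirstText, hatInSecondText};
    }

    public static void main(String[] args) {
        for (int index : firstMatches()) {
            System.out.println(index);
        }
    }
}
```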

Copyright (c) 2007 IBM Corporation and others.