com.ibm.icu.lang

Class UCharacter

public final class UCharacter extends Object implements ECharacterCategory, ECharacterDirection

The UCharacter class provides extensions to the java.lang.Character class. These extensions support more Unicode properties and, together with the UTF16 class, support supplementary characters (those with code points above U+FFFF). Each ICU release supports the latest version of Unicode available at that time.

Code points are represented in this API using ints. While it would be more convenient in Java to have a separate primitive datatype for them, ints suffice in the meantime.
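As a rough illustration of code points as ints, the sketch below walks a string by code point rather than by char. It uses only java.lang.Character and String methods, on the assumption that the same-named UCharacter cover methods listed in the method summary behave the same way; the class name CodePointWalk is hypothetical.

```java
// Hypothetical demo class; uses only JDK methods, which the same-named
// UCharacter methods are documented to cover.
final class CodePointWalk {
    /** Returns the code points of s in order, each as an int. */
    static int[] codePoints(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // combines a surrogate pair if present
            out[n++] = cp;
            i += Character.charCount(cp);   // advance by 1 or 2 chars
        }
        return out;
    }

    public static void main(String[] args) {
        // "A" followed by U+1D11E (MUSICAL SYMBOL G CLEF) as a surrogate pair
        for (int cp : codePoints("A\uD834\uDD1E")) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

The loop advances by charCount(cp) so that a supplementary character is visited once, not once per surrogate.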

To use this class, add the jar file icu4j.jar to the class path, since it contains the data files that supply the information used by this class.
For example, on Windows:
set CLASSPATH=%CLASSPATH%;$JAR_FILE_PATH/icu4j.jar
Alternatively, copy the files uprops.dat and unames.icu from the icu4j source subdirectory $ICU4J_SRC/src/com.ibm.icu.impl.data to your class directory $ICU4J_CLASS/com.ibm.icu.impl.data.

Aside from the additions for UTF-16 support, and the updated Unicode properties, the main differences between UCharacter and Character are:

Further detailed differences can be determined from the program com.ibm.icu.dev.test.lang.UCharacterCompare.

In addition to Java compatibility functions, which calculate derived properties, this API provides low-level access to the Unicode Character Database.

Unicode assigns each code point (not just each assigned character) values for many properties. Most of them are simple boolean flags or constants from a small enumerated list. For some properties, values are strings or other more complex types.

For more information see "About the Unicode Character Database" (http://www.unicode.org/ucd/) and the ICU User Guide chapter on Properties (http://icu.sourceforge.net/userguide/properties.html).

There are also functions that provide easy migration from C/POSIX functions like isblank(). Their use is generally discouraged because the C/POSIX standards do not define their semantics beyond the ASCII range, which means that different implementations exhibit very different behavior. Instead, Unicode properties should be used directly.

There are also only a few, broad C/POSIX character classes, and they tend to be used for conflicting purposes. For example, the "isalpha()" class is sometimes used to determine word boundaries, while a more sophisticated approach would at least distinguish initial letters from continuation characters (the latter including combining marks). (In ICU, BreakIterator is the most sophisticated API for word boundaries.) Another example: There is no "istitle()" class for titlecase characters.

ICU 3.4 and later provides API access for all twelve C/POSIX character classes. ICU implements them according to the Standard Recommendations in Annex C: Compatibility Properties of UTS #18 Unicode Regular Expressions (http://www.unicode.org/reports/tr18/#Compatibility_Properties).

API access for C/POSIX character classes is as follows:
  • alpha: isUAlphabetic(c) or hasBinaryProperty(c, UProperty.ALPHABETIC)
  • lower: isULowercase(c) or hasBinaryProperty(c, UProperty.LOWERCASE)
  • upper: isUUppercase(c) or hasBinaryProperty(c, UProperty.UPPERCASE)
  • punct: ((1<

The C/POSIX character classes are also available in UnicodeSet patterns, using patterns like [:graph:] or \p{graph}.

Note: There are several ICU (and Java) whitespace functions. Comparison:
  • isUWhiteSpace=UCHAR_WHITE_SPACE: Unicode White_Space property; most of general categories "Z" (separators) + most whitespace ISO controls (including no-break spaces, but excluding IS1..IS4 and ZWSP)
  • isWhitespace: Java isWhitespace; Z + whitespace ISO controls but excluding no-break spaces
  • isSpaceChar: just Z (including no-break spaces)
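The difference between the last two functions can be seen with the JDK methods of the same name. This is only a sketch: it calls java.lang.Character, not UCharacter, assumes the UCharacter versions classify these three code points the same way, and does not cover isUWhiteSpace. The class name WhitespaceCompare is hypothetical.

```java
// Hypothetical demo class built on java.lang.Character only.
final class WhitespaceCompare {
    /** Describes how the two JDK whitespace tests classify a code point. */
    static String classify(int cp) {
        return "isWhitespace=" + Character.isWhitespace(cp)
             + " isSpaceChar=" + Character.isSpaceChar(cp);
    }

    public static void main(String[] args) {
        System.out.println("U+0020 " + classify(0x0020)); // SPACE
        System.out.println("U+00A0 " + classify(0x00A0)); // NO-BREAK SPACE
        System.out.println("U+0009 " + classify(0x0009)); // CHARACTER TABULATION
    }
}
```

The no-break space is a separator (category Zs) but is excluded by isWhitespace, while the tab is an ISO control that isWhitespace accepts and isSpaceChar rejects.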

This class is not subclassable.

Author: Syn Wee Quek

See Also: com.ibm.icu.lang.UCharacterEnums

UNKNOWN: ICU 2.1

Nested Class Summary
static interface UCharacter.DecompositionType
Decomposition Type constants.
static interface UCharacter.EastAsianWidth
East Asian Width constants.
static interface UCharacter.GraphemeClusterBreak
Grapheme Cluster Break constants.
static interface UCharacter.HangulSyllableType
Hangul Syllable Type constants.
static interface UCharacter.JoiningGroup
Joining Group constants.
static interface UCharacter.JoiningType
Joining Type constants.
static interface UCharacter.LineBreak
Line Break constants.
static interface UCharacter.NumericType
Numeric Type constants.
static interface UCharacter.SentenceBreak
Sentence Break constants.
static class UCharacter.UnicodeBlock
A family of character subsets representing the character blocks in the Unicode specification, generated from Unicode Data file Blocks.txt.
static interface UCharacter.WordBreak
Word Break constants.
Field Summary
static int FOLD_CASE_DEFAULT
Option value for case folding: use default mappings defined in CaseFolding.txt.
static int FOLD_CASE_EXCLUDE_SPECIAL_I
Option value for case folding: exclude the mappings for dotted I and dotless i marked with 'I' in CaseFolding.txt.
static int MAX_CODE_POINT
Cover the JDK 1.5 API, for convenience.
static char MAX_HIGH_SURROGATE
Cover the JDK 1.5 API, for convenience.
static char MAX_LOW_SURROGATE
Cover the JDK 1.5 API, for convenience.
static int MAX_RADIX
Compatibility constant for Java Character's MAX_RADIX.
static char MAX_SURROGATE
Cover the JDK 1.5 API, for convenience.
static int MAX_VALUE
The highest Unicode code point value (scalar value) according to the Unicode Standard.
static int MIN_CODE_POINT
Cover the JDK 1.5 API, for convenience.
static char MIN_HIGH_SURROGATE
Cover the JDK 1.5 API, for convenience.
static char MIN_LOW_SURROGATE
Cover the JDK 1.5 API, for convenience.
static int MIN_RADIX
Compatibility constant for Java Character's MIN_RADIX.
static int MIN_SUPPLEMENTARY_CODE_POINT
Cover the JDK 1.5 API, for convenience.
static char MIN_SURROGATE
Cover the JDK 1.5 API, for convenience.
static int MIN_VALUE
The lowest Unicode code point value.
static double NO_NUMERIC_VALUE
Special value that is returned by getUnicodeNumericValue(int) when no numeric value is defined for a code point.
static int REPLACEMENT_CHAR
Unicode value used when translating into Unicode encoding form and there is no existing character.
static int SUPPLEMENTARY_MIN_VALUE
The minimum value for Supplementary code points
Method Summary
static int charCount(int cp)
Cover the JDK 1.5 API, for convenience.
static int codePointAt(CharSequence seq, int index)
Cover the JDK 1.5 API, for convenience.
static int codePointAt(char[] text, int index)
Cover the JDK 1.5 API, for convenience.
static int codePointAt(char[] text, int index, int limit)
Cover the JDK 1.5 API, for convenience.
static int codePointBefore(CharSequence seq, int index)
Cover the JDK 1.5 API, for convenience.
static int codePointBefore(char[] text, int index)
Cover the JDK 1.5 API, for convenience.
static int codePointBefore(char[] text, int index, int limit)
Cover the JDK 1.5 API, for convenience.
static int codePointCount(CharSequence text, int start, int limit)
Cover the JDK API, for convenience.
static int codePointCount(char[] text, int start, int limit)
Cover the JDK API, for convenience.
static int digit(int ch, int radix)
Retrieves the numeric value of a decimal digit code point.
static int digit(int ch)
Retrieves the numeric value of a decimal digit code point.
static int foldCase(int ch, boolean defaultmapping)
The given character is mapped to its case folding equivalent according to UnicodeData.txt and CaseFolding.txt; if the character has no case folding equivalent, the character itself is returned.
static String foldCase(String str, boolean defaultmapping)
The given string is mapped to its case folding equivalent according to UnicodeData.txt and CaseFolding.txt; if any character has no case folding equivalent, the character itself is returned.
static int foldCase(int ch, int options)
The given character is mapped to its case folding equivalent according to UnicodeData.txt and CaseFolding.txt; if the character has no case folding equivalent, the character itself is returned.
static String foldCase(String str, int options)
The given string is mapped to its case folding equivalent according to UnicodeData.txt and CaseFolding.txt; if any character has no case folding equivalent, the character itself is returned.
static char forDigit(int digit, int radix)
Provide the java.lang.Character forDigit API, for convenience.
static VersionInfo getAge(int ch)

Get the "age" of the code point.

The "age" is the Unicode version when the code point was first designated (as a non-character or for Private Use) or assigned a character.

static int getCharFromExtendedName(String name)

Find a Unicode character by its name or extended name and return its code point value.

static int getCharFromName(String name)

Find a Unicode code point by its most current Unicode name and return its code point value.

static int getCharFromName1_0(String name)

Find a Unicode character by its version 1.0 Unicode name and return its code point value.

static int getCodePoint(char lead, char trail)
Returns a code point corresponding to the two UTF16 characters.
static int getCodePoint(char char16)
Returns the code point corresponding to the UTF16 character.
static int getCombiningClass(int ch)
Gets the combining class of the argument codepoint.
static int getDirection(int ch)
Returns the Bidirection property of a code point.
static byte getDirectionality(int cp)
Cover the JDK API, for convenience.
static String getExtendedName(int ch)

Retrieves a name for a valid codepoint.

static ValueIterator getExtendedNameIterator()

Gets an iterator for character names, iterating over codepoints.

This API only gets the iterator for the extended names.

static int getHanNumericValue(int ch)
Return numeric value of Han code points.
static int getIntPropertyMaxValue(int type)
Get the maximum value for an integer/binary Unicode property.
static int getIntPropertyMinValue(int type)
Get the minimum value for an integer/binary Unicode property type.
static int getIntPropertyValue(int ch, int type)

Gets the property value for a Unicode property type of a code point.

static String getISOComment(int ch)
Get the ISO 10646 comment for a character.
static int getMirror(int ch)
Maps the specified code point to a "mirror-image" code point.
static String getName(int ch)
Retrieve the most current Unicode name of the argument code point, or null if the character is unassigned or outside the range UCharacter.MIN_VALUE and UCharacter.MAX_VALUE or does not have a name.
static String getName(String s, String separator)
Gets the names for each of the characters in a string.
static String getName1_0(int ch)
Retrieve the earlier version 1.0 Unicode name of the argument code point, or null if the character is unassigned or outside the range UCharacter.MIN_VALUE and UCharacter.MAX_VALUE or does not have a name.
static ValueIterator getName1_0Iterator()

Gets an iterator for character names, iterating over codepoints.

This API only gets the iterator for the older 1.0 Unicode names.

static ValueIterator getNameIterator()

Gets an iterator for character names, iterating over codepoints.

This API only gets the iterator for the modern, most up-to-date Unicode names.

static int getNumericValue(int ch)
Returns the numeric value of the code point as a nonnegative integer.
static int getPropertyEnum(String propertyAlias)
Return the UProperty selector for a given property name, as specified in the Unicode database file PropertyAliases.txt.
static String getPropertyName(int property, int nameChoice)
Return the Unicode name for a given property, as given in the Unicode database file PropertyAliases.txt.
static int getPropertyValueEnum(int property, String valueAlias)
Return the property value integer for a given value name, as specified in the Unicode database file PropertyValueAliases.txt.
static String getPropertyValueName(int property, int value, int nameChoice)
Return the Unicode name for a given property value, as given in the Unicode database file PropertyValueAliases.txt.
static String getStringPropertyValue(int propertyEnum, int codepoint, int nameChoice)
Returns a string version of the property value.
static int getType(int ch)
Returns a value indicating a code point's Unicode category.
static RangeValueIterator getTypeIterator()

Gets an iterator for character types, iterating over codepoints.

Example of use:
 RangeValueIterator iterator = UCharacter.getTypeIterator();
 RangeValueIterator.Element element = new RangeValueIterator.Element();
 while (iterator.next(element)) {
     System.out.println("Codepoint \\u" + 
                        Integer.toHexString(element.start) + 
                        " to codepoint \\u" +
                        Integer.toHexString(element.limit - 1) + 
                        " has the character type " + 
                        element.value);
 }
 
static double getUnicodeNumericValue(int ch)

Get the numeric value for a Unicode code point as defined in the Unicode Character Database.

A "double" return type is necessary because some numeric values are fractions, negative, or too large for int.

For characters without any numeric values in the Unicode Character Database, this function will return NO_NUMERIC_VALUE.

API Change: In release 2.2 and prior, this API had a return type of int and returned -1 when the argument ch did not have a corresponding numeric value.

static VersionInfo getUnicodeVersion()
Gets the version of Unicode data used.
static boolean hasBinaryProperty(int ch, int property)

Check a binary Unicode property for a code point.

Unicode, especially in version 3.2, defines many more properties than the original set in UnicodeData.txt.

This API is intended to reflect Unicode properties as defined in the Unicode Character Database (UCD) and Unicode Technical Reports (UTR).

For details about the properties see http://www.unicode.org/.

For names of Unicode properties see the UCD file PropertyAliases.txt.

This API does not check the validity of the codepoint.

Important: If ICU is built with UCD files from Unicode versions below 3.2, then properties marked with "new" are unavailable or only partially available.

static boolean isBaseForm(int ch)
Determines whether the specified code point is of base form.
static boolean isBMP(int ch)
Determines if the code point is in the BMP (Basic Multilingual Plane).
static boolean isDefined(int ch)
Determines if a code point has a defined meaning in the up-to-date Unicode standard.
static boolean isDigit(int ch)
Determines if a code point is a Java digit.
static boolean isHighSurrogate(char ch)
Cover the JDK 1.5 API, for convenience.
static boolean isIdentifierIgnorable(int ch)
Determines if the specified code point should be regarded as an ignorable character in a Unicode identifier.
static boolean isISOControl(int ch)
Determines if the specified code point is an ISO control character.
static boolean isJavaIdentifierPart(int cp)
Compatibility override of Java method, delegates to java.lang.Character.isJavaIdentifierPart.
static boolean isJavaIdentifierStart(int cp)
Compatibility override of Java method, delegates to java.lang.Character.isJavaIdentifierStart.
static boolean isJavaLetter(int cp)
Compatibility override of Java deprecated method.
static boolean isJavaLetterOrDigit(int cp)
Compatibility override of Java deprecated method.
static boolean isLegal(int ch)
A code point is illegal if and only if it is one of the following:
  • Out of bounds, less than 0 or greater than UCharacter.MAX_VALUE
  • A surrogate value, 0xD800 to 0xDFFF
  • Not-a-character, having the form 0x xxFFFF or 0x xxFFFE
Note: legal does not mean that it is assigned in this version of Unicode.
static boolean isLegal(String str)
A string is legal if and only if all its code points are legal.
static boolean isLetter(int ch)
Determines if the specified code point is a letter.
static boolean isLetterOrDigit(int ch)
Determines if the specified code point is a letter or digit.
static boolean isLowerCase(int ch)
Determines if the specified code point is a lowercase character.
static boolean isLowSurrogate(char ch)
Cover the JDK 1.5 API, for convenience.
static boolean isMirrored(int ch)
Determines whether the code point has the "mirrored" property.
static boolean isPrintable(int ch)
Determines whether the specified code point is a printable character according to the Unicode standard.
static boolean isSpace(int ch)
Compatibility override of Java deprecated method.
static boolean isSpaceChar(int ch)
Determines if the specified code point is a Unicode specified space character, i.e. if the code point is in one of the categories Zs, Zl, or Zp.
static boolean isSupplementary(int ch)
Determines if the code point is a supplementary character.
static boolean isSupplementaryCodePoint(int cp)
Cover the JDK 1.5 API, for convenience.
static boolean isSurrogatePair(char high, char low)
Cover the JDK 1.5 API, for convenience.
static boolean isTitleCase(int ch)
Determines if the specified code point is a titlecase character.
static boolean isUAlphabetic(int ch)

Check if a code point has the Alphabetic Unicode property.

Same as UCharacter.hasBinaryProperty(ch, UProperty.ALPHABETIC).

Different from UCharacter.isLetter(ch)!

static boolean isULowercase(int ch)

Check if a code point has the Lowercase Unicode property.

Same as UCharacter.hasBinaryProperty(ch, UProperty.LOWERCASE).

This is different from UCharacter.isLowerCase(ch)!

static boolean isUnicodeIdentifierPart(int ch)
Determines if the specified code point may be any part of a Unicode identifier other than the starting character.
static boolean isUnicodeIdentifierStart(int ch)
Determines if the specified code point is permissible as the first character in a Unicode identifier.
static boolean isUpperCase(int ch)
Determines if the specified code point is an uppercase character.
static boolean isUUppercase(int ch)

Check if a code point has the Uppercase Unicode property.

Same as UCharacter.hasBinaryProperty(ch, UProperty.UPPERCASE).

This is different from UCharacter.isUpperCase(ch)!

static boolean isUWhiteSpace(int ch)

Check if a code point has the White_Space Unicode property.

Same as UCharacter.hasBinaryProperty(ch, UProperty.WHITE_SPACE).

This is different from both UCharacter.isSpace(ch) and UCharacter.isWhitespace(ch)!

static boolean isValidCodePoint(int cp)
Cover the JDK 1.5 API, for convenience.
static boolean isWhitespace(int ch)
Determines if the specified code point is a white space character.
static int offsetByCodePoints(CharSequence text, int index, int codePointOffset)
Cover the JDK API, for convenience.
static int offsetByCodePoints(char[] text, int start, int count, int index, int codePointOffset)
Cover the JDK API, for convenience.
static int toChars(int cp, char[] dst, int dstIndex)
Cover the JDK 1.5 API, for convenience.
static char[] toChars(int cp)
Cover the JDK 1.5 API, for convenience.
static int toCodePoint(char high, char low)
Cover the JDK 1.5 API, for convenience.
static int toLowerCase(int ch)
The given code point is mapped to its lowercase equivalent; if the code point has no lowercase equivalent, the code point itself is returned.
static String toLowerCase(String str)
Gets lowercase version of the argument string.
static String toLowerCase(Locale locale, String str)
Gets lowercase version of the argument string.
static String toLowerCase(ULocale locale, String str)
Gets lowercase version of the argument string.
static String toString(int ch)
Converts argument code point and returns a String object representing the code point's value in UTF16 format.
static int toTitleCase(int ch)
Converts the code point argument to titlecase.
static String toTitleCase(String str, BreakIterator breakiter)

Gets the titlecase version of the argument string.

The positions for titlecasing are determined by the argument break iterator, so callers can customize the break iterator for specialized titlecasing.

static String toTitleCase(Locale locale, String str, BreakIterator breakiter)

Gets the titlecase version of the argument string.

The positions for titlecasing are determined by the argument break iterator, so callers can customize the break iterator for specialized titlecasing.

static String toTitleCase(ULocale locale, String str, BreakIterator titleIter)

Gets the titlecase version of the argument string.

The positions for titlecasing are determined by the argument break iterator, so callers can customize the break iterator for specialized titlecasing.

static int toUpperCase(int ch)
Converts the character argument to uppercase.
static String toUpperCase(String str)
Gets uppercase version of the argument string.
static String toUpperCase(Locale locale, String str)
Gets uppercase version of the argument string.
static String toUpperCase(ULocale locale, String str)
Gets uppercase version of the argument string.

Field Detail

FOLD_CASE_DEFAULT

public static final int FOLD_CASE_DEFAULT
Option value for case folding: use default mappings defined in CaseFolding.txt.

UNKNOWN: ICU 2.6

FOLD_CASE_EXCLUDE_SPECIAL_I

public static final int FOLD_CASE_EXCLUDE_SPECIAL_I
Option value for case folding: exclude the mappings for dotted I and dotless i marked with 'I' in CaseFolding.txt.

UNKNOWN: ICU 2.6

MAX_CODE_POINT

public static final int MAX_CODE_POINT
Cover the JDK 1.5 API, for convenience.

See Also: CODEPOINT_MAX_VALUE

UNKNOWN: ICU 3.0

MAX_HIGH_SURROGATE

public static final char MAX_HIGH_SURROGATE
Cover the JDK 1.5 API, for convenience.

See Also: LEAD_SURROGATE_MAX_VALUE

UNKNOWN: ICU 3.0

MAX_LOW_SURROGATE

public static final char MAX_LOW_SURROGATE
Cover the JDK 1.5 API, for convenience.

See Also: TRAIL_SURROGATE_MAX_VALUE

UNKNOWN: ICU 3.0

MAX_RADIX

public static final int MAX_RADIX
Compatibility constant for Java Character's MAX_RADIX.

UNKNOWN: ICU 3.4 This API might change or be removed in a future release.

MAX_SURROGATE

public static final char MAX_SURROGATE
Cover the JDK 1.5 API, for convenience.

See Also: SURROGATE_MAX_VALUE

UNKNOWN: ICU 3.0

MAX_VALUE

public static final int MAX_VALUE
The highest Unicode code point value (scalar value) according to the Unicode Standard. This value, U+10FFFF, needs 21 bits.
Up-to-date Unicode implementation of java.lang.Character.MAX_VALUE

UNKNOWN: ICU 2.1
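The 21-bit figure can be checked with a few lines of arithmetic. The sketch below uses java.lang.Character.MAX_CODE_POINT as a stand-in, on the assumption that UCharacter.MAX_VALUE holds the same value (U+10FFFF); the class and method names are hypothetical.

```java
// Hypothetical demo class; Character.MAX_CODE_POINT stands in for
// UCharacter.MAX_VALUE (assumed to be the same value).
final class MaxCodePointBits {
    /** Number of bits needed to hold the non-negative value v. */
    static int bitsNeeded(int v) {
        return 32 - Integer.numberOfLeadingZeros(v);
    }

    public static void main(String[] args) {
        int max = Character.MAX_CODE_POINT;
        System.out.println("max code point = 0x" + Integer.toHexString(max));
        System.out.println("bits needed    = " + bitsNeeded(max));
    }
}
```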

MIN_CODE_POINT

public static final int MIN_CODE_POINT
Cover the JDK 1.5 API, for convenience.

See Also: CODEPOINT_MIN_VALUE

UNKNOWN: ICU 3.0

MIN_HIGH_SURROGATE

public static final char MIN_HIGH_SURROGATE
Cover the JDK 1.5 API, for convenience.

See Also: LEAD_SURROGATE_MIN_VALUE

UNKNOWN: ICU 3.0

MIN_LOW_SURROGATE

public static final char MIN_LOW_SURROGATE
Cover the JDK 1.5 API, for convenience.

See Also: TRAIL_SURROGATE_MIN_VALUE

UNKNOWN: ICU 3.0

MIN_RADIX

public static final int MIN_RADIX
Compatibility constant for Java Character's MIN_RADIX.

UNKNOWN: ICU 3.4 This API might change or be removed in a future release.

MIN_SUPPLEMENTARY_CODE_POINT

public static final int MIN_SUPPLEMENTARY_CODE_POINT
Cover the JDK 1.5 API, for convenience.

See Also: SUPPLEMENTARY_MIN_VALUE

UNKNOWN: ICU 3.0

MIN_SURROGATE

public static final char MIN_SURROGATE
Cover the JDK 1.5 API, for convenience.

See Also: SURROGATE_MIN_VALUE

UNKNOWN: ICU 3.0
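How the surrogate and supplementary constants relate can be sketched by encoding the first and last supplementary code points as UTF-16. The example relies on the java.lang.Character constants of the same names, which the fields above are documented to cover; the class SurrogateBounds and its method are hypothetical.

```java
// Hypothetical demo class; uses the java.lang.Character constants that the
// UCharacter fields of the same names are documented to cover.
final class SurrogateBounds {
    /** Encodes cp as UTF-16 and reports whether it is exactly the pair (high, low). */
    static boolean encodesAs(int cp, char high, char low) {
        char[] units = Character.toChars(cp);
        return units.length == 2 && units[0] == high && units[1] == low;
    }

    public static void main(String[] args) {
        // First supplementary code point uses the lowest high and low surrogates.
        System.out.println(encodesAs(Character.MIN_SUPPLEMENTARY_CODE_POINT,
                Character.MIN_HIGH_SURROGATE, Character.MIN_LOW_SURROGATE));
        // Last code point uses the highest high and low surrogates.
        System.out.println(encodesAs(Character.MAX_CODE_POINT,
                Character.MAX_HIGH_SURROGATE, Character.MAX_LOW_SURROGATE));
    }
}
```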

MIN_VALUE

public static final int MIN_VALUE
The lowest Unicode code point value.

UNKNOWN: ICU 2.1

NO_NUMERIC_VALUE

public static final double NO_NUMERIC_VALUE
Special value that is returned by getUnicodeNumericValue(int) when no numeric value is defined for a code point.

See Also: UCharacter

UNKNOWN: ICU 2.4

REPLACEMENT_CHAR

public static final int REPLACEMENT_CHAR
Unicode value used when translating into Unicode encoding form and there is no existing character.

UNKNOWN: ICU 2.1

SUPPLEMENTARY_MIN_VALUE

public static final int SUPPLEMENTARY_MIN_VALUE
The minimum value for Supplementary code points

UNKNOWN: ICU 2.1

Method Detail

charCount

public static int charCount(int cp)
Cover the JDK 1.5 API, for convenience. Return the number of chars needed to represent the code point. This does not check the code point for validity.

Parameters:
  cp - the code point to check

Returns: the number of chars needed to represent the code point

See Also: UTF16

UNKNOWN: ICU 3.0
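A minimal sketch of one use of charCount: sizing a UTF-16 buffer for a list of code points. It calls java.lang.Character.charCount, which this method is documented to cover; the class and method names are hypothetical.

```java
// Hypothetical demo class; Character.charCount is the JDK method that
// UCharacter.charCount is documented to cover.
final class CharCountDemo {
    /** Total number of UTF-16 chars needed to store the given code points. */
    static int utf16Length(int... codePoints) {
        int total = 0;
        for (int cp : codePoints) {
            total += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return total;
    }

    public static void main(String[] args) {
        // 'A' needs one char, U+1F600 needs a surrogate pair
        System.out.println(utf16Length(0x41, 0x1F600));
    }
}
```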

codePointAt

public static final int codePointAt(CharSequence seq, int index)
Cover the JDK 1.5 API, for convenience. Return the code point at index.
Note: the semantics of this API are different from the related UTF16 API. This examines only the characters at index and index+1.

Parameters:
  seq - the characters to check
  index - the index of the first or only char forming the code point

Returns: the code point at the index

UNKNOWN: ICU 3.0

codePointAt

public static final int codePointAt(char[] text, int index)
Cover the JDK 1.5 API, for convenience. Return the code point at index.
Note: the semantics of this API are different from the related UTF16 API. This examines only the characters at index and index+1.

Parameters:
  text - the characters to check
  index - the index of the first or only char forming the code point

Returns: the code point at the index

UNKNOWN: ICU 3.0

codePointAt

public static final int codePointAt(char[] text, int index, int limit)
Cover the JDK 1.5 API, for convenience. Return the code point at index.
Note: the semantics of this API are different from the related UTF16 API. This examines only the characters at index and index+1.

Parameters:
  text - the characters to check
  index - the index of the first or only char forming the code point
  limit - the limit of the valid text

Returns: the code point at the index

UNKNOWN: ICU 3.0
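The "index and index+1" rule can be illustrated with the JDK methods that these overloads are documented to cover. This is a sketch using java.lang.Character.codePointAt rather than UCharacter itself; the class CodePointAtDemo and its helpers are hypothetical.

```java
// Hypothetical demo class; calls the JDK methods that the UCharacter
// codePointAt overloads are documented to cover.
final class CodePointAtDemo {
    // 'x', then U+1F600 as a surrogate pair, then 'y'
    static final char[] TEXT = {'x', '\uD83D', '\uDE00', 'y'};

    static int at(int index) {
        return Character.codePointAt(TEXT, index);
    }

    static int atWithLimit(int index, int limit) {
        return Character.codePointAt(TEXT, index, limit);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(at(1)));             // index on the high surrogate
        System.out.println(Integer.toHexString(at(2)));             // index on the low surrogate
        System.out.println(Integer.toHexString(atWithLimit(1, 2))); // limit cuts the pair
    }
}
```

Starting on the low surrogate does not look backwards, so the lone surrogate value is returned; the same happens when limit excludes the second half of a pair.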

codePointBefore

public static final int codePointBefore(CharSequence seq, int index)
Cover the JDK 1.5 API, for convenience. Return the code point before index.
Note: the semantics of this API are different from the related UTF16 API. This examines only the characters at index-1 and index-2.

Parameters:
  seq - the characters to check
  index - the index after the last or only char forming the code point

Returns: the code point before the index

UNKNOWN: ICU 3.0

codePointBefore

public static final int codePointBefore(char[] text, int index)
Cover the JDK 1.5 API, for convenience. Return the code point before index.
Note: the semantics of this API are different from the related UTF16 API. This examines only the characters at index-1 and index-2.

Parameters:
  text - the characters to check
  index - the index after the last or only char forming the code point

Returns: the code point before the index

UNKNOWN: ICU 3.0

codePointBefore

public static final int codePointBefore(char[] text, int index, int limit)
Cover the JDK 1.5 API, for convenience. Return the code point before index.
Note: the semantics of this API are different from the related UTF16 API. This examines only the characters at index-1 and index-2.

Parameters:
  text - the characters to check
  index - the index after the last or only char forming the code point
  limit - the start of the valid text

Returns: the code point before the index

UNKNOWN: ICU 3.0
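The backward-looking variant can be sketched the same way, again with the java.lang.Character methods that these overloads are documented to cover. The class CodePointBeforeDemo and its helpers are hypothetical.

```java
// Hypothetical demo class; calls the JDK methods that the UCharacter
// codePointBefore overloads are documented to cover.
final class CodePointBeforeDemo {
    // 'x', then U+1F600 as a surrogate pair, then 'y'
    static final char[] TEXT = {'x', '\uD83D', '\uDE00', 'y'};

    static int before(int index) {
        return Character.codePointBefore(TEXT, index);
    }

    static int beforeWithStart(int index, int start) {
        return Character.codePointBefore(TEXT, index, start);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(before(3)));             // pair ends just before index 3
        System.out.println(Integer.toHexString(before(2)));             // only the high surrogate precedes index 2
        System.out.println(Integer.toHexString(beforeWithStart(3, 2))); // start bound cuts the pair
    }
}
```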

codePointCount

public static int codePointCount(CharSequence text, int start, int limit)
Cover the JDK API, for convenience. Count the number of code points in the range of text.

Parameters:
  text - the characters to check
  start - the start of the range
  limit - the limit of the range

Returns: the number of code points in the range

UNKNOWN: ICU 3.0

codePointCount

public static int codePointCount(char[] text, int start, int limit)
Cover the JDK API, for convenience. Count the number of code points in the range of text.

Parameters:
  text - the characters to check
  start - the start of the range
  limit - the limit of the range

Returns: the number of code points in the range

UNKNOWN: ICU 3.0
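A short sketch of range counting, using java.lang.Character.codePointCount as the JDK method these overloads are documented to cover. The class CodePointCountDemo is hypothetical.

```java
// Hypothetical demo class; Character.codePointCount is the JDK method that
// UCharacter.codePointCount is documented to cover.
final class CodePointCountDemo {
    // 'a', then U+1F600 as a surrogate pair, then 'b': 4 chars, 3 code points
    static final String TEXT = "a\uD83D\uDE00b";

    static int count(int start, int limit) {
        return Character.codePointCount(TEXT, start, limit);
    }

    public static void main(String[] args) {
        System.out.println(count(0, TEXT.length())); // whole string
        System.out.println(count(0, 2));             // range ends between the surrogates
    }
}
```

When the range splits a surrogate pair, the unpaired surrogate inside the range counts as one code point.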

digit

public static int digit(int ch, int radix)
Retrieves the numeric value of a decimal digit code point.
This method observes the semantics of java.lang.Character.digit(). Note that this will return positive values for code points for which isDigit returns false, just like java.lang.Character.
Semantic Change: In release 1.3.1 and prior, this did not treat the European letters as having a digit value, and also treated numeric letters and other numbers as digits. This has been changed to conform to the Java semantics.
A code point is a valid digit if and only if:
  • ch is a decimal digit or one of the European letters, and
  • the value of ch is less than the specified radix.

Parameters:
  ch - the code point to query
  radix - the radix

Returns: the numeric value represented by the code point in the specified radix, or -1 if the code point is not a decimal digit or if its value is too large for the radix

UNKNOWN: ICU 2.1

digit

public static int digit(int ch)
Retrieves the numeric value of a decimal digit code point.
This is a convenience overload of digit(int, int) that provides a decimal radix.
Semantic Change: In release 1.3.1 and prior, this treated numeric letters and other numbers as digits. This has been changed to conform to the Java semantics.

Parameters:
  ch - the code point to query

Returns: the numeric value represented by the code point, or -1 if the code point is not a decimal digit or if its value is too large for a decimal radix

UNKNOWN: ICU 2.1
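Since digit is documented to observe the semantics of java.lang.Character.digit(), those semantics can be sketched with the JDK method directly. The parser below is a hypothetical helper, not part of either API, and ignores overflow for brevity.

```java
// Hypothetical demo class; uses java.lang.Character.digit, whose semantics
// UCharacter.digit is documented to observe.
final class DigitDemo {
    /**
     * Parses s as an unsigned number in the given radix.
     * Returns -1 if any code point is not a valid digit for that radix.
     * Overflow is not checked in this sketch.
     */
    static long parse(String s, int radix) {
        long value = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int d = Character.digit(cp, radix);
            if (d < 0) {
                return -1; // not a digit, or too large for the radix
            }
            value = value * radix + d;
            i += Character.charCount(cp);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(parse("ff", 16));           // letters count as digits in radix 16
        System.out.println(parse("19", 8));            // '9' is too large for radix 8
        System.out.println(parse("\u0661\u0662", 10)); // ARABIC-INDIC DIGIT ONE, TWO
    }
}
```

The letter 'f' has a digit value in radix 16 even though isDigit('f') is false, which is the behaviour the note above describes.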

foldCase

public static int foldCase(int ch, boolean defaultmapping)
The given character is mapped to its case folding equivalent according to UnicodeData.txt and CaseFolding.txt; if the character has no case folding equivalent, the character itself is returned.

This function only returns the simple, single-code point case mapping. Full case mappings should be used whenever possible because they produce better results by working on whole strings. They can map to a result string with a different length as appropriate. Full case mappings are applied by the case mapping functions that take String parameters rather than code points (int). See also the User Guide chapter on C/POSIX migration: http://icu.sourceforge.net/userguide/posix.html#case_mappings

Parameters:
  ch - the character to be converted
  defaultmapping - indicates whether all mappings defined in CaseFolding.txt are to be used; otherwise the mappings for dotted I and dotless i marked with 'I' in CaseFolding.txt will be skipped

Returns: the case folding equivalent of the character, if any; otherwise the character itself.

See Also: UCharacter

UNKNOWN: ICU 2.1

foldCase

public static String foldCase(String str, boolean defaultmapping)
The given string is mapped to its case folding equivalent according to UnicodeData.txt and CaseFolding.txt; if any character has no case folding equivalent, the character itself is returned. "Full", multiple-code point case folding mappings are returned here. For "simple" single-code point mappings use the API foldCase(int ch, boolean defaultmapping).

Parameters:
  str - the String to be converted
  defaultmapping - indicates whether all mappings defined in CaseFolding.txt are to be used; otherwise the mappings for dotted I and dotless i marked with 'I' in CaseFolding.txt will be skipped

Returns: the case folding equivalent of the string, if any; otherwise the string itself.

See Also: UCharacter

UNKNOWN: ICU 2.1

foldCase

public static int foldCase(int ch, int options)
The given character is mapped to its case folding equivalent according to UnicodeData.txt and CaseFolding.txt; if the character has no case folding equivalent, the character itself is returned.

This function only returns the simple, single-code point case mapping. Full case mappings should be used whenever possible because they produce better results by working on whole strings. They can map to a result string with a different length as appropriate. Full case mappings are applied by the case mapping functions that take String parameters rather than code points (int). See also the User Guide chapter on C/POSIX migration: http://icu.sourceforge.net/userguide/posix.html#case_mappings

Parameters:
  ch - the character to be converted
  options - a bit set for special processing. Currently the recognised options are FOLD_CASE_EXCLUDE_SPECIAL_I and FOLD_CASE_DEFAULT

Returns: the case folding equivalent of the character, if any; otherwise the character itself.

See Also: UCharacter

UNKNOWN: ICU 2.6
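
The distinction between simple (single code point) and full (whole string) case mappings can be illustrated without ICU4J on the class path. The sketch below is an analogy only: it uses the JDK's uppercase conversion rather than case folding, and the class and method names (CaseMappingSketch, simpleUpper, fullUpper) are hypothetical. It shows that a simple mapping leaves U+00DF unchanged, while a full mapping may lengthen the string.

```java
import java.util.Locale;

// Hypothetical illustration; uses only java.lang, not com.ibm.icu.lang.UCharacter.
class CaseMappingSketch {
    /** Simple mapping: one code point in, exactly one code point out. */
    static int simpleUpper(int codePoint) {
        return Character.toUpperCase(codePoint);
    }

    /** Full mapping: works on the whole string and may change its length. */
    static String fullUpper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        int sharpS = 0x00DF; // LATIN SMALL LETTER SHARP S
        // The simple mapping has no single-code-point uppercase form to return.
        System.out.println(Integer.toHexString(simpleUpper(sharpS)));
        // The full mapping expands the sharp s to two letters.
        System.out.println(fullUpper("stra\u00DFe"));
    }
}
```

This is why the text above recommends the String-based functions whenever whole strings are being processed.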

foldCase

public static final String foldCase(String str, int options)
The given string is mapped to its case folding equivalent according to UnicodeData.txt and CaseFolding.txt; if any character has no case folding equivalent, the character itself is returned. "Full", multiple-code point case folding mappings are returned here. For "simple" single-code point mappings use the API foldCase(int ch, int options).

Parameters:
str - the String to be converted
options - a bit set for special processing. Currently the recognised options are FOLD_CASE_EXCLUDE_SPECIAL_I and FOLD_CASE_DEFAULT.

Returns: the case folding equivalent of the string; characters with no case folding equivalent are left unchanged.

See Also: UCharacter

UNKNOWN: ICU 2.6

forDigit

public static char forDigit(int digit, int radix)
Provide the java.lang.Character forDigit API, for convenience.

UNKNOWN: ICU 3.0
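
As a rough illustration of what the covered JDK API does, the sketch below builds a digit string in an arbitrary radix with java.lang.Character.forDigit. The class and helper names (ForDigitSketch, toRadixString) are hypothetical, and the code calls the JDK method directly rather than UCharacter.

```java
// Hypothetical helper; relies only on java.lang.Character.forDigit.
class ForDigitSketch {
    /** Formats a non-negative value in the given radix, lowest digit first, then reverses. */
    static String toRadixString(int value, int radix) {
        if (value < 0 || radix < Character.MIN_RADIX || radix > Character.MAX_RADIX) {
            throw new IllegalArgumentException("value must be >= 0 and radix in [2, 36]");
        }
        if (value == 0) {
            return "0";
        }
        StringBuilder digits = new StringBuilder();
        for (int rest = value; rest > 0; rest /= radix) {
            // forDigit maps 0..radix-1 to '0'..'9' and then 'a'..'z'.
            digits.append(Character.forDigit(rest % radix, radix));
        }
        return digits.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(toRadixString(255, 16)); // prints "ff"
    }
}
```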

getAge

public static VersionInfo getAge(int ch)

Get the "age" of the code point.

The "age" is the Unicode version when the code point was first designated (as a non-character or for Private Use) or assigned a character.

This can be useful to avoid emitting code points to receiving processes that do not accept newer characters.

The data is from the UCD file DerivedAge.txt.

Parameters: ch The code point.

Returns: the Unicode version number

UNKNOWN: ICU 2.6

getCharFromExtendedName

public static int getCharFromExtendedName(String name)

Find a Unicode character by either its name or its extended name and return its code point value. All Unicode names are in uppercase. Extended names are all lowercase except for numbers and are contained within angle brackets.

The names are searched in the following order:

Note: calling any method related to code point names, e.g. get*Name*(), incurs a one-time initialisation cost to construct the name tables.

Parameters: name codepoint name

Returns: code point associated with the name or -1 if the name is not found.

UNKNOWN: ICU 2.6

getCharFromName

public static int getCharFromName(String name)

Find a Unicode code point by its most current Unicode name and return its code point value. All Unicode names are in uppercase.

Note: calling any method related to code point names, e.g. get*Name*(), incurs a one-time initialisation cost to construct the name tables.

Parameters: name most current Unicode character name whose code point is to be returned

Returns: code point or -1 if name is not found

UNKNOWN: ICU 2.1
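
The "-1 if the name is not found" contract can be sketched with the JDK alone. The example below assumes Java 9 or later, where java.lang.Character.codePointOf(String) is available; that JDK method throws IllegalArgumentException for an unknown name, so the hypothetical wrapper charFromName converts the exception into -1. It is a stand-in for illustration, not the ICU4J implementation, and the JDK's matching rules (case-insensitive, whitespace trimmed) differ from those described above.

```java
// Hypothetical wrapper; uses the JDK's name lookup (Java 9+), not UCharacter.
class NameLookupSketch {
    /** Returns the code point for a character name, or -1 if the name is unknown. */
    static int charFromName(String name) {
        try {
            return Character.codePointOf(name);
        } catch (IllegalArgumentException unknownName) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(charFromName("LATIN CAPITAL LETTER A")));
        System.out.println(charFromName("NO SUCH CHARACTER NAME"));
    }
}
```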

getCharFromName1_0

public static int getCharFromName1_0(String name)

Find a Unicode character by its version 1.0 Unicode name and return its code point value. All Unicode names are in uppercase.

Note: calling any method related to code point names, e.g. get*Name*(), incurs a one-time initialisation cost to construct the name tables.

Parameters: name Unicode 1.0 code point name whose code point is to be returned

Returns: code point or -1 if name is not found

UNKNOWN: ICU 2.1

getCodePoint

public static int getCodePoint(char lead, char trail)
Returns the code point corresponding to the two UTF-16 surrogate characters.

Parameters:
lead - the lead char
trail - the trail char

Returns: code point if surrogate characters are valid.

Throws: IllegalArgumentException thrown when argument characters do not form a valid codepoint

UNKNOWN: ICU 2.1
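
The arithmetic behind combining a lead and trail surrogate can be sketched as follows. This is a minimal stand-alone version using the standard UTF-16 formula and JDK surrogate checks; the class name SurrogatePairSketch is hypothetical, and the exception behaviour mirrors the description above rather than the ICU4J source.

```java
// Hypothetical illustration of the UTF-16 surrogate pair formula.
class SurrogatePairSketch {
    /** Combines a lead and trail surrogate into a supplementary code point. */
    static int codePoint(char lead, char trail) {
        if (!Character.isHighSurrogate(lead) || !Character.isLowSurrogate(trail)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        // Each surrogate carries 10 bits; the result is offset by 0x10000.
        return ((lead - 0xD800) << 10) + (trail - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        // U+1F600 is encoded in UTF-16 as D83D DE00.
        System.out.println(Integer.toHexString(codePoint('\uD83D', '\uDE00'))); // prints "1f600"
    }
}
```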

getCodePoint

public static int getCodePoint(char char16)
Returns the code point corresponding to the UTF16 character.

Parameters: char16 the UTF16 character

Returns: code point if argument is a valid character.

Throws: IllegalArgumentException thrown when char16 is not a valid codepoint

UNKNOWN: ICU 2.1

getCombiningClass

public static int getCombiningClass(int ch)
Gets the combining class of the argument code point.

Parameters: ch code point whose combining class is to be retrieved

Returns: the combining class of the code point

UNKNOWN: ICU 2.1

getDirection

public static int getDirection(int ch)
Returns the bidirectional property of a code point. For example, 0x0041 (letter A) has the LEFT_TO_RIGHT directional property.
The result belongs to the interface UCharacterDirection.

Parameters: ch the code point whose direction is to be determined

Returns: direction constant from UCharacterDirection.

UNKNOWN: ICU 2.1

getDirectionality

public static byte getDirectionality(int cp)
Cover the JDK API, for convenience. Returns a byte representing the directionality of the character.
Note: Unlike the JDK, this returns DIRECTIONALITY_LEFT_TO_RIGHT for undefined or out-of-bounds characters.
Note: The return value must be tested using the constants defined in UCharacterEnums.ECharacterDirection, since the values are different from the ones defined by java.lang.Character.

Parameters: cp the code point to check

Returns: the directionality of the code point

See Also: UCharacter

UNKNOWN: ICU 3.0
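
For comparison, this is how the covered JDK method is typically tested against its own constants. The sketch uses java.lang.Character only, with a hypothetical class name DirectionalitySketch; as the note above says, code that calls UCharacter.getDirectionality must compare against the ECharacterDirection constants instead, because the numeric values differ.

```java
// Hypothetical illustration using the JDK method and the JDK's own constants.
class DirectionalitySketch {
    /** True if the code point has a strong right-to-left directionality (R or AL). */
    static boolean isRightToLeft(int codePoint) {
        byte direction = Character.getDirectionality(codePoint);
        return direction == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || direction == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    public static void main(String[] args) {
        System.out.println(isRightToLeft('A'));    // Latin letter
        System.out.println(isRightToLeft(0x05D0)); // HEBREW LETTER ALEF
        System.out.println(isRightToLeft(0x0627)); // ARABIC LETTER ALEF
    }
}
```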

getExtendedName

public static String getExtendedName(int ch)

Retrieves a name for a valid codepoint. Unlike getName(int) and getName1_0(int), this method will return a name even for codepoints that are not assigned a name in UnicodeData.txt.

The names are returned in the following order:

Note: calling any method related to code point names, e.g. get*Name*(), incurs a one-time initialisation cost to construct the name tables.

Parameters: ch the code point for which to get the name

Returns: a name for the argument codepoint

UNKNOWN: ICU 2.6

getExtendedNameIterator

public static ValueIterator getExtendedNameIterator()

Gets an iterator for character names, iterating over codepoints.

This API only gets the iterator for the extended names. For modern, most up-to-date Unicode names use getNameIterator() or for older 1.0 Unicode names use getName1_0Iterator().

Example of use:
 ValueIterator iterator = UCharacter.getExtendedNameIterator();
 ValueIterator.Element element = new ValueIterator.Element();
 while (iterator.next(element)) {
     System.out.println("Codepoint \\u" + 
                        Integer.toHexString(element.codepoint) +
                        " has the name " + (String)element.value);
 }
 

The maximal range over which the name iterator iterates is from UCharacter.MIN_VALUE to UCharacter.MAX_VALUE.

Returns: an iterator

UNKNOWN: ICU 2.6

getHanNumericValue

public static int getHanNumericValue(int ch)
Return numeric value of Han code points.
This returns the value of Han 'numeric' code points, including those for zero, ten, hundred, thousand, ten thousand, and hundred million. This includes both the standard and 'checkwriting' characters, the 'big circle' zero character, and the standard zero character.

Parameters: ch code point to query

Returns: the value if ch is a Han 'numeric' character; otherwise -1.

UNKNOWN: ICU 2.4

getIntPropertyMaxValue

public static int getIntPropertyMaxValue(int type)
Get the maximum value for an integer/binary Unicode property. Can be used together with UCharacter.getIntPropertyMinValue(int) to allocate arrays of com.ibm.icu.text.UnicodeSet or similar.

Examples for min/max values (for Unicode 3.2):

For undefined UProperty constant values, min/max values will be 0/-1.

Parameters: type UProperty selector constant, identifies which binary property to check. Must be UProperty.BINARY_START <= type < UProperty.BINARY_LIMIT or UProperty.INT_START <= type < UProperty.INT_LIMIT.

Returns: the maximum value returned by UCharacter.getIntPropertyValue(int, int) for a Unicode property; <= 0 if the property selector 'type' is out of range.

See Also: UProperty, UCharacter

UNKNOWN: ICU 2.4

getIntPropertyMinValue

public static int getIntPropertyMinValue(int type)
Get the minimum value for an integer/binary Unicode property type. Can be used together with UCharacter.getIntPropertyMaxValue(int) to allocate arrays of com.ibm.icu.text.UnicodeSet or similar.

Parameters: type UProperty selector constant, identifies which binary property to check. Must be UProperty.BINARY_START <= type < UProperty.BINARY_LIMIT or UProperty.INT_START <= type < UProperty.INT_LIMIT.

Returns: the minimum value returned by UCharacter.getIntPropertyValue(int, int) for a Unicode property; 0 if the property selector 'type' is out of range.

See Also: UProperty, UCharacter

UNKNOWN: ICU 2.4

getIntPropertyValue

public static int getIntPropertyValue(int ch, int type)

Gets the property value for a Unicode property type of a code point. Also returns binary and mask property values.

Unicode, especially in version 3.2, defines many more properties than the original set in UnicodeData.txt.

The properties APIs are intended to reflect Unicode properties as defined in the Unicode Character Database (UCD) and Unicode Technical Reports (UTR). For details about the properties see http://www.unicode.org/.

For names of Unicode properties see the UCD file PropertyAliases.txt.

 Sample usage:
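 int c = 0x4E00; // example code point (an assumption; any code point can be used)
 int ea = UCharacter.getIntPropertyValue(c, UProperty.EAST_ASIAN_WIDTH);
 int ideo = UCharacter.getIntPropertyValue(c, UProperty.IDEOGRAPHIC);
 boolean b = (ideo == 1); // binary properties return 0 or 1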
 

Parameters:
ch - code point to test.
type - UProperty selector constant, identifies which binary property to check. Must be UProperty.BINARY_START <= type < UProperty.BINARY_LIMIT or UProperty.INT_START <= type < UProperty.INT_LIMIT or UProperty.MASK_START <= type < UProperty.MASK_LIMIT.

Returns: numeric value that is directly the property value or, for enumerated properties, corresponds to the numeric value of the enumerated constant of the respective property value enumeration type (cast to enum type if necessary). Returns 0 or 1 (for false / true) for binary Unicode properties. Returns a bit-mask for mask properties. Returns 0 if 'type' is out of bounds or if the Unicode version does not have data for the property at all, or not for this code point.

See Also: UProperty, UCharacter

UNKNOWN: ICU 2.4

getISOComment

public static String getISOComment(int ch)
Get the ISO 10646 comment for a character. The ISO 10646 comment is an informative field in the Unicode Character Database (UnicodeData.txt field 11) and is from the ISO 10646 names list.

Parameters: ch The code point for which to get the ISO comment. It must be 0 <= ch <= 0x10ffff.

Returns: The ISO comment, or null if there is no comment for this character.

UNKNOWN: ICU 2.4

getMirror

public static int getMirror(int ch)
Maps the specified code point to a "mirror-image" code point. For code points with the "mirrored" property, implementations sometimes need a "poor man's" mapping to another code point such that the default glyph may serve as the mirror-image of the default glyph of the specified code point.
This is useful for text conversion to and from codepages with visual order, and for displays without glyph selection capabilities.

Parameters: ch code point whose mirror is to be retrieved

Returns: another code point that may serve as a mirror-image substitute, or ch itself if there is no such mapping or ch does not have the "mirrored" property

UNKNOWN: ICU 2.1

getName

public static String getName(int ch)
Retrieve the most current Unicode name of the argument code point, or null if the character is unassigned, is outside the range UCharacter.MIN_VALUE to UCharacter.MAX_VALUE, or does not have a name.
Note: calling any method related to code point names, e.g. get*Name*(), incurs a one-time initialisation cost to construct the name tables.

Parameters: ch the code point for which to get the name

Returns: most current Unicode name

UNKNOWN: ICU 2.1
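
A name lookup with null handling might look like the sketch below. It uses the JDK's java.lang.Character.getName(int) as a stand-in, which is an assumption made so the example runs without ICU4J; the JDK method throws IllegalArgumentException for invalid code points and synthesises block-based names for assigned characters that lack one, so its results are not identical to UCharacter.getName. The class name CharacterNameSketch and the "<unassigned>" placeholder are hypothetical.

```java
// Hypothetical illustration; uses java.lang.Character.getName, not UCharacter.getName.
class CharacterNameSketch {
    /** Formats a code point as "U+XXXX NAME", tolerating a null name. */
    static String describe(int codePoint) {
        String name = Character.getName(codePoint); // null when the code point is unassigned
        return String.format("U+%04X %s", codePoint, name == null ? "<unassigned>" : name);
    }

    public static void main(String[] args) {
        System.out.println(describe(0x0041)); // prints "U+0041 LATIN CAPITAL LETTER A"
    }
}
```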

getName

public static String getName(String s, String separator)

Deprecated: This API is ICU internal only.

Gets the names for each of the characters in a string.

Parameters:
s - string to format
separator - string to go between names

Returns: string of names

UNKNOWN:

getName1_0

public static String getName1_0(int ch)
Retrieve the earlier version 1.0 Unicode name of the argument code point, or null if the character is unassigned, is outside the range UCharacter.MIN_VALUE to UCharacter.MAX_VALUE, or does not have a name.
Note: calling any method related to code point names, e.g. get*Name*(), incurs a one-time initialisation cost to construct the name tables.

Parameters: ch the code point for which to get the name

Returns: version 1.0 Unicode name

UNKNOWN: ICU 2.1

getName1_0Iterator

public static ValueIterator getName1_0Iterator()

Gets an iterator for character names, iterating over codepoints.

This API only gets the iterator for the older 1.0 Unicode names. For modern, most up-to-date Unicode names use getNameIterator() or for extended names use getExtendedNameIterator().

Example of use:
 ValueIterator iterator = UCharacter.getName1_0Iterator();
 ValueIterator.Element element = new ValueIterator.Element();
 while (iterator.next(element)) {
     System.out.println("Codepoint \\u" + 
                        Integer.toHexString(element.codepoint) +
                        " has the name " + (String)element.value);
 }
 

The maximal range over which the name iterator iterates is from UCharacter.MIN_VALUE to UCharacter.MAX_VALUE.

Returns: an iterator

UNKNOWN: ICU 2.6

getNameIterator

public static ValueIterator getNameIterator()

Gets an iterator for character names, iterating over codepoints.

This API only gets the iterator for the modern, most up-to-date Unicode names. For older 1.0 Unicode names use getName1_0Iterator() or for extended names use getExtendedNameIterator().

Example of use:
 ValueIterator iterator = UCharacter.getNameIterator();
 ValueIterator.Element element = new ValueIterator.Element();
 while (iterator.next(element)) {
     System.out.println("Codepoint \\u" + 
                        Integer.toHexString(element.codepoint) +
                        " has the name " + (String)element.value);
 }
 

The maximal range which the name iterator iterates is from UCharacter.MIN_VALUE to UCharacter.MAX_VALUE.

Returns: an iterator

UNKNOWN: ICU 2.6

getNumericValue

public static int getNumericValue(int ch)
Returns the numeric value of the code point as a nonnegative integer.
If the code point does not have a numeric value, then -1 is returned.
If the code point has a numeric value that cannot be represented as a nonnegative integer (for example, a fractional value), then -2 is returned.

Parameters: ch the code point to query

Returns: the numeric value of the code point, or -1 if it has no numeric value, or -2 if it has a numeric value that cannot be represented as a nonnegative integer

UNKNOWN: ICU 2.1
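
The three-way return convention (a value, -1, or -2) can be handled as in the sketch below. It calls java.lang.Character.getNumericValue, which documents the same -1/-2 convention, as a stand-in for the UCharacter method; the class name NumericValueSketch and the result strings are hypothetical. Note that the JDK method assigns 10 through 35 to the Latin letters, so a punctuation mark is used here as the "no numeric value" case.

```java
// Hypothetical illustration; uses java.lang.Character.getNumericValue.
class NumericValueSketch {
    /** Describes the numeric value of a code point using the -1 / -2 convention. */
    static String classify(int codePoint) {
        int value = Character.getNumericValue(codePoint);
        if (value == -1) {
            return "none";
        }
        if (value == -2) {
            return "not a nonnegative integer";
        }
        return Integer.toString(value);
    }

    public static void main(String[] args) {
        System.out.println(classify('7'));    // a decimal digit
        System.out.println(classify(0x00BD)); // VULGAR FRACTION ONE HALF
        System.out.println(classify('!'));    // punctuation, no numeric value
    }
}
```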

getPropertyEnum

public static int getPropertyEnum(String propertyAlias)
Return the UProperty selector for a given property name, as specified in the Unicode database file PropertyAliases.txt. Short, long, and any other variants are recognized. In addition, this function maps the synthetic names "gcm" / "General_Category_Mask" to the property UProperty.GENERAL_CATEGORY_MASK. These names are not in PropertyAliases.txt.

Parameters: propertyAlias the property name to be matched. The name is compared using "loose matching" as described in PropertyAliases.txt.

Returns: a UProperty enum.

Throws: IllegalArgumentException thrown if propertyAlias is not recognized.

See Also: UProperty

UNKNOWN: ICU 2.4

getPropertyName

public static String getPropertyName(int property, int nameChoice)
Return the Unicode name for a given property, as given in the Unicode database file PropertyAliases.txt. Most properties have more than one name. The nameChoice determines which one is returned. In addition, this function maps the property UProperty.GENERAL_CATEGORY_MASK to the synthetic names "gcm" / "General_Category_Mask". These names are not in PropertyAliases.txt.

Parameters:
property - UProperty selector.
nameChoice - UProperty.NameChoice selector for which name to get. All properties have a long name. Most have a short name, but some do not. Unicode allows for additional names; if present these will be returned by UProperty.NameChoice.LONG + i, where i=1, 2,...

Returns: a name, or null if Unicode explicitly defines no name ("n/a") for a given property/nameChoice. If a given nameChoice throws an exception, then all larger values of nameChoice will throw an exception. If null is returned for a given nameChoice, then other nameChoice values may return non-null results.

Throws: IllegalArgumentException thrown if property or nameChoice are invalid.

See Also: UProperty NameChoice

UNKNOWN: ICU 2.4

getPropertyValueEnum

public static int getPropertyValueEnum(int property, String valueAlias)
Return the property value integer for a given value name, as specified in the Unicode database file PropertyValueAliases.txt. Short, long, and any other variants are recognized. Note: Some of the names in PropertyValueAliases.txt will only be recognized with UProperty.GENERAL_CATEGORY_MASK, not UProperty.GENERAL_CATEGORY. These include: "C" / "Other", "L" / "Letter", "LC" / "Cased_Letter", "M" / "Mark", "N" / "Number", "P" / "Punctuation", "S" / "Symbol", and "Z" / "Separator".

Parameters:
property - UProperty selector constant. UProperty.INT_START <= property < UProperty.INT_LIMIT or UProperty.BINARY_START <= property < UProperty.BINARY_LIMIT or UProperty.MASK_START <= property < UProperty.MASK_LIMIT. Only these properties can be enumerated.
valueAlias - the value name to be matched. The name is compared using "loose matching" as described in PropertyValueAliases.txt.

Returns: a value integer. Note: UProperty.GENERAL_CATEGORY_MASK values are mask values produced by left-shifting 1 by UCharacter.getType(). This allows grouped categories such as [:L:] to be represented.

Throws: IllegalArgumentException if property is not a valid UProperty selector

See Also: UProperty

UNKNOWN: ICU 2.4

getPropertyValueName

public static String getPropertyValueName(int property, int value, int nameChoice)
Return the Unicode name for a given property value, as given in the Unicode database file PropertyValueAliases.txt. Most values have more than one name. The nameChoice determines which one is returned. Note: Some of the names in PropertyValueAliases.txt can only be retrieved using UProperty.GENERAL_CATEGORY_MASK, not UProperty.GENERAL_CATEGORY. These include: "C" / "Other", "L" / "Letter", "LC" / "Cased_Letter", "M" / "Mark", "N" / "Number", "P" / "Punctuation", "S" / "Symbol", and "Z" / "Separator".

Parameters:
property - UProperty selector constant. UProperty.INT_START <= property < UProperty.INT_LIMIT or UProperty.BINARY_START <= property < UProperty.BINARY_LIMIT or UProperty.MASK_START <= property < UProperty.MASK_LIMIT. If out of range, null is returned.
value - selector for a value for the given property. In general, valid values range from 0 up to some maximum. There are a few exceptions: (1) UProperty.BLOCK values begin at the non-zero value BASIC_LATIN.getID(). (2) UProperty.CANONICAL_COMBINING_CLASS values are not contiguous and range from 0..240. (3) UProperty.GENERAL_CATEGORY_MASK values are mask values produced by left-shifting 1 by UCharacter.getType(). This allows grouped categories such as [:L:] to be represented. Mask values are non-contiguous.
nameChoice - UProperty.NameChoice selector for which name to get. All values have a long name. Most have a short name, but some do not. Unicode allows for additional names; if present these will be returned by UProperty.NameChoice.LONG + i, where i=1, 2,...

Returns: a name, or null if Unicode explicitly defines no name ("n/a") for a given property/value/nameChoice. If a given nameChoice throws an exception, then all larger values of nameChoice will throw an exception. If null is returned for a given nameChoice, then other nameChoice values may return non-null results.

Throws: IllegalArgumentException thrown if property, value, or nameChoice are invalid.

See Also: UProperty NameChoice

UNKNOWN: ICU 2.4

getStringPropertyValue

public static String getStringPropertyValue(int propertyEnum, int codepoint, int nameChoice)

Deprecated: This API is ICU internal only.

Returns a string version of the property value.

Parameters: propertyEnum codepoint nameChoice

Returns: value as string

UNKNOWN:

getType

public static int getType(int ch)
Returns a value indicating a code point's Unicode category. Up-to-date Unicode implementation of java.lang.Character.getType() except for the above mentioned code points that had their category changed.
Return results are constants from the interface UCharacterCategory
NOTE: the UCharacterCategory values are not compatible with those returned by java.lang.Character.getType. UCharacterCategory values match the ones used in ICU4C, while java.lang.Character type values, though similar, skip the value 17.

Parameters: ch code point whose type is to be determined

Returns: category which is a value of UCharacterCategory

UNKNOWN: ICU 2.1

getTypeIterator

public static RangeValueIterator getTypeIterator()

Gets an iterator for character types, iterating over codepoints.

Example of use:
 RangeValueIterator iterator = UCharacter.getTypeIterator();
 RangeValueIterator.Element element = new RangeValueIterator.Element();
 while (iterator.next(element)) {
     System.out.println("Codepoint \\u" + 
                        Integer.toHexString(element.start) + 
                        " to codepoint \\u" +
                        Integer.toHexString(element.limit - 1) + 
                        " has the character type " + 
                        element.value);
 }
 

Returns: an iterator

UNKNOWN: ICU 2.6
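
The idea of iterating over ranges of code points that share one character type, with an exclusive limit as in the example above, can be sketched without ICU4J. The code below groups a small interval into runs using java.lang.Character.getType, whose numeric values are not the UCharacterCategory values; the class name TypeRangeSketch and the {start, limit, type} triple layout are hypothetical choices for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of range-based type iteration using JDK categories.
class TypeRangeSketch {
    /** Returns {start, limit, type} triples covering [from, to); limit is exclusive. */
    static List<int[]> typeRanges(int from, int to) {
        List<int[]> ranges = new ArrayList<>();
        int start = from;
        while (start < to) {
            int type = Character.getType(start);
            int limit = start + 1;
            // Extend the run while the following code point has the same type.
            while (limit < to && Character.getType(limit) == type) {
                limit++;
            }
            ranges.add(new int[] {start, limit, type});
            start = limit;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (int[] range : typeRanges(0x30, 0x5B)) {
            System.out.println("Codepoint \\u" + Integer.toHexString(range[0])
                    + " to codepoint \\u" + Integer.toHexString(range[1] - 1)
                    + " has the character type " + range[2]);
        }
    }
}
```

Reporting runs instead of single code points keeps the output compact, since most code points sit in long stretches with the same category.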

getUnicodeNumericValue

public static double getUnicodeNumericValue(int ch)

Get the numeric value for a Unicode code point as defined in the Unicode Character Database.

A "double" return type is necessary because some numeric values are fractions, negative, or too large for int.

For characters without any numeric values in the Unicode Character Database, this function will return NO_NUMERIC_VALUE.

API Change: In release 2.2 and prior, this API had a return type of int and returned -1 when the argument ch did not have a corresponding numeric value. This was changed to stay in sync with ICU4C.

This corresponds to the ICU4C function u_getNumericValue.

Parameters: ch Code point to get the numeric value for.

Returns: numeric value of ch, or NO_NUMERIC_VALUE if none is defined.

UNKNOWN: ICU 2.4

getUnicodeVersion

public static VersionInfo getUnicodeVersion()
Gets the version of Unicode data used.

Returns: the Unicode version number used

UNKNOWN: ICU 2.1

hasBinaryProperty

public static boolean hasBinaryProperty(int ch, int property)

Check a binary Unicode property for a code point.

Unicode, especially in version 3.2, defines many more properties than the original set in UnicodeData.txt.

This API is intended to reflect Unicode properties as defined in the Unicode Character Database (UCD) and Unicode Technical Reports (UTR).

For details about the properties see http://www.unicode.org/.

For names of Unicode properties see the UCD file PropertyAliases.txt.

This API does not check the validity of the codepoint.

Important: If ICU is built with UCD files from Unicode versions below 3.2, then properties marked with "new" are not or not fully available.

Parameters:
ch - code point to test.
property - selector constant from com.ibm.icu.lang.UProperty, identifies which binary property to check.

Returns: true or false according to the binary Unicode property value for ch. Also false if property is out of bounds or if the Unicode version does not have data for the property at all, or not for this code point.

See Also: UProperty

UNKNOWN: ICU 2.6

isBaseForm

public static boolean isBaseForm(int ch)
Determines whether the specified code point is of base form. A code point of base form does not graphically combine with preceding characters, and is neither a control nor a format character.

Parameters: ch code point to be determined if it is of base form

Returns: true if the code point is of base form

UNKNOWN: ICU 2.1

isBMP

public static boolean isBMP(int ch)
Determines if the code point is in the Basic Multilingual Plane (BMP).

Parameters: ch the code point to test

Returns: true if code point is not a supplementary character

UNKNOWN: ICU 2.1

isDefined

public static boolean isDefined(int ch)
Determines if a code point has a defined meaning in the up-to-date Unicode standard. E.g. a code point that lies in allocated space but has not yet been assigned a meaning is not defined.
Up-to-date Unicode implementation of java.lang.Character.isDefined()

Parameters: ch code point to be determined if it is defined in the most current version of Unicode

Returns: true if this code point is defined in Unicode

UNKNOWN: ICU 2.1

isDigit

public static boolean isDigit(int ch)
Determines if a code point is a Java digit.
This method observes the semantics of java.lang.Character.isDigit(). It returns true for decimal digits only.
Semantic Change: In release 1.3.1 and prior, this treated numeric letters and other numbers as digits. This has been changed to conform to the java semantics.

Parameters: ch code point to query

Returns: true if this code point is a digit

UNKNOWN: ICU 2.1

isHighSurrogate

public static boolean isHighSurrogate(char ch)
Cover the JDK 1.5 API, for convenience.

Parameters: ch the char to check

Returns: true if ch is a high (lead) surrogate

UNKNOWN: ICU 3.0

isIdentifierIgnorable

public static boolean isIdentifierIgnorable(int ch)
Determines if the specified code point should be regarded as an ignorable character in a Unicode identifier. A character is ignorable in the Unicode standard if it is of the type Cf, Formatting code.
Up-to-date Unicode implementation of java.lang.Character.isIdentifierIgnorable().
See UTR #8.

Parameters: ch code point to be determined if it can be ignored in a Unicode identifier.

Returns: true if the code point is ignorable

UNKNOWN: ICU 2.1

isISOControl

public static boolean isISOControl(int ch)
Determines if the specified code point is an ISO control character. A code point is considered to be an ISO control character if it is in the range \u0000 through \u001F or in the range \u007F through \u009F.
Up-to-date Unicode implementation of java.lang.Character.isISOControl()

Parameters: ch code point to determine if it is an ISO control character

Returns: true if the code point is an ISO control character

UNKNOWN: ICU 2.1
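
The two ranges stated above translate directly into a range check. The sketch below is a stand-alone reading of that definition, with a hypothetical class name IsoControlSketch, and main compares it against java.lang.Character.isISOControl, which documents the same ranges.

```java
// Hypothetical illustration of the ISO control ranges described above.
class IsoControlSketch {
    /** True for U+0000..U+001F and U+007F..U+009F. */
    static boolean isIsoControl(int codePoint) {
        return (codePoint >= 0x0000 && codePoint <= 0x001F)
                || (codePoint >= 0x007F && codePoint <= 0x009F);
    }

    public static void main(String[] args) {
        boolean agrees = true;
        for (int cp = 0; cp <= 0x00FF; cp++) {
            agrees &= (isIsoControl(cp) == Character.isISOControl(cp));
        }
        System.out.println("matches java.lang.Character for U+0000..U+00FF: " + agrees);
    }
}
```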

isJavaIdentifierPart

public static boolean isJavaIdentifierPart(int cp)
Compatibility override of Java method, delegates to java.lang.Character.isJavaIdentifierPart.

Parameters: cp the code point

Returns: true if the code point can continue a Java identifier.

UNKNOWN: ICU 3.4 This API might change or be removed in a future release.

isJavaIdentifierStart

public static boolean isJavaIdentifierStart(int cp)
Compatibility override of Java method, delegates to java.lang.Character.isJavaIdentifierStart.

Parameters: cp the code point

Returns: true if the code point can start a Java identifier.

UNKNOWN: ICU 3.4 This API might change or be removed in a future release.

isJavaLetter

public static boolean isJavaLetter(int cp)

Deprecated: ICU 3.4 (Java)

Compatibility override of Java deprecated method. This method will always remain deprecated. Delegates to java.lang.Character.isJavaIdentifierStart.

Parameters: cp the code point

Returns: true if the code point can start a Java identifier.

isJavaLetterOrDigit

public static boolean isJavaLetterOrDigit(int cp)

Deprecated: ICU 3.4 (Java)

Compatibility override of Java deprecated method. This method will always remain deprecated. Delegates to java.lang.Character.isJavaIdentifierPart.

Parameters: cp the code point

Returns: true if the code point can continue a Java identifier.

isLegal

public static boolean isLegal(int ch)
A code point is illegal if and only if:

Note: legal does not mean that it is assigned in this version of Unicode.

Parameters: ch code point to determine if it is a legal code point by itself

Returns: true if and only if legal.

UNKNOWN: ICU 2.1

isLegal

public static boolean isLegal(String str)
A string is legal if and only if all its code points are legal. A code point is illegal if and only if:

Note: legal does not mean that it is assigned in this version of Unicode.

Parameters: str the string containing the code points to examine

Returns: true if and only if legal.

UNKNOWN: ICU 2.1

isLetter

public static boolean isLetter(int ch)
Determines if the specified code point is a letter. Up-to-date Unicode implementation of java.lang.Character.isLetter()

Parameters: ch code point to determine if it is a letter

Returns: true if code point is a letter

UNKNOWN: ICU 2.1

isLetterOrDigit

public static boolean isLetterOrDigit(int ch)
Determines if the specified code point is a letter or digit. Note: this method, unlike java.lang.Character, does not regard the ASCII characters 'A' - 'Z' and 'a' - 'z' as digits.

Parameters: ch code point to determine if it is a letter or a digit

Returns: true if code point is a letter or a digit

UNKNOWN: ICU 2.1

isLowerCase

public static boolean isLowerCase(int ch)
Determines if the specified code point is a lowercase character. UnicodeData only contains case mappings for code points where they are one-to-one mappings; it also omits information about context-sensitive case mappings.
For more information about Unicode case mapping, please refer to Unicode Technical Report #21.
Up-to-date Unicode implementation of java.lang.Character.isLowerCase()

Parameters: ch code point to determine if it is in lowercase

Returns: true if code point is a lowercase character

UNKNOWN: ICU 2.1

isLowSurrogate

public static boolean isLowSurrogate(char ch)
Cover the JDK 1.5 API, for convenience.

Parameters: ch the char to check

Returns: true if ch is a low (trail) surrogate

UNKNOWN: ICU 3.0

isMirrored

public static boolean isMirrored(int ch)
Determines whether the code point has the "mirrored" property. This property is set for characters that are commonly used in Right-To-Left contexts and need to be displayed with a "mirrored" glyph.

Parameters: ch code point whose mirror is to be determined

Returns: true if the code point has the "mirrored" property

UNKNOWN: ICU 2.1

isPrintable

public static boolean isPrintable(int ch)
Determines whether the specified code point is a printable character according to the Unicode standard.

Parameters: ch code point to be determined if it is printable

Returns: true if the code point is a printable character

UNKNOWN: ICU 2.1

isSpace

public static boolean isSpace(int ch)

Deprecated: ICU 3.4 (Java)

Compatibility override of Java deprecated method. This method will always remain deprecated. Delegates to java.lang.Character.isSpace.

Parameters: ch the code point

Returns: true if the code point is a space character as defined by java.lang.Character.isSpace.

isSpaceChar

public static boolean isSpaceChar(int ch)
Determines if the specified code point is a Unicode-specified space character, i.e. if the code point is in category Zs, Zl, or Zp. Up-to-date Unicode implementation of java.lang.Character.isSpaceChar().

Parameters: ch code point to determine if it is a space

Returns: true if the specified code point is a space character

UNKNOWN: ICU 2.1
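
The category test described above (Zs, Zl, or Zp) can be written out explicitly. The sketch below uses the JDK's java.lang.Character.getType and its category constants as a stand-in for the ICU4J equivalents; the class name SpaceCharSketch is hypothetical.

```java
// Hypothetical illustration of the Zs / Zl / Zp category test using JDK constants.
class SpaceCharSketch {
    /** True if the code point's general category is Zs, Zl, or Zp. */
    static boolean isSpaceChar(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.SPACE_SEPARATOR:     // Zs
            case Character.LINE_SEPARATOR:      // Zl
            case Character.PARAGRAPH_SEPARATOR: // Zp
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isSpaceChar(' '));    // SPACE, category Zs
        System.out.println(isSpaceChar(0x2028)); // LINE SEPARATOR, category Zl
        System.out.println(isSpaceChar('\t'));   // TAB is a control character (Cc)
    }
}
```

A tab or line feed is a control character rather than a separator, so it does not count as a space character under this definition.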

isSupplementary

public static boolean isSupplementary(int ch)
Determines if the code point is a supplementary character. A code point is a supplementary character if and only if it is greater than or equal to SUPPLEMENTARY_MIN_VALUE.

Parameters: ch code point to be determined if it is in the supplementary plane

Returns: true if code point is a supplementary character

UNKNOWN: ICU 2.1
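
The split between BMP and supplementary code points is a pair of range checks. The sketch below assumes the standard Unicode boundaries (BMP is U+0000..U+FFFF, supplementary is U+10000..U+10FFFF) and takes SUPPLEMENTARY_MIN_VALUE to be 0x10000; the class and constant names are hypothetical, and the JDK methods are used only for cross-checking.

```java
// Hypothetical illustration of the BMP / supplementary boundaries.
class PlaneSketch {
    static final int SUPPLEMENTARY_MIN = 0x10000; // assumed value of SUPPLEMENTARY_MIN_VALUE
    static final int MAX_CODE_POINT = 0x10FFFF;   // highest Unicode code point

    /** True for code points in the Basic Multilingual Plane. */
    static boolean isBmp(int codePoint) {
        return codePoint >= 0 && codePoint < SUPPLEMENTARY_MIN;
    }

    /** True for code points above the BMP and within the Unicode range. */
    static boolean isSupplementary(int codePoint) {
        return codePoint >= SUPPLEMENTARY_MIN && codePoint <= MAX_CODE_POINT;
    }

    public static void main(String[] args) {
        System.out.println(isBmp(0xFFFF) + " " + isSupplementary(0xFFFF));
        System.out.println(isBmp(0x10000) + " " + isSupplementary(0x10000));
    }
}
```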

isSupplementaryCodePoint

public static final boolean isSupplementaryCodePoint(int cp)
Cover the JDK 1.5 API, for convenience.

Parameters: cp the code point to check

Returns: true if cp is a supplementary code point

UNKNOWN: ICU 3.0

isSurrogatePair

public static final boolean isSurrogatePair(char high, char low)
Cover the JDK 1.5 API, for convenience. Return true if the chars form a valid surrogate pair.

Parameters:
high - the high (lead) char
low - the low (trail) char

Returns: true if high, low form a surrogate pair

UNKNOWN: ICU 3.0
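
A typical use of a surrogate pair test is stepping through a String one code point at a time. The sketch below counts code points with java.lang.Character.isSurrogatePair, the JDK method this API covers; the class name CodePointWalkSketch is hypothetical.

```java
// Hypothetical illustration; counts code points by detecting surrogate pairs.
class CodePointWalkSketch {
    /** Counts code points, treating each valid surrogate pair as one. */
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (i + 1 < s.length() && Character.isSurrogatePair(s.charAt(i), s.charAt(i + 1))) {
                i++; // skip the trail surrogate of a valid pair
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println(countCodePoints(text)); // prints 3
    }
}
```

An unpaired surrogate is counted as one code point, which matches String.codePointCount.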

isTitleCase

public static boolean isTitleCase(int ch)
Determines if the specified code point is a titlecase character. UnicodeData only contains case mappings for code points where they are one-to-one mappings; it also omits information about context-sensitive case mappings.
For more information about Unicode case mapping please refer to the Technical report #21.
Up-to-date Unicode implementation of java.lang.Character.isTitleCase().

Parameters: ch code point to determine if it is in title case

Returns: true if the specified code point is a titlecase character

UNKNOWN: ICU 2.1

isUAlphabetic

public static boolean isUAlphabetic(int ch)

Check if a code point has the Alphabetic Unicode property.

Same as UCharacter.hasBinaryProperty(ch, UProperty.ALPHABETIC).

Different from UCharacter.isLetter(ch)!

Parameters: ch codepoint to be tested

UNKNOWN: ICU 2.6
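To illustrate why the Alphabetic property differs from a letter test, the sketch below uses the JDK's Character.isAlphabetic and Character.isLetter as stand-ins so that it runs without icu4j.jar. It is an assumption that the UCharacter methods agree with the JDK for this sample; the helper name is hypothetical.

```java
final class AlphabeticDemo {
    // True for code points that carry the Alphabetic property
    // but are not in a letter category (L*).
    static boolean alphabeticButNotLetter(int cp) {
        return Character.isAlphabetic(cp) && !Character.isLetter(cp);
    }

    public static void main(String[] args) {
        // U+2160 ROMAN NUMERAL ONE has category Nl (letter number).
        System.out.println(alphabeticButNotLetter(0x2160));
        System.out.println(alphabeticButNotLetter('A'));
    }
}
```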

isULowercase

public static boolean isULowercase(int ch)

Check if a code point has the Lowercase Unicode property.

Same as UCharacter.hasBinaryProperty(ch, UProperty.LOWERCASE).

This is different from UCharacter.isLowerCase(ch)!

Parameters: ch codepoint to be tested

UNKNOWN: ICU 2.6

isUnicodeIdentifierPart

public static boolean isUnicodeIdentifierPart(int ch)
Determines if the specified code point may be any part of a Unicode identifier other than the starting character. A code point may be part of a Unicode identifier if and only if it is one of the following:
Up-to-date Unicode implementation of java.lang.Character.isUnicodeIdentifierPart().
See UTR #8.

Parameters: ch code point to determine if it can be part of a Unicode identifier

Returns: true if the code point may appear in a Unicode identifier after the first character

UNKNOWN: ICU 2.1

isUnicodeIdentifierStart

public static boolean isUnicodeIdentifierStart(int ch)
Determines if the specified code point is permissible as the first character in a Unicode identifier. A code point may start a Unicode identifier if it is of type either
Up-to-date Unicode implementation of java.lang.Character.isUnicodeIdentifierStart().
See UTR #8.

Parameters: ch code point to determine if it can start a Unicode identifier

Returns: true if the code point may be the first character of a Unicode identifier

UNKNOWN: ICU 2.1

isUpperCase

public static boolean isUpperCase(int ch)
Determines if the specified code point is an uppercase character. UnicodeData only contains case mappings for code points where they are one-to-one mappings; it also omits information about context-sensitive case mappings.
For language-specific case conversion behavior, such as the conversion of dotless i and dotted I in Turkish or of final sigma in Greek, use toUpperCase(locale, str).
For more information about Unicode case mapping please refer to the Technical report #21.
Up-to-date Unicode implementation of java.lang.Character.isUpperCase().

Parameters: ch code point to determine if it is in uppercase

Returns: true if the code point is an uppercase character

UNKNOWN: ICU 2.1

isUUppercase

public static boolean isUUppercase(int ch)

Check if a code point has the Uppercase Unicode property.

Same as UCharacter.hasBinaryProperty(ch, UProperty.UPPERCASE).

This is different from UCharacter.isUpperCase(ch)!

Parameters: ch codepoint to be tested

UNKNOWN: ICU 2.6

isUWhiteSpace

public static boolean isUWhiteSpace(int ch)

Check if a code point has the White_Space Unicode property.

Same as UCharacter.hasBinaryProperty(ch, UProperty.WHITE_SPACE).

This is different from both UCharacter.isSpace(ch) and UCharacter.isWhitespace(ch)!

Parameters: ch codepoint to be tested

UNKNOWN: ICU 2.6

isValidCodePoint

public static final boolean isValidCodePoint(int cp)
Cover the JDK 1.5 API, for convenience.

Parameters: cp the code point to check

Returns: true if cp is a valid code point

UNKNOWN: ICU 3.0

isWhitespace

public static boolean isWhitespace(int ch)
Determines if the specified code point is a white space character. A code point is considered to be a whitespace character if and only if it satisfies one of the following criteria:
This API tries to stay in sync with the semantics of the Java API java.lang.Character.isWhitespace().

Parameters: ch code point to determine if it is a white space

Returns: true if the specified code point is a white space character

UNKNOWN: ICU 2.1
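The sketch below contrasts the two notions of "space" using the JDK methods whose semantics isWhitespace and isSpaceChar follow. It uses java.lang.Character so that it runs without icu4j.jar; whether UCharacter returns the same values for every code point is an assumption, and the helper name is hypothetical.

```java
final class WhitespaceDemo {
    // Formats both classifications for one code point.
    static String describe(int cp) {
        return String.format("U+%04X isWhitespace=%b isSpaceChar=%b",
            cp, Character.isWhitespace(cp), Character.isSpaceChar(cp));
    }

    public static void main(String[] args) {
        System.out.println(describe(0x0009)); // TAB: a control, treated as whitespace
        System.out.println(describe(0x0020)); // SPACE: both
        System.out.println(describe(0x00A0)); // NO-BREAK SPACE: category Zs only
    }
}
```

In the JDK, the tab character is whitespace but not a space character, while the no-break space is a space character but not whitespace.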

offsetByCodePoints

public static int offsetByCodePoints(CharSequence text, int index, int codePointOffset)
Cover the JDK API, for convenience. Adjust the char index by a code point offset.

Parameters: text the characters to check; index the index to adjust; codePointOffset the number of code points by which to offset the index

Returns: the adjusted index

UNKNOWN: ICU 3.0

offsetByCodePoints

public static int offsetByCodePoints(char[] text, int start, int count, int index, int codePointOffset)
Cover the JDK API, for convenience. Adjust the char index by a code point offset.

Parameters: text the characters to check; start the start of the range to check; count the length of the range to check; index the index to adjust; codePointOffset the number of code points by which to offset the index

Returns: the adjusted index

UNKNOWN: ICU 3.0
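A short sketch of adjusting a char index by a number of code points. It calls the java.lang.Character methods that these overloads cover, so it runs without icu4j.jar; the helper name is hypothetical.

```java
final class OffsetDemo {
    // Returns the text that remains after skipping n code points from the start.
    static String skipCodePoints(String s, int n) {
        int idx = Character.offsetByCodePoints(s, 0, n);
        return s.substring(idx);
    }

    public static void main(String[] args) {
        // chars: 'a' at 0, lead surrogate at 1, trail surrogate at 2, 'b' at 3
        String s = "a\uD83D\uDE00b";

        // Two code points forward from index 0 spans three chars.
        System.out.println(Character.offsetByCodePoints(s, 0, 2));

        // A negative offset moves backward: one code point before 'b'.
        System.out.println(Character.offsetByCodePoints(s, 3, -1));

        // The char[] overload takes an explicit (start, count) range.
        char[] chars = s.toCharArray();
        System.out.println(Character.offsetByCodePoints(chars, 0, chars.length, 0, 2));
    }
}
```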

toChars

public static final int toChars(int cp, char[] dst, int dstIndex)
Cover the JDK 1.5 API, for convenience. Writes the chars representing the code point into the destination at the given index.

Parameters: cp the code point to convert; dst the destination array into which to put the char(s) representing the code point; dstIndex the index at which to put the first (or only) char

Returns: the count of the number of chars written (1 or 2)

Throws: IllegalArgumentException if cp is not a valid code point

UNKNOWN: ICU 3.0

toChars

public static final char[] toChars(int cp)
Cover the JDK 1.5 API, for convenience. Returns a char array representing the code point.

Parameters: cp the code point to convert

Returns: an array containing the char(s) representing the code point

Throws: IllegalArgumentException if cp is not a valid code point

UNKNOWN: ICU 3.0
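The following sketch shows both toChars forms in use. It calls java.lang.Character.toChars, the JDK API these methods cover, so that it runs without icu4j.jar; the helper name is hypothetical.

```java
import java.util.Arrays;

final class ToCharsDemo {
    // Encodes a sequence of code points into UTF-16 chars.
    static char[] encodeAll(int[] codePoints) {
        char[] buf = new char[codePoints.length * 2]; // worst case: two chars each
        int len = 0;
        for (int cp : codePoints) {
            // Returns 1 or 2, the number of chars written at buf[len].
            len += Character.toChars(cp, buf, len);
        }
        return Arrays.copyOf(buf, len);
    }

    public static void main(String[] args) {
        char[] out = encodeAll(new int[] {0x41, 0x1F600});
        System.out.println(out.length);

        try {
            Character.toChars(0x110000); // beyond the last code point
        } catch (IllegalArgumentException e) {
            System.out.println("invalid code point rejected");
        }
    }
}
```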

toCodePoint

public static final int toCodePoint(char high, char low)
Cover the JDK 1.5 API, for convenience. Return the code point represented by the characters. This does not check the surrogate pair for validity.

Parameters: high the high (lead) surrogate; low the low (trail) surrogate

Returns: the code point formed by the surrogate pair

UNKNOWN: ICU 3.0
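Because the pair is not validated, callers may want to check it first. The sketch below shows the standard UTF-16 arithmetic next to a guarded call. It uses java.lang.Character, which this method covers, so that it runs without icu4j.jar; both helper names are hypothetical.

```java
final class ToCodePointDemo {
    // Standard UTF-16 decoding arithmetic for a (lead, trail) pair.
    static int combine(char high, char low) {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    // Validates the pair before converting, since toCodePoint itself does not.
    static int safeToCodePoint(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        System.out.printf("U+%X%n", combine('\uD83D', '\uDE00'));
        System.out.printf("U+%X%n", safeToCodePoint('\uD83D', '\uDE00'));
    }
}
```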

toLowerCase

public static int toLowerCase(int ch)
The given code point is mapped to its lowercase equivalent; if the code point has no lowercase equivalent, the code point itself is returned. Up-to-date Unicode implementation of java.lang.Character.toLowerCase()

This function only returns the simple, single-code point case mapping. Full case mappings should be used whenever possible because they produce better results by working on whole strings. They take into account the string context and the language and can map to a result string with a different length as appropriate. Full case mappings are applied by the case mapping functions that take String parameters rather than code points (int). See also the User Guide chapter on C/POSIX migration: http://icu.sourceforge.net/userguide/posix.html#case_mappings

Parameters: ch code point whose lowercase equivalent is to be retrieved

Returns: the lowercase equivalent code point

UNKNOWN: ICU 2.1

toLowerCase

public static String toLowerCase(String str)
Returns the lowercase version of the argument string. Casing depends on the default locale and is context-sensitive.

Parameters: str the source string to convert

Returns: lowercase version of the argument string

UNKNOWN: ICU 2.1

toLowerCase

public static String toLowerCase(Locale locale, String str)
Returns the lowercase version of the argument string. Casing depends on the argument locale and is context-sensitive.

Parameters: locale the locale in which the string is to be converted; str the source string to convert

Returns: lowercase version of the argument string

UNKNOWN: ICU 2.1
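To make "dependent on the locale and context-sensitive" concrete, the sketch below uses java.lang.String.toLowerCase(Locale) as a stand-in so that it runs without icu4j.jar. It is an assumption that UCharacter.toLowerCase(Locale, String) gives the same results for these inputs; the class, constant, and helper names are hypothetical.

```java
import java.util.Locale;

final class LocaleLowerDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    static String lower(Locale locale, String s) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        // Locale dependence: capital I lowercases to dotless i (U+0131) in Turkish.
        System.out.println(lower(Locale.ROOT, "I").codePointAt(0) == 0x0069);
        System.out.println(lower(TURKISH, "I").codePointAt(0) == 0x0131);

        // Context sensitivity: a capital sigma at the end of a word becomes
        // final sigma (U+03C2) rather than medial sigma (U+03C3).
        String greek = "\u039F\u0394\u039F\u03A3";
        String lowered = lower(Locale.ROOT, greek);
        System.out.println(lowered.codePointAt(lowered.length() - 1) == 0x03C2);
    }
}
```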

toLowerCase

public static String toLowerCase(ULocale locale, String str)
Returns the lowercase version of the argument string. Casing depends on the argument locale and is context-sensitive.

Parameters: locale the locale in which the string is to be converted; str the source string to convert

Returns: lowercase version of the argument string

UNKNOWN: ICU 3.2 This API might change or be removed in a future release.

toString

public static String toString(int ch)
Converts argument code point and returns a String object representing the code point's value in UTF16 format. The result is a string whose length is 1 for non-supplementary code points, 2 otherwise.
com.ibm.icu.text.UTF16 can be used to parse Strings generated by this function.
Up-to-date Unicode implementation of java.lang.Character.toString()

Parameters: ch code point

Returns: string representation of the code point, or null if the code point is not defined in Unicode

UNKNOWN: ICU 2.1
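The length rule (1 char for non-supplementary code points, 2 otherwise) can be sketched with the JDK alone. This stand-in builds the string from Character.toChars so that it runs without icu4j.jar; the helper name is hypothetical. One difference to keep in mind: for an invalid code point the JDK call throws IllegalArgumentException, whereas the method documented here returns null.

```java
final class CodePointStringDemo {
    // Builds the UTF-16 string for a single valid code point.
    static String fromCodePoint(int cp) {
        return new String(Character.toChars(cp));
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x0041);
        String supplementary = fromCodePoint(0x1F600);
        System.out.println(bmp.length());           // one char
        System.out.println(supplementary.length()); // a surrogate pair
        // Reading the string back by code point recovers the original value.
        System.out.printf("U+%X%n", supplementary.codePointAt(0));
    }
}
```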

toTitleCase

public static int toTitleCase(int ch)
Converts the code point argument to titlecase. If no titlecase is available, the uppercase is returned. If no uppercase is available, the code point itself is returned. Up-to-date Unicode implementation of java.lang.Character.toTitleCase()

This function only returns the simple, single-code point case mapping. Full case mappings should be used whenever possible because they produce better results by working on whole strings. They take into account the string context and the language and can map to a result string with a different length as appropriate. Full case mappings are applied by the case mapping functions that take String parameters rather than code points (int). See also the User Guide chapter on C/POSIX migration: http://icu.sourceforge.net/userguide/posix.html#case_mappings

Parameters: ch code point whose title case is to be retrieved

Returns: titlecase code point

UNKNOWN: ICU 2.1
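A small sketch of the fallback order described above (titlecase, else uppercase, else the code point itself). It uses java.lang.Character.toTitleCase as a stand-in so that it runs without icu4j.jar; it is an assumption that UCharacter agrees for these samples, and the helper name is hypothetical.

```java
final class TitleCaseCodePointDemo {
    // Formats a code point next to its simple titlecase mapping.
    static String show(int cp) {
        return String.format("U+%04X -> U+%04X", cp, Character.toTitleCase(cp));
    }

    public static void main(String[] args) {
        System.out.println(show(0x01C4)); // DZ WITH CARON digraph: has a distinct titlecase form
        System.out.println(show(0x0061)); // 'a': no titlecase form, so uppercase is used
        System.out.println(show(0x0031)); // '1': no case mappings, returned unchanged
    }
}
```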

toTitleCase

public static String toTitleCase(String str, BreakIterator breakiter)

Gets the titlecase version of the argument string.

The positions to titlecase are determined by the argument break iterator, so callers can customize the break iterator for specialized titlecasing. In that case only forward iteration needs to be implemented. If the break iterator passed in is null, the default Unicode algorithm is used to determine the titlecase positions.

Only the positions returned by the break iterator are titlecased; the characters between those positions are all lowercased.

Casing depends on the default locale and is context-sensitive.

Parameters: str the source string to convert; breakiter the break iterator that determines the positions at which characters are titlecased

Returns: titlecase version of the argument string

UNKNOWN: ICU 2.6
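The rule "titlecase at each break position, lowercase everything in between" can be sketched as follows. This is a simplified illustration, not the ICU algorithm: it uses java.text.BreakIterator and the JDK case mappings so that it runs without icu4j.jar, it requires a non-null iterator, and it lowercases each segment in isolation, so it ignores cross-segment context. The class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class TitleCaseSketch {
    // Titlecases the first code point after each boundary and lowercases the rest.
    static String titleCaseAtBreaks(String str, BreakIterator iter) {
        iter.setText(str);
        StringBuilder out = new StringBuilder(str.length());
        int start = iter.first();
        // Only forward iteration (first/next) is used.
        for (int end = iter.next(); end != BreakIterator.DONE; start = end, end = iter.next()) {
            int first = str.codePointAt(start);
            out.appendCodePoint(Character.toTitleCase(first));
            int rest = start + Character.charCount(first);
            out.append(str.substring(rest, end).toLowerCase(Locale.ROOT));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        System.out.println(titleCaseAtBreaks("hello wORLD", words));
    }
}
```

Passing a different iterator (for example one that reports only sentence starts) changes which positions are titlecased without changing the loop.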

toTitleCase

public static String toTitleCase(Locale locale, String str, BreakIterator breakiter)

Gets the titlecase version of the argument string.

The positions to titlecase are determined by the argument break iterator, so callers can customize the break iterator for specialized titlecasing. In that case only forward iteration needs to be implemented. If the break iterator passed in is null, the default Unicode algorithm is used to determine the titlecase positions.

Only the positions returned by the break iterator are titlecased; the characters between those positions are all lowercased.

Casing depends on the argument locale and is context-sensitive.

Parameters: locale the locale in which the string is to be converted; str the source string to convert; breakiter the break iterator that determines the positions at which characters are titlecased

Returns: titlecase version of the argument string

UNKNOWN: ICU 2.6

toTitleCase

public static String toTitleCase(ULocale locale, String str, BreakIterator titleIter)

Gets the titlecase version of the argument string.

The positions to titlecase are determined by the argument break iterator, so callers can customize the break iterator for specialized titlecasing. In that case only forward iteration needs to be implemented. If the break iterator passed in is null, the default Unicode algorithm is used to determine the titlecase positions.

Only the positions returned by the break iterator are titlecased; the characters between those positions are all lowercased.

Casing depends on the argument locale and is context-sensitive.

Parameters: locale the locale in which the string is to be converted; str the source string to convert; titleIter the break iterator that determines the positions at which characters are titlecased

Returns: titlecase version of the argument string

UNKNOWN: ICU 3.2 This API might change or be removed in a future release.

toUpperCase

public static int toUpperCase(int ch)
Converts the character argument to uppercase. If no uppercase is available, the character itself is returned. Up-to-date Unicode implementation of java.lang.Character.toUpperCase()

This function only returns the simple, single-code point case mapping. Full case mappings should be used whenever possible because they produce better results by working on whole strings. They take into account the string context and the language and can map to a result string with a different length as appropriate. Full case mappings are applied by the case mapping functions that take String parameters rather than code points (int). See also the User Guide chapter on C/POSIX migration: http://icu.sourceforge.net/userguide/posix.html#case_mappings

Parameters: ch code point whose uppercase is to be retrieved

Returns: uppercase code point

UNKNOWN: ICU 2.1
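The difference between a simple (single code point) mapping and a full (string) mapping is easiest to see with the German sharp s. The sketch uses java.lang.Character and java.lang.String as stand-ins so that it runs without icu4j.jar; it is an assumption that the UCharacter code point and String overloads behave the same way for this input, and the helper names are hypothetical.

```java
import java.util.Locale;

final class SimpleVsFullDemo {
    // Simple mapping: one code point in, one code point out.
    static int simpleUpper(int cp) {
        return Character.toUpperCase(cp);
    }

    // Full mapping: works on strings and may change the length.
    static String fullUpper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // U+00DF LATIN SMALL LETTER SHARP S has no one-to-one uppercase mapping,
        // so the simple mapping returns it unchanged.
        System.out.printf("U+%04X%n", simpleUpper(0x00DF));
        // The full mapping expands it to two letters.
        System.out.println(fullUpper("\u00DF"));
    }
}
```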

toUpperCase

public static String toUpperCase(String str)
Returns the uppercase version of the argument string. Casing depends on the default locale and is context-sensitive.

Parameters: str the source string to convert

Returns: uppercase version of the argument string

UNKNOWN: ICU 2.1

toUpperCase

public static String toUpperCase(Locale locale, String str)
Returns the uppercase version of the argument string. Casing depends on the argument locale and is context-sensitive.

Parameters: locale the locale in which the string is to be converted; str the source string to convert

Returns: uppercase version of the argument string

UNKNOWN: ICU 2.1

toUpperCase

public static String toUpperCase(ULocale locale, String str)
Returns the uppercase version of the argument string. Casing depends on the argument locale and is context-sensitive.

Parameters: locale the locale in which the string is to be converted; str the source string to convert

Returns: uppercase version of the argument string

UNKNOWN: ICU 3.2 This API might change or be removed in a future release.

Copyright (c) 2007 IBM Corporation and others.