public final class UnicodeUtil
extends java.lang.Object
Modifier and Type | Field and Description |
---|---|
static BytesRef | BIG_TERM: A binary term consisting of a number of 0xff bytes, likely to be bigger than other terms. |
private static long | HALF_MASK |
private static long | HALF_SHIFT |
private static int | LEAD_SURROGATE_MIN_VALUE: Lead surrogate minimum value. |
private static int | LEAD_SURROGATE_OFFSET_: Value that all lead surrogates start with. |
private static int | LEAD_SURROGATE_SHIFT_: Shift value for lead surrogate to form a supplementary character. |
static int | MAX_UTF8_BYTES_PER_CHAR: Maximum number of UTF8 bytes per UTF16 character. |
private static int | SUPPLEMENTARY_MIN_VALUE: The minimum value for supplementary code points. |
private static int | SURROGATE_OFFSET |
private static int | TRAIL_SURROGATE_MASK_: Mask to retrieve the significant value from a trail surrogate. |
private static int | TRAIL_SURROGATE_MIN_VALUE: Trail surrogate minimum value. |
private static long | UNI_MAX_BMP |
static int | UNI_REPLACEMENT_CHAR |
static int | UNI_SUR_HIGH_END |
static int | UNI_SUR_HIGH_START |
static int | UNI_SUR_LOW_END |
static int | UNI_SUR_LOW_START |
(package private) static int[] | utf8CodeLength |

Modifier | Constructor and Description |
---|---|
private | UnicodeUtil() |

Modifier and Type | Method and Description |
---|---|
static int | calcUTF16toUTF8Length(java.lang.CharSequence s, int offset, int len): Calculates the number of UTF8 bytes necessary to write a UTF16 string. |
static int | codePointCount(BytesRef utf8): Returns the number of code points in this UTF8 sequence. |
static int | maxUTF8Length(int utf16Length): Returns the maximum number of UTF8 bytes required to encode a UTF16 sequence (e.g., a Java char[] or String) of the given length. |
static java.lang.String | newString(int[] codePoints, int offset, int count): Covers the JDK 1.5 API by creating a String from an array of code points. |
static java.lang.String | toHexString(java.lang.String s) |
static int | UTF16toUTF8(char[] source, int offset, int length, byte[] out): Encode characters from a char[] source, starting at offset for length chars. |
static int | UTF16toUTF8(java.lang.CharSequence s, int offset, int length, byte[] out): Encode characters from this String, starting at offset for length characters. |
static int | UTF16toUTF8(java.lang.CharSequence s, int offset, int length, byte[] out, int outOffset): Encode characters from this String, starting at offset for length characters. |
static int | UTF8toUTF16(byte[] utf8, int offset, int length, char[] out): Interprets the given byte array as UTF-8 and converts to UTF-16. |
static int | UTF8toUTF16(BytesRef bytesRef, char[] chars): Utility method for UTF8toUTF16(byte[], int, int, char[]). |
static int | UTF8toUTF32(BytesRef utf8, int[] ints): This method assumes valid UTF8 input. |
static boolean | validUTF16String(char[] s, int size) |
static boolean | validUTF16String(java.lang.CharSequence s) |

public static final BytesRef BIG_TERM
WARNING: This is not a valid UTF8 term.
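The warning above can be checked with the JDK alone. The following is a minimal sketch, not the library's code: `BigTermSketch` and its methods are hypothetical names, and the length of 10 bytes is an arbitrary assumption, since this page only says "a number of 0xff bytes". It shows that such a term fails strict UTF-8 decoding and sorts after the encoding of the largest code point under unsigned byte order.

```java
final class BigTermSketch {
    // Hypothetical stand-in for BIG_TERM; the length of 10 is an arbitrary assumption.
    static byte[] allOnes() {
        byte[] term = new byte[10];
        java.util.Arrays.fill(term, (byte) 0xFF);
        return term;
    }

    // Strict decode: returns true only if the bytes are well-formed UTF-8.
    static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            java.nio.charset.StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(java.nio.charset.CodingErrorAction.REPORT)
                .onUnmappableCharacter(java.nio.charset.CodingErrorAction.REPORT)
                .decode(java.nio.ByteBuffer.wrap(bytes));
            return true;
        } catch (java.nio.charset.CharacterCodingException e) {
            return false;
        }
    }

    // Lexicographic comparison that treats each byte as unsigned (0..255).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

The byte 0xFF never occurs in well-formed UTF-8, which is why a run of such bytes compares greater than any encoded string in this ordering.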
public static final int UNI_SUR_HIGH_START
public static final int UNI_SUR_HIGH_END
public static final int UNI_SUR_LOW_START
public static final int UNI_SUR_LOW_END
public static final int UNI_REPLACEMENT_CHAR
private static final long UNI_MAX_BMP
private static final long HALF_SHIFT
private static final long HALF_MASK
private static final int SURROGATE_OFFSET
public static final int MAX_UTF8_BYTES_PER_CHAR
static final int[] utf8CodeLength
private static final int LEAD_SURROGATE_SHIFT_
private static final int TRAIL_SURROGATE_MASK_
private static final int TRAIL_SURROGATE_MIN_VALUE
private static final int LEAD_SURROGATE_MIN_VALUE
private static final int SUPPLEMENTARY_MIN_VALUE
private static final int LEAD_SURROGATE_OFFSET_
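This page does not list the constant values. As an illustration only, the sketch below uses the standard Unicode surrogate ranges and replacement character, which the UNI_SUR_* and UNI_REPLACEMENT_CHAR constants are assumed to match, and shows how a lead and trail surrogate combine into a supplementary code point. `SurrogateSketch` and its members are hypothetical names; returning the replacement character for a mismatched pair is a choice made for this sketch.

```java
final class SurrogateSketch {
    // Standard Unicode values; assumed (not confirmed here) to equal the UNI_* constants.
    static final int SUR_HIGH_START = 0xD800;
    static final int SUR_HIGH_END = 0xDBFF;
    static final int SUR_LOW_START = 0xDC00;
    static final int SUR_LOW_END = 0xDFFF;
    static final int REPLACEMENT_CHAR = 0xFFFD;
    static final int SUPPLEMENTARY_MIN = 0x10000;

    // Combines a lead (high) and trail (low) surrogate into one code point.
    // Returns U+FFFD when the two chars do not form a valid pair.
    static int combine(char high, char low) {
        boolean validHigh = high >= SUR_HIGH_START && high <= SUR_HIGH_END;
        boolean validLow = low >= SUR_LOW_START && low <= SUR_LOW_END;
        if (!validHigh || !validLow) {
            return REPLACEMENT_CHAR;
        }
        // Each surrogate carries 10 payload bits.
        return ((high - SUR_HIGH_START) << 10) + (low - SUR_LOW_START) + SUPPLEMENTARY_MIN;
    }
}
```

For a valid pair the result agrees with the JDK's `Character.toCodePoint(high, low)`.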
public static int UTF16toUTF8(char[] source, int offset, int length, byte[] out)
public static int UTF16toUTF8(java.lang.CharSequence s, int offset, int length, byte[] out)
public static int UTF16toUTF8(java.lang.CharSequence s, int offset, int length, byte[] out, int outOffset)
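Encodes characters from this String, starting at offset for length characters. Output to the destination array begins at outOffset. It is the responsibility of the caller to make sure that the destination array is large enough.
Note that this method returns the final output offset (outOffset + number of bytes written).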
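To make the offset-based contract concrete, here is a minimal, JDK-only sketch of a UTF-16 to UTF-8 encoder that writes at a given output offset and returns the final offset. It is not the library's implementation: `Utf16ToUtf8Sketch.encode` is a hypothetical name, and replacing unpaired surrogates with U+FFFD is an assumption of this sketch.

```java
final class Utf16ToUtf8Sketch {
    // Writes the UTF-8 form of s[offset, offset + length) into out, starting at
    // outOffset, and returns the final output offset. The caller sizes out.
    static int encode(CharSequence s, int offset, int length, byte[] out, int outOffset) {
        int upto = outOffset;
        final int end = offset + length;
        for (int i = offset; i < end; i++) {
            int code = s.charAt(i);
            if (code < 0x80) {                       // 1 byte
                out[upto++] = (byte) code;
            } else if (code < 0x800) {               // 2 bytes
                out[upto++] = (byte) (0xC0 | (code >> 6));
                out[upto++] = (byte) (0x80 | (code & 0x3F));
            } else if (code < 0xD800 || code > 0xDFFF) { // 3 bytes, non-surrogate
                out[upto++] = (byte) (0xE0 | (code >> 12));
                out[upto++] = (byte) (0x80 | ((code >> 6) & 0x3F));
                out[upto++] = (byte) (0x80 | (code & 0x3F));
            } else {
                // Surrogate: try to pair a lead with the following trail.
                if (code < 0xDC00 && i + 1 < end) {
                    int low = s.charAt(i + 1);
                    if (low >= 0xDC00 && low <= 0xDFFF) {
                        int cp = ((code - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
                        i++;                         // consume the trail surrogate
                        out[upto++] = (byte) (0xF0 | (cp >> 18));
                        out[upto++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                        out[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                        out[upto++] = (byte) (0x80 | (cp & 0x3F));
                        continue;
                    }
                }
                // Assumption of this sketch: unpaired surrogates become U+FFFD (EF BF BD).
                out[upto++] = (byte) 0xEF;
                out[upto++] = (byte) 0xBF;
                out[upto++] = (byte) 0xBD;
            }
        }
        return upto;
    }
}
```

Because the return value is an absolute offset, the number of bytes written is the return value minus outOffset, and the result can be passed directly as the outOffset of a following call to append more text.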
public static int calcUTF16toUTF8Length(java.lang.CharSequence s, int offset, int len)
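A length calculation of this kind can be sketched without writing any bytes. The code below is an illustration under stated assumptions, not the library's implementation: `Utf8LengthSketch.utf8Length` is a hypothetical name, and an unpaired surrogate is counted as three bytes on the assumption that it would be encoded as U+FFFD.

```java
final class Utf8LengthSketch {
    // Counts the UTF-8 bytes needed for s[offset, offset + len) without encoding.
    static int utf8Length(CharSequence s, int offset, int len) {
        int bytes = 0;
        final int end = offset + len;
        for (int i = offset; i < end; i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;
            } else if (c < 0x800) {
                bytes += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < end
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;   // valid surrogate pair: one supplementary code point
                i++;          // skip the trail surrogate
            } else {
                bytes += 3;   // other BMP char, or unpaired surrogate counted as U+FFFD
            }
        }
        return bytes;
    }
}
```

A caller can use such a count to allocate an exactly sized destination array before encoding.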
public static boolean validUTF16String(java.lang.CharSequence s)
public static boolean validUTF16String(char[] s, int size)
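This page gives no description for the two validUTF16String overloads. Assuming they check that surrogates are correctly paired, which is the usual meaning of a valid UTF-16 string, the check could be sketched as follows. `Utf16ValidationSketch.isValidUtf16` is a hypothetical name.

```java
final class Utf16ValidationSketch {
    // Returns true if every lead surrogate is immediately followed by a trail
    // surrogate and no trail surrogate appears on its own.
    static boolean isValidUtf16(CharSequence s) {
        final int size = s.length();
        for (int i = 0; i < size; i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < size && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++;            // well-formed pair, skip the trail
                } else {
                    return false;   // lead surrogate without a trail
                }
            } else if (Character.isLowSurrogate(c)) {
                return false;       // trail surrogate without a lead
            }
        }
        return true;
    }
}
```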
public static int codePointCount(BytesRef utf8)
This method assumes valid UTF8 input. It does not perform full UTF8 validation; it checks only the first byte of each code point (for multi-byte sequences, any bytes after the head are skipped).
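The head-byte-only strategy described above can be sketched as follows. This is an illustration that works on a plain byte range rather than a BytesRef and is not the library's code: `CodePointCountSketch.countCodePoints` is a hypothetical name.

```java
final class CodePointCountSketch {
    // Counts code points by looking only at each sequence's first byte and
    // skipping the continuation bytes it announces.
    static int countCodePoints(byte[] utf8, int offset, int length) {
        int count = 0;
        int pos = offset;
        final int limit = offset + length;
        while (pos < limit) {
            int lead = utf8[pos] & 0xFF;
            if (lead < 0x80) {
                pos += 1;                       // 0xxxxxxx
            } else if (lead >= 0xC0 && lead < 0xE0) {
                pos += 2;                       // 110xxxxx
            } else if (lead >= 0xE0 && lead < 0xF0) {
                pos += 3;                       // 1110xxxx
            } else if (lead >= 0xF0 && lead < 0xF8) {
                pos += 4;                       // 11110xxx
            } else {
                throw new IllegalArgumentException("invalid header byte at index " + pos);
            }
            count++;
        }
        if (pos > limit) {
            throw new IllegalArgumentException("content is prematurely truncated");
        }
        return count;
    }
}
```

Continuation bytes are never inspected, so a sequence with a correct head byte and corrupt trailing bytes is still counted as one code point.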
java.lang.IllegalArgumentException - If an invalid code point header byte occurs or the content is prematurely truncated.
public static int UTF8toUTF32(BytesRef utf8, int[] ints)
This method assumes valid UTF8 input. It does not perform full UTF8 validation; it checks only the first byte of each code point (for multi-byte sequences, any bytes after the head are skipped). It is the responsibility of the caller to make sure that the destination array is large enough.
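A decoder with the same "trust the head byte" behavior might look like the sketch below. It takes a plain byte range instead of a BytesRef, and `Utf8ToUtf32Sketch.decode` is a hypothetical name; none of this is confirmed as the library's implementation.

```java
final class Utf8ToUtf32Sketch {
    // Decodes UTF-8 into code points, returning how many were written to out.
    // Only the head byte of each sequence is checked; continuation bytes are
    // masked and merged without validation. The caller sizes out.
    static int decode(byte[] utf8, int offset, int length, int[] out) {
        int count = 0;
        int pos = offset;
        final int limit = offset + length;
        while (pos < limit) {
            int lead = utf8[pos++] & 0xFF;
            int cp;
            int extra;                           // number of continuation bytes
            if (lead < 0x80) {
                cp = lead;
                extra = 0;
            } else if (lead >= 0xC0 && lead < 0xE0) {
                cp = lead & 0x1F;
                extra = 1;
            } else if (lead >= 0xE0 && lead < 0xF0) {
                cp = lead & 0x0F;
                extra = 2;
            } else if (lead >= 0xF0 && lead < 0xF8) {
                cp = lead & 0x07;
                extra = 3;
            } else {
                throw new IllegalArgumentException("invalid header byte at index " + (pos - 1));
            }
            if (pos + extra > limit) {
                throw new IllegalArgumentException("content is prematurely truncated");
            }
            for (int k = 0; k < extra; k++) {
                cp = (cp << 6) | (utf8[pos++] & 0x3F);  // append 6 payload bits
            }
            out[count++] = cp;
        }
        return count;
    }
}
```

An int array with one slot per input byte is always large enough, since every code point occupies at least one byte.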
java.lang.IllegalArgumentException - If an invalid code point header byte occurs or the content is prematurely truncated.
public static java.lang.String newString(int[] codePoints, int offset, int count)
codePoints - The code array
offset - The start of the text in the code point array
count - The number of code points
java.lang.IllegalArgumentException - If an invalid code point is encountered
java.lang.IndexOutOfBoundsException - If the offset or count are out of bounds.
public static java.lang.String toHexString(java.lang.String s)
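The parameters and exceptions listed for newString match the JDK's own `String(int[] codePoints, int offset, int count)` constructor, available since Java 5. The sketch below uses that constructor directly to illustrate the contract; `NewStringSketch.fromCodePoints` is a hypothetical wrapper name, and nothing here asserts how the library method is implemented.

```java
final class NewStringSketch {
    // Builds a String from count code points starting at offset.
    // The JDK constructor throws IllegalArgumentException for an invalid code
    // point and IndexOutOfBoundsException for a bad offset or count.
    static String fromCodePoints(int[] codePoints, int offset, int count) {
        return new String(codePoints, offset, count);
    }
}
```

Note that the resulting String can be longer than count, because each supplementary code point becomes two chars.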
public static int UTF8toUTF16(byte[] utf8, int offset, int length, char[] out)
NOTE: Full characters are read, even if this reads past the length passed (and can result in an ArrayIndexOutOfBoundsException if invalid UTF-8 is passed). Explicit checks for valid UTF-8 are not performed.
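The consequence of that note is easier to see in code. The following unchecked decoder is a sketch written for illustration, not the library's implementation: `Utf8ToUtf16Sketch.decode` is a hypothetical name. It always consumes the full sequence announced by a head byte, so truncated input makes it index past the end of the array.

```java
final class Utf8ToUtf16Sketch {
    // Decodes UTF-8 into UTF-16 chars and returns the number of chars written.
    // No validation: whole sequences are read even past offset + length, and
    // stray continuation bytes are decoded as if they were 2-byte heads.
    static int decode(byte[] utf8, int offset, int length, char[] out) {
        int outPos = 0;
        int pos = offset;
        final int limit = offset + length;
        while (pos < limit) {
            int b = utf8[pos++] & 0xFF;
            if (b < 0x80) {                                   // 1 byte
                out[outPos++] = (char) b;
            } else if (b < 0xE0) {                            // 2 bytes
                out[outPos++] = (char) (((b & 0x1F) << 6) | (utf8[pos++] & 0x3F));
            } else if (b < 0xF0) {                            // 3 bytes
                out[outPos++] = (char) (((b & 0x0F) << 12)
                        | ((utf8[pos] & 0x3F) << 6)
                        | (utf8[pos + 1] & 0x3F));
                pos += 2;
            } else {                                          // 4 bytes
                int cp = ((b & 0x07) << 18)
                        | ((utf8[pos] & 0x3F) << 12)
                        | ((utf8[pos + 1] & 0x3F) << 6)
                        | (utf8[pos + 2] & 0x3F);
                pos += 3;
                int v = cp - 0x10000;                         // 20 payload bits
                out[outPos++] = (char) (0xD800 + (v >> 10));  // lead surrogate
                out[outPos++] = (char) (0xDC00 + (v & 0x3FF)); // trail surrogate
            }
        }
        return outPos;
    }
}
```

A destination of `length` chars is always sufficient for well-formed input, because no UTF-8 sequence produces more chars than it has bytes.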
public static int maxUTF8Length(int utf16Length)
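The bound follows from the encoding itself: a single UTF-16 char in the BMP needs at most three UTF-8 bytes, and a surrogate pair uses two chars for four bytes. The sketch below assumes MAX_UTF8_BYTES_PER_CHAR is 3 on that basis, since this page does not state the value. `MaxUtf8LengthSketch` is a hypothetical name, and the overflow check is a choice made for this sketch.

```java
final class MaxUtf8LengthSketch {
    // Assumed value of MAX_UTF8_BYTES_PER_CHAR (not stated on this page).
    static final int MAX_BYTES_PER_CHAR = 3;

    // Worst-case UTF-8 size for a UTF-16 sequence of the given length.
    // Math.multiplyExact throws ArithmeticException instead of overflowing.
    static int maxUtf8Length(int utf16Length) {
        return Math.multiplyExact(utf16Length, MAX_BYTES_PER_CHAR);
    }
}
```

A buffer sized this way is safe for any input of that length without a prior pass over the characters, at the cost of over-allocating for mostly ASCII text.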
public static int UTF8toUTF16(BytesRef bytesRef, char[] chars)
Utility method for UTF8toUTF16(byte[], int, int, char[]).
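The relationship between the two overloads is plain delegation: the slice object supplies the array, offset, and length. The sketch below illustrates that pattern with a hypothetical `Slice` class standing in for BytesRef, and with the array-based decoder delegated to the JDK rather than hand-written. All names here are assumptions for illustration.

```java
final class BytesSliceSketch {
    // Hypothetical stand-in for BytesRef: a byte array plus offset and length.
    static final class Slice {
        final byte[] bytes;
        final int offset;
        final int length;

        Slice(byte[] bytes, int offset, int length) {
            this.bytes = bytes;
            this.offset = offset;
            this.length = length;
        }
    }

    // Array-based decoder. For brevity this sketch relies on the JDK, which
    // (unlike an unchecked decoder) substitutes U+FFFD for malformed input.
    static int decode(byte[] utf8, int offset, int length, char[] out) {
        String s = new String(utf8, offset, length, java.nio.charset.StandardCharsets.UTF_8);
        s.getChars(0, s.length(), out, 0);
        return s.length();
    }

    // Convenience overload: unpacks the slice and delegates.
    static int decode(Slice slice, char[] out) {
        return decode(slice.bytes, slice.offset, slice.length, out);
    }
}
```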