public class Utf8 extends Object
There are several variants of UTF-8. The one implemented by this class is the restricted definition of UTF-8 introduced in Unicode 3.1, which mandates the rejection of "overlong" byte sequences as well as rejection of 3-byte surrogate codepoint byte sequences. Note that the UTF-8 decoder included in Oracle's JDK has been modified to also reject "overlong" byte sequences, but (as of 2011) still accepts 3-byte surrogate codepoint byte sequences.
The byte sequences considered valid by this class are exactly those that can be roundtrip converted to Strings and back to bytes using the UTF-8 charset, without loss:
Arrays.equals(bytes, new String(bytes, Internal.UTF_8).getBytes(Internal.UTF_8))
See the Unicode Standard, Table 3-6 (UTF-8 Bit Distribution) and Table 3-7 (Well-Formed UTF-8 Byte Sequences).
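The round-trip property stated above can be tried with the JDK alone. The following is a minimal sketch, not this class's implementation: it swaps in StandardCharsets.UTF_8 for Internal.UTF_8 so that the snippet has no other dependency, and the class and method names (RoundTripCheck, roundTrips) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper illustrating the round-trip definition of validity.
final class RoundTripCheck {
    /** Returns true if decoding and re-encoding the bytes reproduces them exactly. */
    static boolean roundTrips(byte[] bytes) {
        // Malformed input is decoded to U+FFFD replacement characters,
        // which re-encode to different bytes, so the comparison fails.
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(bytes, reencoded);
    }

    public static void main(String[] args) {
        byte[] wellFormed = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] overlong = {(byte) 0xC0, (byte) 0x80};               // overlong encoding of U+0000
        byte[] surrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0x80}; // 3-byte encoding of U+D800
        System.out.println(roundTrips(wellFormed));
        System.out.println(roundTrips(overlong));
        System.out.println(roundTrips(surrogate));
    }
}
```

On a current JDK the first check should succeed and the other two should fail, because neither an overlong sequence nor an encoded surrogate survives the round trip.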
This class supports decoding of partial byte sequences, so that the
bytes of a complete UTF-8 byte sequence can be stored in multiple
segments. Methods typically return one of three results: MALFORMED if
the partial byte sequence is definitely not well-formed; COMPLETE if it
is well-formed in the absence of additional input; or, if the byte
sequence apparently terminated in the middle of a character, an opaque
integer "state" value containing enough information to decode the
character when passed to a subsequent invocation of a partial decoding
method.
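The state-threading contract described above can be illustrated with a small stand-alone validator. This is only a sketch under stated assumptions: the class name, the constant values, and the way pending bytes are packed into the int are all hypothetical and are not taken from this class, whose state value is opaque. The byte ranges follow Table 3-7 of the Unicode Standard.

```java
// Hypothetical sketch of a resumable UTF-8 check; not this class's implementation.
final class SegmentedUtf8Sketch {
    static final int COMPLETE = 0;   // assumed value, for illustration only
    static final int MALFORMED = -1; // assumed value, for illustration only

    /** Checks bytes[index, limit) given the state returned for the previous segment. */
    static int partialCheck(int state, byte[] bytes, int index, int limit) {
        if (state == MALFORMED) {
            return MALFORMED;
        }
        // Unpack the pending bytes of an unfinished character (lead byte in the low 8 bits).
        int[] seq = new int[4];
        int n = 0;
        for (int s = state; s != 0; s >>>= 8) {
            seq[n++] = s & 0xFF;
        }
        int i = index;
        for (;;) {
            if (n == 0) {
                if (i >= limit) {
                    return COMPLETE; // ended exactly on a character boundary
                }
                seq[n++] = bytes[i++] & 0xFF;
            }
            int need = sequenceLength(seq[0]);
            if (need == 0) {
                return MALFORMED; // stray continuation byte or invalid lead byte
            }
            while (n < need && i < limit) {
                int b = bytes[i++] & 0xFF;
                if (!validContinuation(seq[0], n, b)) {
                    return MALFORMED;
                }
                seq[n++] = b;
            }
            if (n < need) {
                return pack(seq, n); // input ended mid-character: hand back the pending bytes
            }
            n = 0; // character finished, start the next one
        }
    }

    /** Total length implied by a lead byte, or 0 if the byte cannot start a character. */
    static int sequenceLength(int lead) {
        if (lead < 0x80) return 1;
        if (lead >= 0xC2 && lead <= 0xDF) return 2; // 0xC0 and 0xC1 would be overlong
        if (lead >= 0xE0 && lead <= 0xEF) return 3;
        if (lead >= 0xF0 && lead <= 0xF4) return 4;
        return 0;
    }

    /** Range check for the byte at position pos (1-based after the lead byte). */
    static boolean validContinuation(int lead, int pos, int b) {
        int lo = 0x80;
        int hi = 0xBF;
        if (pos == 1) {
            switch (lead) {
                case 0xE0: lo = 0xA0; break; // rejects overlong 3-byte forms
                case 0xED: hi = 0x9F; break; // rejects surrogates U+D800..U+DFFF
                case 0xF0: lo = 0x90; break; // rejects overlong 4-byte forms
                case 0xF4: hi = 0x8F; break; // rejects code points above U+10FFFF
                default: break;
            }
        }
        return b >= lo && b <= hi;
    }

    /** Packs up to three pending bytes into a positive int; none of them can be zero. */
    static int pack(int[] seq, int n) {
        int s = 0;
        for (int k = n - 1; k >= 0; k--) {
            s = (s << 8) | seq[k];
        }
        return s;
    }
}
```

A caller would start with COMPLETE, feed each segment in order while passing along the value returned for the previous one, and stop early on MALFORMED. If the value after the last segment is neither COMPLETE nor MALFORMED, the input ended in the middle of a character.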
| Modifier and Type | Field and Description |
|---|---|
| static int | COMPLETE - State value indicating that the byte sequence is well-formed and complete (no further bytes are needed to complete a character). |
| static int | MALFORMED - State value indicating that the byte sequence is definitely not well-formed. |
| Constructor and Description |
|---|
| Utf8() |
| Modifier and Type | Method and Description |
|---|---|
| static boolean | isValidUtf8(byte[] bytes) - Returns true if the given byte array is a well-formed UTF-8 byte sequence. |
| static boolean | isValidUtf8(byte[] bytes, int index, int limit) - Returns true if the given byte array slice is a well-formed UTF-8 byte sequence. |
| static int | partialIsValidUtf8(int state, byte[] bytes, int index, int limit) - Tells whether the given byte array slice is a well-formed, malformed, or incomplete UTF-8 byte sequence. |
public static final int COMPLETE
public static final int MALFORMED
public static boolean isValidUtf8(byte[] bytes)
Returns true if the given byte array is a well-formed
UTF-8 byte sequence.
This is a convenience method, equivalent to a call to isValidUtf8(bytes, 0, bytes.length).
public static boolean isValidUtf8(byte[] bytes,
int index,
int limit)
Returns true if the given byte array slice is a
well-formed UTF-8 byte sequence. The range of bytes to be
checked extends from index (inclusive) to limit (exclusive).
This is a convenience method, equivalent to partialIsValidUtf8(Utf8.COMPLETE, bytes, index, limit) == Utf8.COMPLETE.
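The half-open range convention (index inclusive, limit exclusive) can be illustrated with the JDK's own strict decoder. This is a sketch with a hypothetical class and method name, using CharsetDecoder as a stand-in rather than this class; note that ByteBuffer.wrap takes an offset and a length, so the limit has to be converted.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: validates bytes[index, limit) with the JDK decoder in strict mode.
final class SliceCheck {
    static boolean isValidSlice(byte[] bytes, int index, int limit) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                // wrap(array, offset, length): the length is limit - index, not limit
                .decode(ByteBuffer.wrap(bytes, index, limit - index));
            return true;
        } catch (CharacterCodingException e) {
            // Reported for malformed input, including a character cut off at the limit.
            return false;
        }
    }
}
```

With the bytes 0x41 0xC3 0xA9 0x42 ("AéB"), the slice from 0 to 4 is well-formed, while the slice from 0 to 2 is not, because it ends after the first byte of the two-byte character.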
public static int partialIsValidUtf8(int state,
byte[] bytes,
int index,
int limit)
Tells whether the given byte array slice is a well-formed,
malformed, or incomplete UTF-8 byte sequence. The range of bytes to be
checked extends from index (inclusive) to limit (exclusive).
state - either COMPLETE (if this is the initial decoding
operation) or the value returned from a call to a partial decoding method
for the previous bytes.
Returns MALFORMED if the partial byte sequence is
definitely not well-formed; COMPLETE if it is well-formed
(no additional input needed); or, if the byte sequence is
"incomplete", i.e. apparently terminated in the middle of a character,
an opaque integer "state" value containing enough information to
decode the character when passed to a subsequent invocation of a
partial decoding method.
Copyright © 2019. All rights reserved.