public class ModifiedUTF8Charset extends BaseCharset
Charset representing "Modified UTF-8". Java stores the characters of its Strings as 2-byte char primitives. These were originally UCS-2 code units, which let Java natively support ~65K Unicode characters. As of Java 5, UCS-2 is no longer used; Strings are UTF-16, which lets Java natively support the entire range of Unicode, including characters beyond the first ~65K. For higher-range UTF-16 characters with a Java char value > 0x7FFF, this charset does NOT encode the value to the correct UTF-8 byte sequence.
It's quite uncommon in most situations to actually use a character value > 0x7FFF. That is why this charset exists: it takes advantage of this property to speed up UTF-8 encoding/decoding of byte arrays. If you use this charset exclusively for serialization (for both encoding and decoding), you also don't risk any encoding/decoding issues, since the resulting Java String will always be the same as if you had actually used UTF-8.
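To make the difference from standard UTF-8 concrete, the following sketch compares two encoders that ship with the JDK. It does not use this class: it relies on DataOutputStream.writeUTF, the JDK's modified UTF-8 writer, and strips its 2-byte length prefix. The names ModifiedUtf8Comparison, standardUtf8 and jdkModifiedUtf8 are made up for illustration, and whether ModifiedUTF8Charset produces exactly the same bytes as writeUTF for supplementary characters is an assumption, not something this page states.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: compares JDK encoders, not ModifiedUTF8Charset itself.
final class ModifiedUtf8Comparison {

    /** Standard UTF-8 bytes as produced by the JDK. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Modified UTF-8 bytes from DataOutputStream.writeUTF, without the 2-byte length prefix. */
    static byte[] jdkModifiedUtf8(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bos)) {
            dos.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = bos.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        String bmpText = "h\u00e9llo";
        // U+1F600 lies outside the BMP, so Java stores it as a surrogate pair (2 chars).
        String supplementary = new String(Character.toChars(0x1F600));

        // Same bytes for ordinary text.
        System.out.println(Arrays.equals(standardUtf8(bmpText), jdkModifiedUtf8(bmpText)));
        // Standard UTF-8 uses one 4-byte sequence for the supplementary character.
        System.out.println(standardUtf8(supplementary).length);
        // Modified UTF-8 encodes each surrogate separately as 3 bytes.
        System.out.println(jdkModifiedUtf8(supplementary).length);
    }
}
```

The two byte sequences differ only for the supplementary character, which is why data written and read back with the same modified encoder round-trips, while bytes handed to a strict UTF-8 decoder may not.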
This charset is very useful for encoding/decoding directly to and from byte arrays (especially if the byte array is already allocated), where the default Java classes would force you to create a new byte array. It is also ~30% faster than Java at decoding/encoding in most cases, though in some cases it is a little slower. On average it at least matches Java and has a good chance of being much faster during decoding.
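The idea of encoding into an already allocated byte array can be sketched with plain JDK code. This is not the library's implementation: ModifiedUtf8EncodeSketch, byteLength and encodeInto are hypothetical names, and the byte layout follows the rules of DataOutputStream.writeUTF (without its length prefix). Any tweaks this charset makes on top of those rules are not reproduced here.

```java
// Illustrative sketch only; not the code behind ModifiedUTF8Charset.
final class ModifiedUtf8EncodeSketch {

    /** Bytes needed to encode s, using the 1/2/3-byte rules of DataOutputStream.writeUTF. */
    static int byteLength(CharSequence s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                n += 1;
            } else if (c <= 0x07FF) {
                n += 2; // writeUTF also encodes U+0000 as 2 bytes
            } else {
                n += 3;
            }
        }
        return n;
    }

    /** Encodes s into dest starting at offset; returns the number of bytes written. */
    static int encodeInto(CharSequence s, byte[] dest, int offset) {
        int pos = offset;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                dest[pos++] = (byte) c;
            } else if (c <= 0x07FF) {
                dest[pos++] = (byte) (0xC0 | (c >> 6));
                dest[pos++] = (byte) (0x80 | (c & 0x3F));
            } else {
                dest[pos++] = (byte) (0xE0 | (c >> 12));
                dest[pos++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                dest[pos++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        return pos - offset;
    }

    public static void main(String[] args) {
        byte[] buffer = new byte[64]; // allocated once and reused by the caller
        String text = "h\u00e9llo \u20ac";
        int needed = byteLength(text);
        int written = encodeInto(text, buffer, 8); // write at offset 8, no new array
        System.out.println(needed + " " + written);
    }
}
```

Computing the length first lets the caller check that the destination has room before writing, which is presumably the role calculateByteLength plays alongside encodeToByteArray.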
This charset was originally based on much of the work in DataOutputStream.java and DataInputStream.java with a few notable tweaks:
| Constructor and Description |
|---|
| ModifiedUTF8Charset() |
| Modifier and Type | Method and Description |
|---|---|
| static int | calculateByteLength(CharSequence charSeq)<br>Highly efficient method for calculating the byte length of a String if it was encoded as modified UTF-8 bytes. |
| static int | calculateByteLength(CharSequence charSeq, char[] charBuffer, int charOffset, int charLength)<br>Highly efficient method for calculating the byte length of a String if it was encoded as modified UTF-8 bytes. |
| String | decode(byte[] bytes)<br>Default implementation that simply returns a String by creating a new StringBuffer, appending to it, and then returning a new String. |
| String | decode(byte[] bytes, int offset, int length) |
| void | decode(byte[] bytes, StringBuilder buffer)<br>Decode the byte array to a Java string that is appended to the buffer. |
| static int | decodeToCharArray(byte[] byteBuffer, int byteOffset, int byteLength, char[] charBuffer, int charOffset) |
| byte[] | encode(CharSequence charSeq)<br>Encode the Java string into a byte array. |
| static int | encodeToByteArray(CharSequence charSeq, char[] charBuffer, int charOffset, int charLength, byte[] byteBuffer, int byteOffset)<br>Encode the string to an array of UTF-8 bytes. |
| int | estimateDecodeCharLength(byte[] bytes) |
| int | estimateEncodeByteLength(CharSequence str0) |
Methods inherited from class BaseCharset: normalize

public int estimateEncodeByteLength(CharSequence str0)

public int estimateDecodeCharLength(byte[] bytes)

public byte[] encode(CharSequence charSeq)
Description copied from: Charset
Parameters:
charSeq - The Java string to convert into a byte array

public void decode(byte[] bytes, StringBuilder buffer)
Description copied from: Charset
Parameters:
bytes - The array of bytes to decode
buffer - The String buffer to append chars to

public String decode(byte[] bytes)
Description copied from: BaseCharset
Specified by: decode in interface Charset
Overrides: decode in class BaseCharset
Parameters:
bytes - The array of bytes to decode

public String decode(byte[] bytes, int offset, int length)

public static int calculateByteLength(CharSequence charSeq)
Parameters:
charSeq - The character sequence to use for encoding.

public static int calculateByteLength(CharSequence charSeq, char[] charBuffer, int charOffset, int charLength)
Parameters:
charSeq - The optional character sequence to use for encoding rather than the provided character buffer. It is always higher performance to supply a char array vs. use a CharSequence. Set to null if the character array is supplied.
charBuffer - The source char array to encode
charOffset - The offset in the source char array to start encoding from
charLength - The length from the offset in the source char array to encode

public static int encodeToByteArray(CharSequence charSeq, char[] charBuffer, int charOffset, int charLength, byte[] byteBuffer, int byteOffset)
Parameters:
charSeq - The optional character sequence to use for encoding rather than the provided character buffer. It is always higher performance to supply a char array vs. use a CharSequence. Set to null if the character array is supplied.
charBuffer - The source char array to encode
charOffset - The offset in the source char array to start encoding from
charLength - The length from the offset in the source char array to encode
byteBuffer - The destination byte array to encode to
byteOffset - The offset in the destination byte array to start encoding to
See Also: calculateByteLength(java.lang.CharSequence)

public static int decodeToCharArray(byte[] byteBuffer, int byteOffset, int byteLength, char[] charBuffer, int charOffset)
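
The decodeToCharArray signature suggests decoding a byte range straight into a caller-supplied char array. A minimal sketch of that idea follows, using plain JDK code. ModifiedUtf8DecodeSketch and decodeInto are hypothetical names, the return value (number of chars written) is an assumption about what the real method returns, and the sketch does not validate malformed or truncated input.

```java
// Illustrative sketch only; not the code behind ModifiedUTF8Charset.
final class ModifiedUtf8DecodeSketch {

    /**
     * Decodes length bytes of src, starting at offset, into dest starting at destOffset.
     * Returns the number of chars written. Handles 1-, 2- and 3-byte sequences only.
     */
    static int decodeInto(byte[] src, int offset, int length, char[] dest, int destOffset) {
        int pos = offset;
        int end = offset + length;
        int out = destOffset;
        while (pos < end) {
            int b = src[pos++] & 0xFF;
            if (b < 0x80) {
                // Single byte: 0xxxxxxx
                dest[out++] = (char) b;
            } else if ((b & 0xE0) == 0xC0) {
                // Two bytes: 110xxxxx 10xxxxxx
                int b2 = src[pos++] & 0x3F;
                dest[out++] = (char) (((b & 0x1F) << 6) | b2);
            } else if ((b & 0xF0) == 0xE0) {
                // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
                int b2 = src[pos++] & 0x3F;
                int b3 = src[pos++] & 0x3F;
                dest[out++] = (char) (((b & 0x0F) << 12) | (b2 << 6) | b3);
            } else {
                // 4-byte sequences of standard UTF-8 are not part of this format.
                throw new IllegalArgumentException("Unsupported lead byte at index " + (pos - 1));
            }
        }
        return out - destOffset;
    }

    public static void main(String[] args) {
        byte[] bytes = "h\u00e9llo \u20ac".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        char[] chars = new char[16]; // reusable destination buffer
        int count = decodeInto(bytes, 0, bytes.length, chars, 2);
        System.out.println(new String(chars, 2, count));
    }
}
```

Because each decoded sequence yields exactly one char, the destination needs at most as many chars as there are input bytes, which gives a simple upper bound for sizing the buffer.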
Copyright © 2012-2014 Cloudhopper by Twitter. All Rights Reserved.