Class DataUtilities
- java.lang.Object
-
- com.upokecenter.util.DataUtilities
-
public final class DataUtilities extends java.lang.Object
Contains methods useful for reading and writing text strings. It is designed to have no dependencies other than the basic runtime class library.
Many of these methods work with text encoded in UTF-8, an encoding form of the Unicode Standard that uses one byte to encode the most basic characters and two to four bytes to encode other characters. For example, the GetUtf8Bytes method converts a text string to an array of bytes in UTF-8.
In C# and Java, text strings are represented as sequences of 16-bit values called chars. These sequences are well-formed under UTF-16, a 16-bit encoding form of Unicode, unless they contain unpaired surrogate code points. (Surrogate code points are used in pairs to encode supplementary characters, those with code points U+10000 or higher, in UTF-16. A surrogate pair is a high surrogate, U+D800 to U+DBFF, followed by a low surrogate, U+DC00 to U+DFFF. An unpaired surrogate code point is a surrogate that does not appear in a surrogate pair.) Many of the methods in this class let the caller choose what happens when unpaired surrogate code points are found in text strings, such as throwing an error or treating the unpaired surrogate as a replacement character (U+FFFD).
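The encoding facts above can be checked with the standard Java library alone. The following is a minimal sketch, not part of this class; the class name SurrogateBasicsSketch and its two helper methods are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

final class SurrogateBasicsSketch {
  // Number of bytes the string occupies in UTF-8 (well-formed input assumed).
  static int utf8ByteCount(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  // True if the char at index i is a surrogate that is not part of a
  // high-surrogate + low-surrogate pair.
  static boolean isUnpairedSurrogateAt(String s, int i) {
    char c = s.charAt(i);
    if (Character.isHighSurrogate(c)) {
      return i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1));
    }
    if (Character.isLowSurrogate(c)) {
      return i == 0 || !Character.isHighSurrogate(s.charAt(i - 1));
    }
    return false;
  }

  public static void main(String[] args) {
    String grin = "\uD83D\uDE00"; // U+1F600 written as a surrogate pair
    System.out.println(utf8ByteCount("A"));       // 1 byte
    System.out.println(utf8ByteCount("\u00e9"));  // 2 bytes
    System.out.println(utf8ByteCount("\u20ac"));  // 3 bytes
    System.out.println(utf8ByteCount(grin));      // 4 bytes
    System.out.println(isUnpairedSurrogateAt(grin, 0));     // false
    System.out.println(isUnpairedSurrogateAt("\uD83D", 0)); // true
  }
}
```

The byte counts match the one-byte and two-to-four-byte ranges described above; the pairing check only looks at adjacent chars, which is all the definition of a surrogate pair requires.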
-
-
Method Summary
Modifier and Type, Method - Description
static int CodePointAt(java.lang.String str, int index) - Gets the Unicode code point at the given index of the string.
static int CodePointAt(java.lang.String str, int index, int surrogateBehavior) - Gets the Unicode code point at the given index of the string.
static int CodePointBefore(java.lang.String str, int index) - Gets the Unicode code point just before the given index of the string.
static int CodePointBefore(java.lang.String str, int index, int surrogateBehavior) - Gets the Unicode code point just before the given index of the string.
static int CodePointCompare(java.lang.String strA, java.lang.String strB) - Compares two strings in Unicode code point order.
static int CodePointLength(java.lang.String str) - Finds the number of Unicode code points in the given text string.
static byte[] GetUtf8Bytes(java.lang.String str, boolean replace) - Encodes a string in UTF-8 as a byte array.
static byte[] GetUtf8Bytes(java.lang.String str, boolean replace, boolean lenientLineBreaks) - Encodes a string in UTF-8 as a byte array.
static long GetUtf8Length(java.lang.String str, boolean replace) - Calculates the number of bytes needed to encode a string in UTF-8.
static java.lang.String GetUtf8String(byte[] bytes, boolean replace) - Generates a text string from a UTF-8 byte array.
static java.lang.String GetUtf8String(byte[] bytes, int offset, int bytesCount, boolean replace) - Generates a text string from a portion of a UTF-8 byte array.
static int ReadUtf8(java.io.InputStream stream, int bytesCount, java.lang.StringBuilder builder, boolean replace) - Reads a string in UTF-8 encoding from a data stream.
static int ReadUtf8FromBytes(byte[] data, int offset, int bytesCount, java.lang.StringBuilder builder, boolean replace) - Reads a string in UTF-8 encoding from a byte array.
static java.lang.String ReadUtf8ToString(java.io.InputStream stream) - Reads a string in UTF-8 encoding from a data stream in full and returns that string.
static java.lang.String ReadUtf8ToString(java.io.InputStream stream, int bytesCount, boolean replace) - Reads a string in UTF-8 encoding from a data stream and returns that string.
static java.lang.String ToLowerCaseAscii(java.lang.String str) - Returns a string with the basic upper-case letters A to Z (U+0041 to U+005A) converted to the corresponding basic lower-case letters.
static java.lang.String ToUpperCaseAscii(java.lang.String str) - Returns a string with the basic lower-case letters a to z (U+0061 to U+007A) converted to the corresponding basic upper-case letters.
static int WriteUtf8(java.lang.String str, int offset, int length, java.io.OutputStream stream, boolean replace) - Writes a portion of a string in UTF-8 encoding to a data stream.
static int WriteUtf8(java.lang.String str, int offset, int length, java.io.OutputStream stream, boolean replace, boolean lenientLineBreaks) - Writes a portion of a string in UTF-8 encoding to a data stream.
static int WriteUtf8(java.lang.String str, java.io.OutputStream stream, boolean replace) - Writes a string in UTF-8 encoding to a data stream.
-
-
-
Method Detail
-
GetUtf8String
public static java.lang.String GetUtf8String(byte[] bytes, boolean replace)
Generates a text string from a UTF-8 byte array.
- Parameters:
bytes - A byte array containing text encoded in UTF-8.
replace - If true, replaces invalid encoding with the replacement character (U+FFFD). If false, stops processing when invalid UTF-8 is seen.
- Returns:
- A string represented by the UTF-8 byte array.
- Throws:
java.lang.NullPointerException - The parameter bytes is null.
java.lang.IllegalArgumentException - The string is not valid UTF-8 and replace is false.
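The two settings of replace can be approximated with the standard java.nio.charset decoder. The following is a minimal sketch under that assumption, not this method's implementation; the class name Utf8DecodeSketch and its decode method are hypothetical. The number of replacement characters produced for longer malformed sequences may differ between the standard decoder and this class.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8DecodeSketch {
  // Decodes UTF-8 bytes. With replace == true, malformed input becomes U+FFFD;
  // with replace == false, malformed input raises IllegalArgumentException.
  static String decode(byte[] bytes, boolean replace) {
    CodingErrorAction action =
        replace ? CodingErrorAction.REPLACE : CodingErrorAction.REPORT;
    try {
      return StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(action)
          .onUnmappableCharacter(action)
          .decode(ByteBuffer.wrap(bytes))
          .toString();
    } catch (CharacterCodingException ex) {
      throw new IllegalArgumentException("Invalid UTF-8", ex);
    }
  }

  public static void main(String[] args) {
    byte[] bad = {0x41, (byte) 0xFF, 0x42}; // 0xFF never appears in valid UTF-8
    System.out.println(decode(bad, true).equals("A\uFFFDB")); // true
    try {
      decode(bad, false);
    } catch (IllegalArgumentException ex) {
      System.out.println("rejected: " + ex.getMessage());
    }
  }
}
```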
-
CodePointLength
public static int CodePointLength(java.lang.String str)
Finds the number of Unicode code points in the given text string. Each unpaired surrogate code point increases this number by 1. The result is not necessarily the length of the string in chars.
- Parameters:
str - A text string.
- Returns:
- The number of Unicode code points in the given string.
- Throws:
java.lang.NullPointerException - The parameter str is null.
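The counting rule described above (a surrogate pair counts once, an unpaired surrogate counts once) can be sketched as follows. This is an illustration only, not this method's implementation; the class name CodePointLengthSketch is hypothetical. The standard method String.codePointCount follows the same rule for unpaired surrogates.

```java
final class CodePointLengthSketch {
  // Counts code points: a well-formed surrogate pair counts as one,
  // and every other char (including an unpaired surrogate) counts as one.
  static int codePointLength(String str) {
    if (str == null) {
      throw new NullPointerException("str");
    }
    int count = 0;
    for (int i = 0; i < str.length(); ++i) {
      if (Character.isHighSurrogate(str.charAt(i))
          && i + 1 < str.length()
          && Character.isLowSurrogate(str.charAt(i + 1))) {
        ++i; // skip the low half of the pair
      }
      ++count;
    }
    return count;
  }

  public static void main(String[] args) {
    String s = "a\uD83D\uDE00"; // 'a' followed by U+1F600
    System.out.println(s.length());         // 3 chars
    System.out.println(codePointLength(s)); // 2 code points
  }
}
```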
-
GetUtf8String
public static java.lang.String GetUtf8String(byte[] bytes, int offset, int bytesCount, boolean replace)
Generates a text string from a portion of a UTF-8 byte array.
- Parameters:
bytes - A byte array containing text encoded in UTF-8.
offset - Offset into the byte array to start reading.
bytesCount - Length, in bytes, of the UTF-8 text string.
replace - If true, replaces invalid encoding with the replacement character (U+FFFD). If false, stops processing when invalid UTF-8 is seen.
- Returns:
- A string represented by the UTF-8 byte array.
- Throws:
java.lang.NullPointerException - The parameter bytes is null.
java.lang.IllegalArgumentException - The portion of the byte array is not valid UTF-8 and replace is false.
java.lang.IllegalArgumentException - The parameter offset is less than 0, bytesCount is less than 0, or offset plus bytesCount is greater than the length of bytes.
-
GetUtf8Bytes
public static byte[] GetUtf8Bytes(java.lang.String str, boolean replace)
Encodes a string in UTF-8 as a byte array. This method does not insert a byte-order mark (U+FEFF) at the beginning of the encoded byte array.
REMARK: It is not recommended to use Encoding.UTF8.GetBytes in .NET, or the getBytes() method in Java, to do this. For instance, getBytes() encodes text strings in the platform's default character encoding, which is not fixed and can be undesirable.
- Parameters:
str - A text string.
replace - If true, replaces unpaired surrogate code points with the replacement character (U+FFFD). If false, stops processing when an unpaired surrogate code point is seen.
- Returns:
- The string encoded in UTF-8.
- Throws:
java.lang.NullPointerException - The parameter str is null.
java.lang.IllegalArgumentException - The string contains an unpaired surrogate code point and replace is false, or an internal error occurred.
-
GetUtf8Bytes
public static byte[] GetUtf8Bytes(java.lang.String str, boolean replace, boolean lenientLineBreaks)
Encodes a string in UTF-8 as a byte array. This method does not insert a byte-order mark (U+FEFF) at the beginning of the encoded byte array.
REMARK: It is not recommended to use Encoding.UTF8.GetBytes in .NET, or the getBytes() method in Java, to do this. For instance, getBytes() encodes text strings in the platform's default character encoding, which is not fixed and can be undesirable.
- Parameters:
str - A text string.
replace - If true, replaces unpaired surrogate code points with the replacement character (U+FFFD). If false, stops processing when an unpaired surrogate code point is seen.
lenientLineBreaks - If true, replaces carriage return (CR) not followed by line feed (LF) and LF not preceded by CR with CR-LF pairs.
- Returns:
- The string encoded in UTF-8.
- Throws:
java.lang.NullPointerException - The parameter str is null.
java.lang.IllegalArgumentException - The string contains an unpaired surrogate code point and replace is false, or an internal error occurred.
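The line-break rule described for lenientLineBreaks can be sketched on its own, before any encoding takes place. The following is an illustration of that rule only, not this method's implementation; the class name LenientLineBreaksSketch and the method toCrLf are hypothetical.

```java
final class LenientLineBreaksSketch {
  // Rewrites every bare CR and every bare LF as a CR-LF pair.
  // Existing CR-LF pairs are copied unchanged.
  static String toCrLf(String s) {
    StringBuilder sb = new StringBuilder(s.length() + 16);
    for (int i = 0; i < s.length(); ++i) {
      char c = s.charAt(i);
      if (c == '\r') {
        sb.append("\r\n");
        if (i + 1 < s.length() && s.charAt(i + 1) == '\n') {
          ++i; // already a CR-LF pair: consume the LF as well
        }
      } else if (c == '\n') {
        sb.append("\r\n"); // LF not preceded by CR
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(toCrLf("a\nb").equals("a\r\nb"));   // true
    System.out.println(toCrLf("a\rb").equals("a\r\nb"));   // true
    System.out.println(toCrLf("a\r\nb").equals("a\r\nb")); // true
  }
}
```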
-
GetUtf8Length
public static long GetUtf8Length(java.lang.String str, boolean replace)
Calculates the number of bytes needed to encode a string in UTF-8.
- Parameters:
str - A text string.
replace - If true, treats unpaired surrogate code points as having 3 UTF-8 bytes (the UTF-8 length of the replacement character U+FFFD).
- Returns:
- The number of bytes needed to encode the given string in UTF-8, or -1 if the string contains an unpaired surrogate code point and replace is false.
- Throws:
java.lang.NullPointerException - The parameter str is null.
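The length calculation follows directly from the UTF-8 byte ranges: 1 byte up to U+007F, 2 bytes up to U+07FF, 3 bytes for the rest of the Basic Multilingual Plane, and 4 bytes for a surrogate pair. The following is a minimal sketch of that arithmetic, not this method's implementation; the class name Utf8LengthSketch is hypothetical.

```java
final class Utf8LengthSketch {
  // Returns the UTF-8 byte length of str, or -1 if an unpaired surrogate
  // is found and replace is false.
  static long utf8Length(String str, boolean replace) {
    if (str == null) {
      throw new NullPointerException("str");
    }
    long size = 0;
    for (int i = 0; i < str.length(); ++i) {
      char c = str.charAt(i);
      if (c <= 0x7f) {
        size += 1;
      } else if (c <= 0x7ff) {
        size += 2;
      } else if (!Character.isSurrogate(c)) {
        size += 3;
      } else if (Character.isHighSurrogate(c)
          && i + 1 < str.length()
          && Character.isLowSurrogate(str.charAt(i + 1))) {
        size += 4; // surrogate pair: one supplementary code point
        ++i;
      } else if (replace) {
        size += 3; // counted as U+FFFD, which takes 3 bytes
      } else {
        return -1;
      }
    }
    return size;
  }

  public static void main(String[] args) {
    System.out.println(utf8Length("\u20ac", false));  // 3
    System.out.println(utf8Length("\uD83D", true));   // 3
    System.out.println(utf8Length("\uD83D", false));  // -1
  }
}
```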
-
CodePointBefore
public static int CodePointBefore(java.lang.String str, int index)
Gets the Unicode code point just before the given index of the string.
- Parameters:
str - A text string.
index - Index of the current position into the string.
- Returns:
- The Unicode code point at the previous position. Returns -1 if index is 0 or less, or is greater than the string's length. Returns the replacement character (U+FFFD) if the code point at the previous position is an unpaired surrogate code point. If the return value is 65536 (0x10000) or greater, the code point takes up two UTF-16 code units.
- Throws:
java.lang.NullPointerException - The parameter str is null.
-
CodePointBefore
public static int CodePointBefore(java.lang.String str, int index, int surrogateBehavior)
Gets the Unicode code point just before the given index of the string.
- Parameters:
str - A text string.
index - Index of the current position into the string.
surrogateBehavior - Specifies what kind of value to return if the previous code point is an unpaired surrogate code point: if 0, return the replacement character (U+FFFD); if 1, return the value of the surrogate code point; if neither 0 nor 1, return -1.
- Returns:
- The Unicode code point at the previous position. Returns -1 if index is 0 or less, or is greater than the string's length. Returns a value as specified under surrogateBehavior if the code point at the previous position is an unpaired surrogate code point. If the return value is 65536 (0x10000) or greater, the code point takes up two UTF-16 code units.
- Throws:
java.lang.NullPointerException - The parameter str is null.
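The return-value rules above are enough to walk a string backward, one code point at a time. The following sketch reimplements those rules with the standard library so that it runs on its own; it is not this method's implementation, and the class name CodePointBeforeSketch and the helper codePointsBackward are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointBeforeSketch {
  // Mirrors the documented contract: -1 for an out-of-range index, and
  // surrogateBehavior 0 -> U+FFFD, 1 -> the surrogate itself, other -> -1.
  static int codePointBefore(String str, int index, int surrogateBehavior) {
    if (str == null) {
      throw new NullPointerException("str");
    }
    if (index <= 0 || index > str.length()) {
      return -1;
    }
    char c = str.charAt(index - 1);
    if (Character.isLowSurrogate(c) && index >= 2
        && Character.isHighSurrogate(str.charAt(index - 2))) {
      return Character.toCodePoint(str.charAt(index - 2), c);
    }
    if (Character.isSurrogate(c)) {
      return surrogateBehavior == 0 ? 0xFFFD : (surrogateBehavior == 1 ? c : -1);
    }
    return c;
  }

  // Collects the code points of str from last to first.
  static List<Integer> codePointsBackward(String str) {
    List<Integer> result = new ArrayList<>();
    int i = str.length();
    while (i > 0) {
      int cp = codePointBefore(str, i, 0);
      result.add(cp);
      i -= (cp >= 0x10000) ? 2 : 1; // supplementary code points span two chars
    }
    return result;
  }

  public static void main(String[] args) {
    // 'a' followed by U+1F600, listed from last code point to first
    System.out.println(codePointsBackward("a\uD83D\uDE00"));
  }
}
```

Stepping back by two chars whenever the result is 0x10000 or greater is the backward counterpart of the "takes up two UTF-16 code units" note in the return description.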
-
CodePointAt
public static int CodePointAt(java.lang.String str, int index)
Gets the Unicode code point at the given index of the string.
- Parameters:
str - A text string.
index - Index of the current position into the string.
- Returns:
- The Unicode code point at the given position. Returns -1 if index is less than 0, or is the string's length or greater. Returns the replacement character (U+FFFD) if the code point at that position is an unpaired surrogate code point. If the return value is 65536 (0x10000) or greater, the code point takes up two UTF-16 code units.
- Throws:
java.lang.NullPointerException - The parameter str is null.
-
CodePointAt
public static int CodePointAt(java.lang.String str, int index, int surrogateBehavior)
Gets the Unicode code point at the given index of the string.
The following example shows how to iterate a text string code point by code point, terminating the loop when an unpaired surrogate is found.
    for (int i = 0; i < str.length(); ++i) {
      int codePoint = DataUtilities.CodePointAt(str, i, 2);
      if (codePoint < 0) {
        break; /* Unpaired surrogate */
      }
      System.out.println("codePoint:" + codePoint);
      if (codePoint >= 0x10000) {
        i++; /* Supplementary code point */
      }
    }
- Parameters:
str - A text string.
index - Index of the current position into the string.
surrogateBehavior - Specifies what kind of value to return if the code point at the given index is an unpaired surrogate code point: if 0, return the replacement character (U+FFFD); if 1, return the value of the surrogate code point; if neither 0 nor 1, return -1.
- Returns:
- The Unicode code point at the given position. Returns -1 if index is less than 0, or is the string's length or greater. Returns a value as specified under surrogateBehavior if the code point at that position is an unpaired surrogate code point. If the return value is 65536 (0x10000) or greater, the code point takes up two UTF-16 code units.
- Throws:
java.lang.NullPointerException - The parameter str is null.
-
ToLowerCaseAscii
public static java.lang.String ToLowerCaseAscii(java.lang.String str)
Returns a string with the basic upper-case letters A to Z (U+0041 to U+005A) converted to the corresponding basic lower-case letters. Other characters remain unchanged.
- Parameters:
str - A text string.
- Returns:
- The converted string, or null if str is null.
-
ToUpperCaseAscii
public static java.lang.String ToUpperCaseAscii(java.lang.String str)
Returns a string with the basic lower-case letters a to z (U+0061 to U+007A) converted to the corresponding basic upper-case letters. Other characters remain unchanged.
- Parameters:
str - A text string.
- Returns:
- The converted string, or null if str is null.
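ASCII-only case conversion, as described for ToLowerCaseAscii and ToUpperCaseAscii, touches only the 26 basic letters and leaves every other char alone. The following is a minimal sketch of that rule, not the implementation of either method; the class name AsciiCaseSketch is hypothetical.

```java
final class AsciiCaseSketch {
  // Converts A-Z to a-z; every other char is copied unchanged.
  static String toLowerCaseAscii(String str) {
    if (str == null) {
      return null;
    }
    char[] chars = str.toCharArray();
    for (int i = 0; i < chars.length; ++i) {
      if (chars[i] >= 'A' && chars[i] <= 'Z') {
        chars[i] += 0x20; // distance between 'A' and 'a'
      }
    }
    return new String(chars);
  }

  // Converts a-z to A-Z; every other char is copied unchanged.
  static String toUpperCaseAscii(String str) {
    if (str == null) {
      return null;
    }
    char[] chars = str.toCharArray();
    for (int i = 0; i < chars.length; ++i) {
      if (chars[i] >= 'a' && chars[i] <= 'z') {
        chars[i] -= 0x20;
      }
    }
    return new String(chars);
  }

  public static void main(String[] args) {
    System.out.println(toLowerCaseAscii("Content-TYPE")); // content-type
    System.out.println(toUpperCaseAscii("stra\u00dfe"));  // sharp s is left as is
  }
}
```

Unlike the locale-sensitive String.toLowerCase() and String.toUpperCase(), a conversion of this kind gives the same result under every default locale, which is useful for protocol keywords and header names.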
-
CodePointCompare
public static int CodePointCompare(java.lang.String strA, java.lang.String strB)
Compares two strings in Unicode code point order. Unpaired surrogate code points are treated as individual code points.
- Parameters:
strA - The first string. Can be null.
strB - The second string. Can be null.
- Returns:
- A value indicating which string is "less" or "greater".
0 - Both strings are equal, or both are null.
Less than 0 - strA is null and strB isn't; or the first code point that's different is less in strA than in strB; or strB starts with strA and is longer than strA.
Greater than 0 - strB is null and strA isn't; or the first code point that's different is greater in strA than in strB; or strA starts with strB and is longer than strB.
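Code point order differs from the char-by-char order used by String.compareTo whenever a supplementary character is compared with a character in the range U+E000 to U+FFFF. The following sketch illustrates the ordering described above with the standard library; it is not this method's implementation, and the class name CodePointCompareSketch is hypothetical.

```java
final class CodePointCompareSketch {
  // Compares two strings code point by code point; null sorts before non-null.
  // String.codePointAt returns an unpaired surrogate as its own value,
  // so unpaired surrogates are compared as individual code points.
  static int codePointCompare(String a, String b) {
    if (a == null) {
      return b == null ? 0 : -1;
    }
    if (b == null) {
      return 1;
    }
    int i = 0;
    while (i < a.length() && i < b.length()) {
      int ca = a.codePointAt(i);
      int cb = b.codePointAt(i);
      if (ca != cb) {
        return ca < cb ? -1 : 1;
      }
      i += Character.charCount(ca); // equal code points occupy equal char counts
    }
    if (a.length() == b.length()) {
      return 0;
    }
    return a.length() < b.length() ? -1 : 1; // the shorter string is a prefix
  }

  public static void main(String[] args) {
    String bmp = "\uFF5E";          // U+FF5E
    String supp = "\uD83D\uDE00";   // U+1F600
    System.out.println(codePointCompare(bmp, supp) < 0); // true: U+FF5E < U+1F600
    System.out.println(bmp.compareTo(supp) > 0);         // true: 0xFF5E > 0xD83D
  }
}
```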
-
WriteUtf8
public static int WriteUtf8(java.lang.String str, int offset, int length, java.io.OutputStream stream, boolean replace) throws java.io.IOException
Writes a portion of a string in UTF-8 encoding to a data stream.
- Parameters:
str - A string to write.
offset - The zero-based index where the string portion to write begins.
length - The length of the string portion to write.
stream - A writable data stream.
replace - If true, replaces unpaired surrogate code points with the replacement character (U+FFFD). If false, stops processing when an unpaired surrogate code point is seen.
- Returns:
- 0 if the entire string portion was written; or -1 if the string portion contains an unpaired surrogate code point and replace is false.
- Throws:
java.lang.NullPointerException - The parameter str is null or stream is null.
java.io.IOException - An I/O error occurred.
java.lang.IllegalArgumentException - Either offset or length is less than 0 or greater than str's length, or str's length minus offset is less than length.
-
WriteUtf8
public static int WriteUtf8(java.lang.String str, int offset, int length, java.io.OutputStream stream, boolean replace, boolean lenientLineBreaks) throws java.io.IOException
Writes a portion of a string in UTF-8 encoding to a data stream.
- Parameters:
str - A string to write.
offset - The zero-based index where the string portion to write begins.
length - The length of the string portion to write.
stream - A writable data stream.
replace - If true, replaces unpaired surrogate code points with the replacement character (U+FFFD). If false, stops processing when an unpaired surrogate code point is seen.
lenientLineBreaks - If true, replaces carriage return (CR) not followed by line feed (LF) and LF not preceded by CR with CR-LF pairs.
- Returns:
- 0 if the entire string portion was written; or -1 if the string portion contains an unpaired surrogate code point and replace is false.
- Throws:
java.lang.NullPointerException - The parameter str is null or stream is null.
java.lang.IllegalArgumentException - The parameter offset is less than 0, length is less than 0, or offset plus length is greater than the string's length.
java.io.IOException - An I/O error occurred.
-
WriteUtf8
public static int WriteUtf8(java.lang.String str, java.io.OutputStream stream, boolean replace) throws java.io.IOException
Writes a string in UTF-8 encoding to a data stream.
- Parameters:
str - A string to write.
stream - A writable data stream.
replace - If true, replaces unpaired surrogate code points with the replacement character (U+FFFD). If false, stops processing when an unpaired surrogate code point is seen.
- Returns:
- 0 if the entire string was written; or -1 if the string contains an unpaired surrogate code point and replace is false.
- Throws:
java.lang.NullPointerException - The parameter str is null or stream is null.
java.io.IOException - An I/O error occurred.
-
ReadUtf8FromBytes
public static int ReadUtf8FromBytes(byte[] data, int offset, int bytesCount, java.lang.StringBuilder builder, boolean replace)
Reads a string in UTF-8 encoding from a byte array.
- Parameters:
data - A byte array containing a UTF-8 text string.
offset - Offset into the byte array to start reading.
bytesCount - Length, in bytes, of the UTF-8 text string.
builder - A string builder object where the resulting string will be stored.
replace - If true, replaces invalid encoding with the replacement character (U+FFFD). If false, stops processing when invalid UTF-8 is seen.
- Returns:
- 0 if the entire string was read without errors, or -1 if the string is not valid UTF-8 and replace is false.
- Throws:
java.lang.NullPointerException - The parameter data is null or builder is null.
java.lang.IllegalArgumentException - The parameter offset is less than 0, bytesCount is less than 0, or offset plus bytesCount is greater than the length of data.
-
ReadUtf8ToString
public static java.lang.String ReadUtf8ToString(java.io.InputStream stream) throws java.io.IOException
Reads a string in UTF-8 encoding from a data stream in full and returns that string. Replaces invalid encoding with the replacement character (U+FFFD).
- Parameters:
stream - A readable data stream.
- Returns:
- The string read.
- Throws:
java.io.IOException - An I/O error occurred.
java.lang.NullPointerException - The parameter stream is null.
-
ReadUtf8ToString
public static java.lang.String ReadUtf8ToString(java.io.InputStream stream, int bytesCount, boolean replace) throws java.io.IOException
Reads a string in UTF-8 encoding from a data stream and returns that string.
- Parameters:
stream - A readable data stream.
bytesCount - The length, in bytes, of the string. If this is less than 0, this function will read until the end of the stream.
replace - If true, replaces invalid encoding with the replacement character (U+FFFD). If false, throws an error if invalid UTF-8 is seen.
- Returns:
- The string read.
- Throws:
java.io.IOException - An I/O error occurred; or, the string is not valid UTF-8 and replace is false.
java.lang.NullPointerException - The parameter stream is null.
-
ReadUtf8
public static int ReadUtf8(java.io.InputStream stream, int bytesCount, java.lang.StringBuilder builder, boolean replace) throws java.io.IOException
Reads a string in UTF-8 encoding from a data stream.
- Parameters:
stream - A readable data stream.
bytesCount - The length, in bytes, of the string. If this is less than 0, this function will read until the end of the stream.
builder - A string builder object where the resulting string will be stored.
replace - If true, replaces invalid encoding with the replacement character (U+FFFD). If false, stops processing when invalid UTF-8 is seen.
- Returns:
- 0 if the entire string was read without errors, -1 if the string is not valid UTF-8 and replace is false, or -2 if the end of the stream was reached before the last character was read completely (which is only the case if bytesCount is 0 or greater).
- Throws:
java.io.IOException - An I/O error occurred.
java.lang.NullPointerException - The parameter stream is null or builder is null.
-
-