Class UCSReader
- java.lang.Object
  - java.io.Reader
    - org.hortonmachine.gears.utils.style.sld.UCSReader
- All Implemented Interfaces:
Closeable, AutoCloseable, Readable
public class UCSReader extends Reader
Reader for the UCS-2 and UCS-4 encodings (more precisely, the ISO-10646-UCS-(2|4) encodings). This variant is modified to handle supplementary Unicode code points correctly, though that required a lot of new code and definitely reduced performance compared to the original version. I tried my best to preserve the existing code and comments wherever possible. I performed some basic tests, but not very thorough ones, so some bugs may still lurk in the code. -AK
- Version:
- $Id$
- Author:
- Neil Graham, IBM
-
Field Summary
Fields (Modifier and Type, Field, Description):
- static int CHAR_BUFFER_INITIAL_SIZE: Starting size of the internal char buffer.
- static int DEFAULT_BUFFER_SIZE: Default byte buffer size (8192, larger than that of ASCIIReader since it's reasonable to surmise that the average UCS-4-encoded file should be 4 times as large as the average ASCII-encoded file).
- protected byte[] fBuffer: Byte buffer.
- protected char[] fCharBuf: Stores previously read, "excess" characters that may appear during read method invocations because one input UCS-4 supplementary character results in two output Java chars: the high surrogate and low surrogate code units.
- protected int fCharCount: Count of Java chars currently stored in the fCharBuf array.
- protected short fEncoding: The kind of data being read.
- protected InputStream fInputStream: Input stream.
- static int MAX_CODE_POINT: The maximum value of a Unicode code point.
- static int MIN_CODE_POINT: The minimum value of a Unicode code point.
- static int MIN_SUPPLEMENTARY_CODE_POINT: The minimum value of a supplementary code point.
- static short UCS2BE
- static short UCS2LE
- static short UCS4BE
- static short UCS4LE
-
Constructor Summary
Constructors (Constructor, Description):
- UCSReader(InputStream inputStream, int size, short encoding): Constructs an ISO-10646-UCS-(2|4) reader from the source input stream using an explicitly specified initial buffer size.
- UCSReader(InputStream inputStream, short encoding): Constructs an ISO-10646-UCS-(2|4) reader from the specified input stream using the default buffer size.
-
Method Summary
Methods (Modifier and Type, Method, Description):
- void close(): Close the stream.
- String getByteOrder(): Returns byte order ("endianness") of the encoding currently in use by this character stream.
- String getEncoding(): Returns the encoding currently in use by this character stream.
- protected boolean isSupplementaryCodePoint(int codePoint): Determines whether the specified character (Unicode code point) is in the supplementary character range.
- void mark(int readAheadLimit): Mark the present position in the stream.
- boolean markSupported(): Tell whether this stream supports the mark() operation.
- int read(): Read a single character.
- int read(char[] ch, int offset, int length): Read characters into a portion of an array.
- protected int readUCS2(char[] ch, int offset, int length): Read UCS-2 characters into a portion of an array.
- boolean ready(): Tell whether this stream is ready to be read.
- void reset(): Reset the stream.
- long skip(long n): Skip characters.
Methods inherited from class java.io.Reader
nullReader, read, read, transferTo
-
Field Detail
-
DEFAULT_BUFFER_SIZE
public static final int DEFAULT_BUFFER_SIZE
Default byte buffer size (8192, larger than that of ASCIIReader since it's reasonable to surmise that the average UCS-4-encoded file should be 4 times as large as the average ASCII-encoded file).
- See Also:
- Constant Field Values
-
CHAR_BUFFER_INITIAL_SIZE
public static final int CHAR_BUFFER_INITIAL_SIZE
Starting size of the internal char buffer. The internal char buffer is maintained to hold excess chars that may be left over from a previous read operation when working with UCS-4 data (it is never used for UCS-2).
- See Also:
- Constant Field Values
-
UCS2LE
public static final short UCS2LE
- See Also:
- Constant Field Values
-
UCS2BE
public static final short UCS2BE
- See Also:
- Constant Field Values
-
UCS4LE
public static final short UCS4LE
- See Also:
- Constant Field Values
-
UCS4BE
public static final short UCS4BE
- See Also:
- Constant Field Values
-
MIN_SUPPLEMENTARY_CODE_POINT
public static final int MIN_SUPPLEMENTARY_CODE_POINT
The minimum value of a supplementary code point.
- See Also:
- Constant Field Values
-
MIN_CODE_POINT
public static final int MIN_CODE_POINT
The minimum value of a Unicode code point.
- See Also:
- Constant Field Values
-
MAX_CODE_POINT
public static final int MAX_CODE_POINT
The maximum value of a Unicode code point.
- See Also:
- Constant Field Values
-
fInputStream
protected InputStream fInputStream
Input stream.
-
fBuffer
protected byte[] fBuffer
Byte buffer.
-
fEncoding
protected short fEncoding
The kind of data being read: one of UCS2LE, UCS2BE, UCS4LE or UCS4BE.
-
fCharBuf
protected char[] fCharBuf
Stores previously read, "excess" characters that may appear during read method invocations because one input UCS-4 supplementary character results in two output Java chars: the high surrogate and low surrogate code units. Because of that, if the read() method encounters a supplementary code point in the input stream, it returns the UTF-16-encoded high surrogate code unit and stores the low surrogate in the buffer. When called next time, read() will return this low surrogate instead of reading more bytes from the InputStream.
Similarly, if read(char[], int, int) is invoked to read, for example, 10 chars into the specified buffer, and 4 of them turn out to be supplementary Unicode characters, each written as two chars, then we end up with 4 excess chars that we cannot immediately return or push back to the input stream. So we need to store them in the buffer to await further read invocations.
Note that the char buffer functions like a stack, i.e. chars and surrogate pairs are stored in reverse order.
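The stack-like behaviour described above can be sketched as follows. This is a minimal illustration, not the actual UCSReader code: the class ExcessCharStackSketch and its methods push, pop and storeExcess are hypothetical names, and the starting capacity of 2 is an arbitrary assumption.

```java
// Hypothetical illustration of a stack-like excess char buffer; not part of UCSReader.
class ExcessCharStackSketch {
    // Small starting capacity (assumed value) so that the growth path is exercised.
    private char[] buf = new char[2];
    private int count = 0;

    // Pushes one char on top of the stack, doubling the array when it is full.
    void push(char c) {
        if (count == buf.length) {
            buf = java.util.Arrays.copyOf(buf, buf.length * 2);
        }
        buf[count++] = c;
    }

    // Pops the most recently pushed char, or returns -1 when nothing is stored.
    int pop() {
        if (count == 0) {
            return -1;
        }
        return buf[--count];
    }

    // Stores excess chars in reverse order, so that later pop() calls
    // hand them back in their original reading order.
    void storeExcess(char[] excess) {
        for (int i = excess.length - 1; i >= 0; i--) {
            push(excess[i]);
        }
    }
}
```

Storing the chars back to front means the next char to return is always on top, so a later read only needs to decrement a counter.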
-
fCharCount
protected int fCharCount
Count of Java chars currently stored in the fCharBuf array.
-
Constructor Detail
-
UCSReader
public UCSReader(InputStream inputStream, short encoding)
Constructs an ISO-10646-UCS-(2|4) reader from the specified input stream using the default buffer size. The endianness and exact input encoding (UCS-2 or UCS-4) should also be known in advance.
- Parameters:
inputStream - input stream with UCS-2|4 encoded data
encoding - One of UCS2LE, UCS2BE, UCS4LE or UCS4BE.
-
UCSReader
public UCSReader(InputStream inputStream, int size, short encoding)
Constructs an ISO-10646-UCS-(2|4) reader from the source input stream using an explicitly specified initial buffer size. The endianness and exact input encoding (UCS-2 or UCS-4) should also be known in advance.
- Parameters:
inputStream - input stream with UCS-2|4 encoded data
size - The initial buffer size. Make sure this number is divisible by 4 if you plan to read UCS-4 with this class.
encoding - One of UCS2LE, UCS2BE, UCS4LE or UCS4BE.
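Why the byte order has to be known in advance can be seen by assembling a single UCS-4 code unit by hand. The following is a minimal sketch under stated assumptions, not the class's actual code: Ucs4ByteOrderSketch and decode are hypothetical names, and a boolean flag stands in for the UCS4BE/UCS4LE constants.

```java
// Hypothetical illustration of UCS-4 byte order handling; not part of UCSReader.
class Ucs4ByteOrderSketch {
    // Assembles one 32-bit code unit from four bytes starting at pos.
    // bigEndian stands in for the choice between UCS4BE and UCS4LE.
    static int decode(byte[] bytes, int pos, boolean bigEndian) {
        // Mask each byte so that it is treated as unsigned.
        int b0 = bytes[pos] & 0xFF;
        int b1 = bytes[pos + 1] & 0xFF;
        int b2 = bytes[pos + 2] & 0xFF;
        int b3 = bytes[pos + 3] & 0xFF;
        if (bigEndian) {
            // Most significant byte comes first.
            return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3;
        }
        // Least significant byte comes first.
        return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
    }
}
```

The bytes 00 00 00 41 decode to U+0041 when read as big-endian, but to 0x41000000 when read as little-endian, which lies above the maximum Unicode code point 0x10FFFF.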
-
Method Detail
-
read
public int read() throws IOException
Read a single character. This method will block until a character is available, an I/O error occurs, or the end of the stream is reached.
If a supplementary Unicode character is encountered in UCS-4 input, it will be encoded into a UTF-16 surrogate pair according to RFC 2781. The high surrogate code unit will be returned immediately, and the low surrogate saved in the internal buffer to be read during the next read() or read(char[], int, int) invocation. -AK
- Overrides:
read in class Reader
- Returns:
- Java 16-bit char value containing a UTF-16 code unit, which may be either a code point from the Basic Multilingual Plane or one of the surrogate code units (high or low) of the pair representing a supplementary Unicode character (one in the 0x10000 - 0x10FFFF range) -AK
- Throws:
IOException - when an I/O error occurs
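The surrogate pair encoding mentioned above follows the arithmetic given in RFC 2781. A minimal sketch is shown below; SurrogateEncodingSketch and toUtf16 are hypothetical names used only for illustration, and the actual UCSReader code may be organised differently.

```java
// Hypothetical illustration of the RFC 2781 encoding step; not part of UCSReader.
class SurrogateEncodingSketch {
    // Converts one Unicode code point into one or two UTF-16 code units.
    static char[] toUtf16(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        if (codePoint < 0x10000) {
            // Basic Multilingual Plane: a single code unit is enough.
            return new char[] {(char) codePoint};
        }
        // Supplementary range: subtract 0x10000 to get a 20-bit value, then
        // split it into the upper and lower 10 bits.
        int offset = codePoint - 0x10000;
        char high = (char) (0xD800 | (offset >> 10));
        char low = (char) (0xDC00 | (offset & 0x3FF));
        return new char[] {high, low};
    }
}
```

On Java 5 and later, java.lang.Character.toChars(int) performs the same conversion for valid code points.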
-
read
public int read(char[] ch, int offset, int length) throws IOException
Read characters into a portion of an array. This method will block until some input is available, an I/O error occurs, or the end of the stream is reached.
I suspect that all of this works awfully slowly, so if you know for sure that your UCS-4 input does not contain any supplementary code points, you should probably use the original UCSReader class from the Xerces team (org.apache.xerces.impl.io.UCSReader). -AK
- Specified by:
read in class Reader
- Parameters:
ch - Destination buffer
offset - Offset at which to start storing characters
length - Maximum number of characters to read
- Returns:
- The number of characters read, or -1 if the end of the stream has been reached. Note that this is not the number of UCS-4 characters read, but the number of UTF-16 code units. The two are equal only if there were no supplementary Unicode code points among the chars read.
- Throws:
IOException - If an I/O error occurs
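The difference between the number of UCS-4 characters and the number of UTF-16 code units can be illustrated with a small counting sketch. It assumes big-endian UCS-4 input held in a byte array; Ucs4CountSketch and count are hypothetical names, and nothing here is taken from the UCSReader implementation.

```java
// Hypothetical illustration of code point versus code unit counts; not part of UCSReader.
class Ucs4CountSketch {
    // Returns {number of UCS-4 characters, number of UTF-16 code units}
    // for big-endian UCS-4 data. Trailing bytes that do not form a full
    // 4-byte unit are ignored in this sketch.
    static int[] count(byte[] ucs4be) {
        int codePoints = 0;
        int codeUnits = 0;
        for (int i = 0; i + 3 < ucs4be.length; i += 4) {
            int cp = ((ucs4be[i] & 0xFF) << 24)
                    | ((ucs4be[i + 1] & 0xFF) << 16)
                    | ((ucs4be[i + 2] & 0xFF) << 8)
                    | (ucs4be[i + 3] & 0xFF);
            codePoints++;
            // Character.charCount returns 2 for values of 0x10000 and above, else 1.
            codeUnits += Character.charCount(cp);
        }
        return new int[] {codePoints, codeUnits};
    }
}
```

For input consisting of U+0041 followed by U+1F600, the sketch counts 2 UCS-4 characters but 3 UTF-16 code units, and it is the latter kind of count that read(char[], int, int) is documented to return.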
-
readUCS2
protected int readUCS2(char[] ch, int offset, int length) throws IOException
Read UCS-2 characters into a portion of an array. This method will block until some input is available, an I/O error occurs, or the end of the stream is reached.
In the original UCSReader this code was part of the read(char[], int, int) method, but I removed it from there to reduce the complexity of the latter.
- Parameters:
ch - destination buffer
offset - offset at which to start storing characters
length - maximum number of characters to read
- Returns:
- The number of characters read, or -1 if the end of the stream has been reached
- Throws:
IOException - If an I/O error occurs
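Decoding UCS-2 is simpler than decoding UCS-4 because every 2-byte unit maps to exactly one Java char. A rough sketch of that conversion is given below, assuming the input is already in a byte array; Ucs2DecodeSketch and decode are hypothetical names, and the real readUCS2 method additionally handles buffering and blocking reads from the stream.

```java
// Hypothetical illustration of UCS-2 decoding; not part of UCSReader.
class Ucs2DecodeSketch {
    // Decodes UCS-2 bytes into out, starting at offset, and returns the
    // number of chars written. A trailing odd byte is ignored in this sketch.
    static int decode(byte[] in, boolean bigEndian, char[] out, int offset) {
        int n = in.length / 2;
        for (int i = 0; i < n; i++) {
            int b0 = in[2 * i] & 0xFF;
            int b1 = in[2 * i + 1] & 0xFF;
            // Each 2-byte unit becomes exactly one Java char.
            out[offset + i] = (char) (bigEndian ? (b0 << 8) | b1 : (b1 << 8) | b0);
        }
        return n;
    }
}
```

For example, the big-endian bytes 00 41 04 16 and the little-endian bytes 41 00 16 04 both decode to the two chars U+0041 and U+0416.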
-
skip
public long skip(long n) throws IOException
Skip characters. This method will block until some characters are available, an I/O error occurs, or the end of the stream is reached.
- Overrides:
skip in class Reader
- Parameters:
n - The number of characters to skip
- Returns:
- The number of characters actually skipped
- Throws:
IOException - If an I/O error occurs
-
ready
public boolean ready() throws IOException
Tell whether this stream is ready to be read.
- Overrides:
ready in class Reader
- Returns:
- True if the next read() is guaranteed not to block for input, false otherwise. Note that returning false does not guarantee that the next read will block.
- Throws:
IOException - If an I/O error occurs
-
markSupported
public boolean markSupported()
Tell whether this stream supports the mark() operation.
- Overrides:
markSupported in class Reader
-
mark
public void mark(int readAheadLimit) throws IOException
Mark the present position in the stream. Subsequent calls to reset will attempt to reposition the stream to this point. Not all character-input streams support the mark operation. This is one of them :) It relies on the marking facilities of the underlying byte stream.
- Overrides:
mark in class Reader
- Parameters:
readAheadLimit - Limit on the number of characters that may be read while still preserving the mark. After reading this many characters, attempting to reset the stream may fail.
- Throws:
IOException - If the stream does not support mark, or if some other I/O error occurs
-
reset
public void reset() throws IOException
Reset the stream. If the stream has been marked, then attempt to reposition it at the mark. If the stream has not been marked, then attempt to reset it in some way appropriate to the particular stream, for example by repositioning it to its starting point. This stream implementation does not support mark/reset by itself; it relies on the underlying byte stream in this matter.
- Overrides:
reset in class Reader
- Throws:
IOException - If the stream has not been marked, or if the mark has been invalidated, or if the stream does not support reset(), or if some other I/O error occurs
-
close
public void close() throws IOException
Close the stream. Once a stream has been closed, further read, ready, mark, or reset invocations will throw an IOException. Closing a previously-closed stream, however, has no effect.
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Specified by:
close in class Reader
- Throws:
IOException - If an I/O error occurs
-
getEncoding
public String getEncoding()
Returns the encoding currently in use by this character stream.
- Returns:
- Encoding of this stream: either ISO-10646-UCS-2 or ISO-10646-UCS-4. The problem is that this string doesn't indicate the byte order of that encoding. What to do, then? Unlike with UTF-16, the byte order cannot be made part of the encoding name in this case, yet it can still be critical. Currently you can find out the byte order by invoking the getByteOrder method.
-
getByteOrder
public String getByteOrder()
Returns byte order ("endianness") of the encoding currently in use by this character stream. This is a string with two possible values: LITTLE_ENDIAN and BIG_ENDIAN. Maybe using a named constant is a better alternative, but I just don't like them. But feel free to change this behavior if you think that would be better.
- Returns:
LITTLE_ENDIAN or BIG_ENDIAN depending on the byte order of the current encoding of this stream.
-
isSupplementaryCodePoint
protected boolean isSupplementaryCodePoint(int codePoint)
Determines whether the specified character (Unicode code point) is in the supplementary character range. The method call is equivalent to the expression:
codePoint >= 0x10000 && codePoint <= 0x10ffff
Stolen from the JDK 1.5 java.lang.Character class in order to provide JDK 1.4 compatibility.
- Parameters:
codePoint - the character (Unicode code point) to be tested
- Returns:
true if the specified character is in the Unicode supplementary character range; false otherwise.
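The documented expression can be checked against the JDK's own method on a Java 5 or later runtime. The sketch below is only an illustration: SupplementaryCheckSketch and isSupplementary are hypothetical names, and the body simply restates the expression from the description above.

```java
// Hypothetical illustration of the range check; not part of UCSReader.
class SupplementaryCheckSketch {
    // Same expression as in the method description above.
    static boolean isSupplementary(int codePoint) {
        return codePoint >= 0x10000 && codePoint <= 0x10ffff;
    }

    // Returns true if the expression agrees with
    // java.lang.Character.isSupplementaryCodePoint for the given value.
    static boolean agreesWithJdk(int codePoint) {
        return isSupplementary(codePoint) == Character.isSupplementaryCodePoint(codePoint);
    }
}
```

Values worth checking are the boundaries of the range: 0xFFFF and 0x110000 lie just outside it, while 0x10000 and 0x10FFFF lie just inside.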