Class UCSReader

  • All Implemented Interfaces:
    Closeable, AutoCloseable, Readable

    public class UCSReader
    extends Reader
    Reader for UCS-2 and UCS-4 encodings (more precisely, the ISO-10646-UCS-(2|4) encodings).

    This variant is modified to handle supplementary Unicode code points correctly. That required a lot of new code and definitely reduced performance compared to the original version. I tried my best to preserve the existing code and comments wherever possible. I ran some basic tests, but not very thorough ones, so some bugs may still lurk in the code. -AK

    Version:
    $Id$
    Author:
    Neil Graham, IBM
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static int CHAR_BUFFER_INITIAL_SIZE
      Starting size of the internal char buffer.
      static int DEFAULT_BUFFER_SIZE
      Default byte buffer size (8192, larger than that of ASCIIReader since it's reasonable to surmise that the average UCS-4-encoded file should be 4 times as large as the average ASCII-encoded file).
      protected byte[] fBuffer
      Byte buffer.
      protected char[] fCharBuf
      Stores read-ahead or "excess" characters that may appear during a read call, because one input UCS-4 supplementary character results in two output Java chars: a high surrogate and a low surrogate code unit.
      protected int fCharCount
      Count of Java chars currently stored in the fCharBuf array.
      protected short fEncoding
      The kind of data this reader is dealing with.
      protected InputStream fInputStream
      Input stream.
      static int MAX_CODE_POINT
      The maximum value of a Unicode code point.
      static int MIN_CODE_POINT
      The minimum value of a Unicode code point.
      static int MIN_SUPPLEMENTARY_CODE_POINT
      The minimum value of a supplementary code point.
      static short UCS2BE  
      static short UCS2LE  
      static short UCS4BE  
      static short UCS4LE  
    • Constructor Summary

      Constructors 
      Constructor Description
      UCSReader​(InputStream inputStream, int size, short encoding)
      Constructs an ISO-10646-UCS-(2|4) reader from the source input stream using explicitly specified initial buffer size.
      UCSReader​(InputStream inputStream, short encoding)
      Constructs an ISO-10646-UCS-(2|4) reader from the specified input stream using default buffer size.
    • Field Detail

      • DEFAULT_BUFFER_SIZE

        public static final int DEFAULT_BUFFER_SIZE
        Default byte buffer size (8192, larger than that of ASCIIReader since it's reasonable to surmise that the average UCS-4-encoded file should be 4 times as large as the average ASCII-encoded file).
        See Also:
        Constant Field Values
      • CHAR_BUFFER_INITIAL_SIZE

        public static final int CHAR_BUFFER_INITIAL_SIZE
        Starting size of the internal char buffer. The internal char buffer holds excess chars that may be left over from a previous read operation when working with UCS-4 data (it is never used for UCS-2).
        See Also:
        Constant Field Values
      • MIN_SUPPLEMENTARY_CODE_POINT

        public static final int MIN_SUPPLEMENTARY_CODE_POINT
        The minimum value of a supplementary code point.
        See Also:
        Constant Field Values
      • MIN_CODE_POINT

        public static final int MIN_CODE_POINT
        The minimum value of a Unicode code point.
        See Also:
        Constant Field Values
      • MAX_CODE_POINT

        public static final int MAX_CODE_POINT
        The maximum value of a Unicode code point.
        See Also:
        Constant Field Values
      • fInputStream

        protected InputStream fInputStream
        Input stream.
      • fBuffer

        protected byte[] fBuffer
        Byte buffer.
      • fEncoding

        protected short fEncoding
        The kind of data this reader is dealing with.
      • fCharBuf

        protected char[] fCharBuf
        Stores read-ahead or "excess" characters that may appear during a read call, because one input UCS-4 supplementary character results in two output Java chars: a high surrogate and a low surrogate code unit. Because of that, if the read() method encounters a supplementary code point in the input stream, it returns the UTF-16-encoded high surrogate code unit and stores the low surrogate in the buffer. When called next time, read() will return this low surrogate instead of reading more bytes from the InputStream. Similarly, if read(char[], int, int) is invoked to read, for example, 10 chars into the specified buffer, and 4 of them turn out to be supplementary Unicode characters, each written as two chars, then we end up with 4 excess chars that we cannot immediately return or push back to the input stream. So we need to store them in the buffer to await further read invocations. Note that the char buffer functions like a stack, i.e. chars and surrogate pairs are stored in reverse order.
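
        The stack-like behavior described above could be sketched roughly as follows. This is an illustration only, not the class's actual code: the class name, charBuf, charCount, and all method names are hypothetical stand-ins, and it uses Character.highSurrogate and Character.lowSurrogate (Java 7 and later) for brevity, which the original JDK 1.4 compatible code could not have used.

```java
import java.util.Arrays;

/** Sketch only: charBuf and charCount are hypothetical stand-ins for fCharBuf and fCharCount. */
final class ExcessCharStackSketch {
    private char[] charBuf = new char[2];
    private int charCount = 0;

    /** Pushes one char on top of the stack, growing the array when it is full. */
    void push(char c) {
        if (charCount == charBuf.length) {
            charBuf = Arrays.copyOf(charBuf, charBuf.length * 2);
        }
        charBuf[charCount++] = c;
    }

    /** Tells whether any stored char is waiting to be returned. */
    boolean hasPending() {
        return charCount > 0;
    }

    /** Pops the char that should be returned next (caller checks hasPending() first). */
    char pop() {
        return charBuf[--charCount];
    }

    /**
     * Single-char read of one decoded code point: returns the first UTF-16
     * code unit and stashes the low surrogate, if any, for the next call.
     */
    char emit(int codePoint) {
        if (Character.isSupplementaryCodePoint(codePoint)) {
            push(Character.lowSurrogate(codePoint));
            return Character.highSurrogate(codePoint);
        }
        return (char) codePoint;
    }

    /** Stashes chars left over from a bulk read, last char first, so pop() yields them in order. */
    void stashExcess(char[] excess) {
        for (int i = excess.length - 1; i >= 0; i--) {
            push(excess[i]);
        }
    }
}
```

        Storing the excess chars in reverse order is what lets a plain pop from the top of the array return them in their original stream order.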
      • fCharCount

        protected int fCharCount
        Count of Java chars currently stored in the fCharBuf array.
    • Constructor Detail

      • UCSReader

        public UCSReader​(InputStream inputStream,
                         short encoding)
        Constructs an ISO-10646-UCS-(2|4) reader from the specified input stream using the default buffer size. The endianness and exact input encoding (UCS-2 or UCS-4) should also be known in advance.
        Parameters:
        inputStream - input stream with UCS-2|4 encoded data
        encoding - One of UCS2LE, UCS2BE, UCS4LE or UCS4BE.
      • UCSReader

        public UCSReader​(InputStream inputStream,
                         int size,
                         short encoding)
        Constructs an ISO-10646-UCS-(2|4) reader from the source input stream using an explicitly specified initial buffer size. The endianness and exact input encoding (UCS-2 or UCS-4) should also be known in advance.
        Parameters:
        inputStream - input stream with UCS-2|4 encoded data
        size - The initial buffer size. Make sure this number is divisible by 4 if you plan to read UCS-4 with this class.
        encoding - One of UCS2LE, UCS2BE, UCS4LE or UCS4BE
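
        Since the size should be divisible by 4 for UCS-4 input, a caller might round an arbitrary size up before passing it in. A minimal sketch, where BufferSizeSketch and alignToUcs4 are hypothetical names that are not part of this class:

```java
/** Sketch only: a hypothetical helper, not part of UCSReader. */
final class BufferSizeSketch {
    /**
     * Rounds a positive size up to the next multiple of 4, i.e. a whole
     * number of 4-byte UCS-4 units. Assumes size <= Integer.MAX_VALUE - 3.
     */
    static int alignToUcs4(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        return (size + 3) & ~3;
    }
}
```

        The result could then be passed as the size argument, for example new UCSReader(in, BufferSizeSketch.alignToUcs4(10000), UCSReader.UCS4BE).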
    • Method Detail

      • read

        public int read()
                 throws IOException
        Read a single character. This method will block until a character is available, an I/O error occurs, or the end of the stream is reached.

        If a supplementary Unicode character is encountered in UCS-4 input, it will be encoded as a UTF-16 surrogate pair according to RFC 2781. The high surrogate code unit will be returned immediately, and the low surrogate saved in the internal buffer to be read during the next read() or read(char[], int, int) invocation. -AK

        Overrides:
        read in class Reader
        Returns:
        A Java 16-bit char value containing a UTF-16 code unit, which may be either a code point from the Basic Multilingual Plane or one of the surrogate code units (high or low) of the pair representing a supplementary Unicode character (one in the 0x10000 - 0x10FFFF range) -AK
        Throws:
        IOException - when I/O error occurs
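
        For reference, the RFC 2781 encoding of a supplementary code point into a surrogate pair can be sketched as below. This is a standalone illustration of the arithmetic, not the code used by this class; Utf16EncodeSketch and toSurrogates are hypothetical names.

```java
/** Sketch only: shows the RFC 2781 surrogate arithmetic; not taken from UCSReader. */
final class Utf16EncodeSketch {
    /** Returns {high surrogate, low surrogate} for a code point in 0x10000..0x10FFFF. */
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point: " + codePoint);
        }
        int u = codePoint - 0x10000;                  // 20-bit value
        char high = (char) (0xD800 | (u >>> 10));    // top 10 bits
        char low = (char) (0xDC00 | (u & 0x3FF));    // bottom 10 bits
        return new char[] {high, low};
    }
}
```

        For U+1F600 this yields the pair 0xD83D, 0xDE00, of which read() would return the first element immediately and the second on the following call.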
      • read

        public int read​(char[] ch,
                        int offset,
                        int length)
                 throws IOException
        Read characters into a portion of an array. This method will block until some input is available, an I/O error occurs, or the end of the stream is reached.

        I suspect that the whole thing works awfully slowly, so if you know for sure that your UCS-4 input does not contain any supplementary code points, you should probably use the original UCSReader class from the Xerces team (org.apache.xerces.impl.io.UCSReader). -AK

        Specified by:
        read in class Reader
        Parameters:
        ch - Destination buffer
        offset - Offset at which to start storing characters
        length - Maximum number of characters to read
        Returns:
        The number of characters read, or -1 if the end of the stream has been reached. Note that this is not the number of UCS-4 characters read, but the number of UTF-16 code units. The two are equal only if there were no supplementary Unicode code points among the chars read.
        Throws:
        IOException - If an I/O error occurs
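
        The difference between UCS-4 characters and UTF-16 code units can be seen in a small standalone sketch that decodes big-endian UCS-4 bytes with the standard library only. It does not use this class; Ucs4CountSketch and decodeUcs4Be are hypothetical names, and trailing bytes that do not form a full 4-byte unit are simply ignored here.

```java
/** Sketch only: a standalone big-endian UCS-4 decoder for illustration. */
final class Ucs4CountSketch {
    /** Decodes big-endian UCS-4 bytes into UTF-16 chars; incomplete trailing units are ignored. */
    static char[] decodeUcs4Be(byte[] in) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 4 <= in.length; i += 4) {
            int codePoint = ((in[i] & 0xFF) << 24)
                    | ((in[i + 1] & 0xFF) << 16)
                    | ((in[i + 2] & 0xFF) << 8)
                    | (in[i + 3] & 0xFF);
            // Appends one char for BMP code points, two for supplementary ones;
            // throws IllegalArgumentException for values that are not valid code points.
            sb.appendCodePoint(codePoint);
        }
        return sb.toString().toCharArray();
    }
}
```

        Decoding the 8 bytes for U+0041 followed by U+1F600 gives 3 chars for 2 input characters, which is the kind of count this method reports.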
      • readUCS2

        protected int readUCS2​(char[] ch,
                               int offset,
                               int length)
                        throws IOException
        Read UCS-2 characters into a portion of an array. This method will block until some input is available, an I/O error occurs, or the end of the stream is reached.

        In the original UCSReader this code was part of the read(char[], int, int) method, but I moved it out of there to reduce the complexity of the latter.

        Parameters:
        ch - destination buffer
        offset - offset at which to start storing characters
        length - maximum number of characters to read
        Returns:
        The number of characters read, or -1 if the end of the stream has been reached
        Throws:
        IOException - If an I/O error occurs
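
        As a rough illustration of what reading UCS-2 involves, each pair of input bytes maps to exactly one Java char, with the byte order deciding which byte is the high one. The sketch below is standalone and not taken from this class; Ucs2DecodeSketch, decode, and the bigEndian flag are hypothetical.

```java
/** Sketch only: combines two bytes into one UCS-2 code unit. */
final class Ucs2DecodeSketch {
    /** Decodes one UCS-2 unit from two consecutive stream bytes in the given byte order. */
    static char decode(byte first, byte second, boolean bigEndian) {
        int high = (bigEndian ? first : second) & 0xFF;
        int low = (bigEndian ? second : first) & 0xFF;
        return (char) ((high << 8) | low);
    }
}
```

        Because the mapping is one-to-one, no excess chars arise here, which matches the note that the internal char buffer is never used for UCS-2.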
      • skip

        public long skip​(long n)
                  throws IOException
        Skip characters. This method will block until some characters are available, an I/O error occurs, or the end of the stream is reached.
        Overrides:
        skip in class Reader
        Parameters:
        n - The number of characters to skip
        Returns:
        The number of characters actually skipped
        Throws:
        IOException - If an I/O error occurs
      • ready

        public boolean ready()
                      throws IOException
        Tell whether this stream is ready to be read.
        Overrides:
        ready in class Reader
        Returns:
        True if the next read() is guaranteed not to block for input, false otherwise. Note that returning false does not guarantee that the next read will block.
        Throws:
        IOException - If an I/O error occurs
      • markSupported

        public boolean markSupported()
        Tell whether this stream supports the mark() operation.
        Overrides:
        markSupported in class Reader
      • mark

        public void mark​(int readAheadLimit)
                  throws IOException
        Mark the present position in the stream. Subsequent calls to reset will attempt to reposition the stream to this point. Not all character-input streams support the mark operation. This is one of them :) It relies on the marking facilities of the underlying byte stream.
        Overrides:
        mark in class Reader
        Parameters:
        readAheadLimit - Limit on the number of characters that may be read while still preserving the mark. After reading this many characters, attempting to reset the stream may fail.
        Throws:
        IOException - If the stream does not support mark, or if some other I/O error occurs
      • reset

        public void reset()
                   throws IOException
        Reset the stream. If the stream has been marked, then attempt to reposition it at the mark. If the stream has not been marked, then attempt to reset it in some way appropriate to the particular stream, for example by repositioning it to its starting point. This stream implementation does not support mark/reset by itself; it relies on the underlying byte stream in this matter.
        Overrides:
        reset in class Reader
        Throws:
        IOException - If the stream has not been marked, or if the mark has been invalidated, or if the stream does not support reset(), or if some other I/O error occurs
      • close

        public void close()
                   throws IOException
        Close the stream. Once a stream has been closed, further read, ready, mark, or reset invocations will throw an IOException. Closing a previously-closed stream, however, has no effect.
        Specified by:
        close in interface AutoCloseable
        Specified by:
        close in interface Closeable
        Specified by:
        close in class Reader
        Throws:
        IOException - If an I/O error occurs
      • getEncoding

        public String getEncoding()
        Returns the encoding currently in use by this character stream.
        Returns:
        Encoding of this stream: either ISO-10646-UCS-2 or ISO-10646-UCS-4. The problem is that this string does not indicate the byte order of the encoding. Unlike with UTF-16, the byte order cannot be made part of the encoding name in this case, yet it can still be critical. Currently you can find out the byte order by invoking the getByteOrder method.
      • getByteOrder

        public String getByteOrder()
        Returns the byte order ("endianness") of the encoding currently in use by this character stream. This is a string with two possible values: LITTLE_ENDIAN and BIG_ENDIAN. Maybe a named constant would be a better alternative, but I just don't like them. Feel free to change this behavior if you think that would be better.
        Returns:
        LITTLE_ENDIAN or BIG_ENDIAN depending on byte order of current encoding of this stream.
      • isSupplementaryCodePoint

        protected boolean isSupplementaryCodePoint​(int codePoint)
        Determines whether the specified character (Unicode code point) is in the supplementary character range. The method call is equivalent to the expression:
         codePoint >= 0x10000 && codePoint <= 0x10ffff
         
        Borrowed from the JDK 1.5 java.lang.Character class in order to provide JDK 1.4 compatibility.
        Parameters:
        codePoint - the character (Unicode code point) to be tested
        Returns:
        true if the specified character is in the Unicode supplementary character range; false otherwise.
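
        The range check quoted above is small enough to restate as a standalone sketch and compare against java.lang.Character at the boundaries. SupplementaryCheckSketch and isSupplementary are hypothetical names; the protected method of this class is not called here.

```java
/** Sketch only: restates the documented range check outside the class. */
final class SupplementaryCheckSketch {
    /** True for code points in the supplementary range 0x10000..0x10FFFF. */
    static boolean isSupplementary(int codePoint) {
        return codePoint >= 0x10000 && codePoint <= 0x10ffff;
    }
}
```

        The last BMP code point 0xFFFF and the first value past the Unicode range, 0x110000, both fall outside the check, while 0x10000 and 0x10FFFF fall inside it.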