Class Decoder


  • public class Decoder
    extends Object
    This class offers methods to decode the binary format of a sequence of text-embedding pairs. It can also merge multiple binary streams of sorted text-embedding pairs into a single output in which all embedding vectors belonging to the same text are averaged.
    • Constructor Detail

      • Decoder

        public Decoder()
    • Method Detail

      • mergeEmbeddingFiles

        public static void mergeEmbeddingFiles(List<InputStream> inputStreams,
                                               OutputStream os,
                                               boolean groupByText)
                                        throws IOException
        Merges multiple InputStreams of byte-encoded text-embedding vector pairs into a single OutputStream. All sources are expected to be sorted in ascending lexicographic order of their text elements. If groupByText is true, the output is grouped by text: each text element appears only once, and all vectors originally associated with it are collapsed into one averaged vector.
        Parameters:
        inputStreams - The input streams holding the text-vector pairs in byte format.
        os - The output stream to write the text-vector pairs in byte format to.
        groupByText - Whether the vectors associated with the same text should be averaged into a single vector, a centroid.
        Throws:
        IOException
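        The merge-and-average behaviour described above can be sketched in memory as follows. This is only an illustration under stated assumptions, not the code behind mergeEmbeddingFiles: the class MergeSketch, its Pair type and its merge method are hypothetical names, the sources are plain lists rather than binary streams, texts are compared with String.compareTo, and all vectors of one text are assumed to have the same length.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class MergeSketch {
    /** Hypothetical in-memory stand-in for one decoded text-vector record. */
    static final class Pair {
        final String text;
        final double[] vector;
        Pair(String text, double[] vector) { this.text = text; this.vector = vector; }
    }

    /** Merges sorted sources; if groupByText is true, equal texts are averaged into one centroid. */
    static List<Pair> merge(List<List<Pair>> sources, boolean groupByText) {
        // Queue entries are {source index, position in that source}, ordered by the text at that position.
        PriorityQueue<int[]> queue = new PriorityQueue<>(
                (a, b) -> textAt(sources, a).compareTo(textAt(sources, b)));
        for (int i = 0; i < sources.size(); i++) {
            if (!sources.get(i).isEmpty()) queue.add(new int[]{i, 0});
        }
        List<Pair> out = new ArrayList<>();
        String currentText = null;
        double[] sum = null;
        int count = 0;
        while (!queue.isEmpty()) {
            int[] head = queue.poll();
            Pair p = sources.get(head[0]).get(head[1]);
            // Advance the source we just consumed from, if it has more records.
            if (head[1] + 1 < sources.get(head[0]).size()) {
                queue.add(new int[]{head[0], head[1] + 1});
            }
            if (!groupByText) {
                out.add(p);
                continue;
            }
            if (!p.text.equals(currentText)) {
                // A new text starts: emit the centroid of the previous group.
                if (currentText != null) out.add(new Pair(currentText, average(sum, count)));
                currentText = p.text;
                sum = new double[p.vector.length];
                count = 0;
            }
            for (int d = 0; d < sum.length; d++) sum[d] += p.vector[d];
            count++;
        }
        if (groupByText && currentText != null) out.add(new Pair(currentText, average(sum, count)));
        return out;
    }

    private static String textAt(List<List<Pair>> sources, int[] entry) {
        return sources.get(entry[0]).get(entry[1]).text;
    }

    private static double[] average(double[] sum, int count) {
        double[] avg = new double[sum.length];
        for (int d = 0; d < sum.length; d++) avg[d] = sum[d] / count;
        return avg;
    }
}
```

        Because every source is already sorted, equal texts leave the queue consecutively, so a single running sum per group is enough and no map of all texts has to be held in memory.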
      • readDouble

        public static double readDouble(byte[] dest,
                                        ByteBuffer bb,
                                        InputStream is)
                                 throws IOException
        Reads a double from bb, reloading bb from is if necessary.
        Parameters:
        dest - A Double.BYTES-sized array to hold the bytes of the double.
        bb - A ByteBuffer that is used to buffer input from is.
        is - The original InputStream that is read.
        Returns:
        The read double value.
        Throws:
        IOException - If reading fails.
      • readInt

        public static int readInt(byte[] dest,
                                  ByteBuffer bb,
                                  InputStream is)
                           throws IOException
        Reads an integer from bb, reloading bb from is if necessary.
        Parameters:
        dest - An Integer.BYTES-sized array to hold the bytes of the integer.
        bb - A ByteBuffer that is used to buffer input from is.
        is - The original InputStream that is read.
        Returns:
        The read integer.
        Throws:
        IOException - If reading fails.
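        Once dest has been filled, turning its bytes into a number might look like the sketch below. It relies on assumptions: the byte order of the binary format is not stated here, so the sketch uses the big-endian default of java.nio.ByteBuffer, and the class FixedWidthSketch with its toInt and toDouble methods is a hypothetical helper, not part of Decoder.

```java
import java.nio.ByteBuffer;

final class FixedWidthSketch {
    /** Interprets the first Integer.BYTES bytes of dest as an int (big-endian is an assumption). */
    static int toInt(byte[] dest) {
        return ByteBuffer.wrap(dest).getInt();
    }

    /** Interprets the first Double.BYTES bytes of dest as a double (big-endian is an assumption). */
    static double toDouble(byte[] dest) {
        return ByteBuffer.wrap(dest).getDouble();
    }
}
```

        Reusing one dest array per value width, as the dest parameters of readInt and readDouble suggest, avoids allocating a new array for every value read.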
      • readNumberOfBytes

        public static void readNumberOfBytes(byte[] dest,
                                             ByteBuffer bb,
                                             InputStream is)
                                      throws IOException
        Reads bytes from bb until dest is full. If bb is exhausted before dest is filled, bytes from is are read into the backing byte[] of bb, the position of bb is reset to 0, and reading continues until dest is full or no more bytes are available from is.
        Parameters:
        dest - The destination array to fill from the InputStream through the given ByteBuffer.
        bb - A ByteBuffer that may contain already read contents from is.
        is - The original input stream from which bytes are read into bb for further consumption.
        Throws:
        IOException - If reading from is fails.
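        The refill loop described for readNumberOfBytes can be sketched as follows. This is a minimal illustration, not the library's implementation: RefillSketch and fill are hypothetical names, bb is assumed to be array-backed with an array offset of 0, and, unlike readNumberOfBytes, the sketch returns the number of bytes copied so that a short read at the end of the stream is visible to the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

final class RefillSketch {
    /** Fills dest from bb, reloading bb's backing array from is whenever bb runs dry. */
    static int fill(byte[] dest, ByteBuffer bb, InputStream is) throws IOException {
        int filled = 0;
        while (filled < dest.length) {
            if (!bb.hasRemaining()) {
                // Reload: overwrite the backing array with fresh bytes from the stream.
                int read = is.read(bb.array());
                if (read <= 0) break; // no more bytes available from is
                bb.clear();           // position = 0, limit = capacity
                bb.limit(read);       // only the freshly read bytes are valid
            }
            // Copy as much as both the buffer and the destination allow.
            int n = Math.min(bb.remaining(), dest.length - filled);
            bb.get(dest, filled, n);
            filled += n;
        }
        return filled;
    }
}
```

        Bytes left in bb after dest is full stay there for the next call, which is why the same bb and is have to be passed to every read on one stream.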