Package de.julielab.jcore.consumer.ew
Class Decoder
- java.lang.Object
  - de.julielab.jcore.consumer.ew.Decoder
-
public class Decoder extends Object
This class offers methods to decode the binary format of a sequence of text-embedding pairs. It also offers code for merging multiple streams of ordered text-embedding pair sequences in binary format into a single output. When grouping by text is requested, each text occurs only once in that output and all of its embedding vectors are averaged.
-
Constructor Summary
Constructor | Description
Decoder() |
-
Method Summary
Modifier and Type | Method | Description
static org.apache.commons.lang3.tuple.Pair<List<String>,List<double[]>> | decodeBinaryEmbeddingVectors(InputStream is) |
static org.apache.commons.lang3.tuple.Pair<List<String>,List<double[]>> | decodeBinaryEmbeddingVectors(InputStream is, int bufferSize) |
static void | mergeEmbeddingFiles(List<InputStream> inputStreams, OutputStream os, boolean groupByText) | Merge multiple InputStreams of byte-encoded text-embedding-vector pairs into a single OutputStream.
static double | readDouble(byte[] dest, ByteBuffer bb, InputStream is) | Reads a double from bb with reloading from is if necessary.
static int | readInt(byte[] dest, ByteBuffer bb, InputStream is) | Reads an integer from bb with reloading from is if necessary.
static void | readNumberOfBytes(byte[] dest, ByteBuffer bb, InputStream is) | Reads bytes from bb until dest is full.
-
Method Detail
-
decodeBinaryEmbeddingVectors
public static org.apache.commons.lang3.tuple.Pair<List<String>,List<double[]>> decodeBinaryEmbeddingVectors(InputStream is) throws IOException
- Throws:
IOException
-
decodeBinaryEmbeddingVectors
public static org.apache.commons.lang3.tuple.Pair<List<String>,List<double[]>> decodeBinaryEmbeddingVectors(InputStream is, int bufferSize) throws IOException
- Throws:
IOException
-
mergeEmbeddingFiles
public static void mergeEmbeddingFiles(List<InputStream> inputStreams, OutputStream os, boolean groupByText) throws IOException
Merge multiple InputStreams of byte-encoded text-embedding-vector pairs into a single OutputStream. All sources are expected to be sorted in ascending lexicographic order of their text elements. If groupByText is set to true, the output is grouped by text: each text element then occurs only once in the output, and all vectors originally associated with it are collapsed into a single averaged vector for this text.
- Parameters:
inputStreams - The input streams holding the text-vector pairs in byte format.
os - The output stream to write the text-vector pairs in byte format to.
groupByText - Whether the vectors associated with the same text should be averaged into a single vector, a centroid.
- Throws:
IOException
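The grouping behaviour described for groupByText can be illustrated with a minimal sketch. It is not the code of mergeEmbeddingFiles: the class MergeSketch and the method mergeAndAverage are hypothetical names, in-memory lists stand in for the already decoded input streams, the binary encoding is left out entirely, and String.compareTo is assumed to match the lexicographic order the sources are sorted by.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical illustration only; not part of de.julielab.jcore.consumer.ew.
public class MergeSketch {

    /** One sorted source together with the entry it currently points at. */
    private static final class Cursor {
        final Iterator<Map.Entry<String, double[]>> it;
        Map.Entry<String, double[]> head;

        Cursor(Iterator<Map.Entry<String, double[]>> it) {
            this.it = it;
            this.head = it.next(); // only called for non-empty sources
        }

        /** Moves to the next entry; returns false when the source is exhausted. */
        boolean advance() {
            if (!it.hasNext()) return false;
            head = it.next();
            return true;
        }
    }

    /**
     * Merges sources that are each sorted ascending by text and averages
     * all vectors that share the same text into one centroid.
     */
    public static List<Map.Entry<String, double[]>> mergeAndAverage(
            List<List<Map.Entry<String, double[]>>> sources) {
        // The queue always yields the cursor with the smallest current text.
        PriorityQueue<Cursor> queue =
                new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.head.getKey()));
        for (List<Map.Entry<String, double[]>> source : sources) {
            if (!source.isEmpty()) queue.add(new Cursor(source.iterator()));
        }

        List<Map.Entry<String, double[]>> merged = new ArrayList<>();
        String currentText = null;
        double[] sum = null;
        int count = 0;
        while (!queue.isEmpty()) {
            Cursor c = queue.poll();
            String text = c.head.getKey();
            double[] vector = c.head.getValue();
            if (!text.equals(currentText)) {
                // A new text starts: emit the centroid of the previous group.
                if (currentText != null) {
                    merged.add(new SimpleEntry<>(currentText, divide(sum, count)));
                }
                currentText = text;
                sum = new double[vector.length];
                count = 0;
            }
            for (int i = 0; i < vector.length; i++) sum[i] += vector[i];
            count++;
            // Re-insert the cursor only after its head has been updated.
            if (c.advance()) queue.add(c);
        }
        if (currentText != null) {
            merged.add(new SimpleEntry<>(currentText, divide(sum, count)));
        }
        return merged;
    }

    private static double[] divide(double[] sum, int count) {
        double[] mean = new double[sum.length];
        for (int i = 0; i < sum.length; i++) mean[i] = sum[i] / count;
        return mean;
    }
}
```

Because every source is sorted, equal texts from different sources leave the queue consecutively, so only one running sum has to be kept in memory at a time.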
-
readDouble
public static double readDouble(byte[] dest, ByteBuffer bb, InputStream is) throws IOException
Reads a double from bb with reloading from is if necessary.
- Parameters:
dest - A Double.BYTES-sized array to hold the double bytes.
bb - A ByteBuffer that is used to buffer input from is.
is - The original InputStream that is read.
- Returns:
- The read double value.
- Throws:
IOException - If reading fails.
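How a Double.BYTES-sized array maps to a double value can be sketched with plain JDK calls. The class DoubleBytesSketch and its methods are hypothetical, and the byte order is an assumption: this page does not state which order the binary format uses, so the sketch relies on the big-endian default of ByteBuffer.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not part of de.julielab.jcore.consumer.ew.
public class DoubleBytesSketch {

    /** Interprets a Double.BYTES-sized array as a double (big-endian assumed). */
    public static double toDouble(byte[] dest) {
        if (dest.length != Double.BYTES) {
            throw new IllegalArgumentException("expected " + Double.BYTES + " bytes");
        }
        return ByteBuffer.wrap(dest).getDouble();
    }

    /** Inverse operation, useful for building test input. */
    public static byte[] toBytes(double value) {
        return ByteBuffer.allocate(Double.BYTES).putDouble(value).array();
    }
}
```

An Integer.BYTES-sized array can be handled the same way with getInt() and putInt().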
-
readInt
public static int readInt(byte[] dest, ByteBuffer bb, InputStream is) throws IOException
Reads an integer from bb with reloading from is if necessary.
- Parameters:
dest - An Integer.BYTES-sized array to hold the integer bytes.
bb - A ByteBuffer that is used to buffer input from is.
is - The original InputStream that is read.
- Returns:
- The read integer.
- Throws:
IOException - If reading fails.
-
readNumberOfBytes
public static void readNumberOfBytes(byte[] dest, ByteBuffer bb, InputStream is) throws IOException
Reads bytes from bb until dest is full. If bb is exhausted before dest is filled, bytes from is are read into the backing byte[] of bb, the position of bb is set to 0, and reading continues until dest is full or no more bytes are available from is.
- Parameters:
dest - The destination array to fill from the InputStream through the given ByteBuffer.
bb - A ByteBuffer that may contain already read contents from is.
is - The original input stream from which bytes are read into bb for further consumption.
- Throws:
IOException - If reading from is fails.
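The refill behaviour described above can be sketched as follows. This is a re-implementation under assumptions, not the library's code: the class RefillSketch and the method fill are hypothetical, bb is assumed to be a heap buffer created with ByteBuffer.allocate or ByteBuffer.wrap (so that array() is available and starts at offset 0), and, unlike the documented void method, the sketch returns the number of bytes copied so that a short read is visible to the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Hypothetical illustration only; not part of de.julielab.jcore.consumer.ew.
public class RefillSketch {

    /**
     * Copies bytes from bb into dest, refilling bb from is whenever bb runs dry.
     * Returns the number of bytes actually copied, which is smaller than
     * dest.length only if is ended first.
     */
    public static int fill(byte[] dest, ByteBuffer bb, InputStream is) throws IOException {
        int filled = 0;
        while (filled < dest.length) {
            if (!bb.hasRemaining()) {
                // Reload the backing array directly from the stream.
                int read = is.read(bb.array());
                if (read <= 0) break; // end of stream (or zero-capacity buffer)
                bb.clear();           // position = 0, limit = capacity
                bb.limit(read);       // expose only the freshly read bytes
            }
            int n = Math.min(bb.remaining(), dest.length - filled);
            bb.get(dest, filled, n);
            filled += n;
        }
        return filled;
    }
}
```

Setting the limit to the number of bytes just read keeps stale bytes from an earlier, longer read out of the result when the stream delivers less than the buffer capacity.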
-