Interface DocumentSplitter

    public interface DocumentSplitter
    
                        

    Defines the interface for splitting a document into text segments. Splitting is necessary because LLMs have a limited context window, so a large document often cannot be sent in full. The document should therefore be split into segments first, and only the relevant segments sent to the LLM. DocumentSplitters.recursive() from the dev.langchain4j:langchain4j module is a good starting point.

    • Method Summary

      Modifier and Type Method Description
      abstract List<TextSegment> split(Document document) Splits a single Document into a list of TextSegment objects.
      List<TextSegment> splitAll(List<Document> documents) Splits a list of Documents into a list of TextSegment objects.
    • Method Detail

      • split

         abstract List<TextSegment> split(Document document)

        Splits a single Document into a list of TextSegment objects. The metadata is typically copied from the document and enriched with segment-specific information, such as position in the document, page number, etc.

        Parameters:
        document - The Document to be split.
        Returns:
        A list of TextSegment objects derived from the input Document.
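
        The metadata behavior described above can be illustrated with a minimal sketch. It does not use the library's classes: Doc and Segment are hypothetical stand-ins for Document and TextSegment, the blank-line splitting rule is an arbitrary choice, and the "index" metadata key is an assumed name for the segment's position.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: Doc and Segment are hypothetical stand-ins,
// NOT the library's Document and TextSegment classes.
class ParagraphSplitSketch {

    static final class Doc {
        final String text;
        final Map<String, String> metadata;

        Doc(String text, Map<String, String> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    static final class Segment {
        final String text;
        final Map<String, String> metadata;

        Segment(String text, Map<String, String> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    // Splits on blank lines. Each segment receives a copy of the document
    // metadata, enriched with its position under the assumed key "index".
    static List<Segment> split(Doc document) {
        List<Segment> segments = new ArrayList<>();
        int index = 0;
        for (String paragraph : document.text.split("\\n\\s*\\n")) {
            String trimmed = paragraph.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty pieces, e.g. from leading blank lines
            }
            Map<String, String> metadata = new HashMap<>(document.metadata);
            metadata.put("index", String.valueOf(index++));
            segments.add(new Segment(trimmed, metadata));
        }
        return segments;
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        meta.put("source", "notes.txt");
        Doc doc = new Doc("First paragraph.\n\nSecond paragraph.", meta);
        for (Segment segment : split(doc)) {
            System.out.println(segment.metadata.get("index") + ": " + segment.text);
        }
    }
}
```

        Copying the map per segment keeps the document's own metadata unchanged, so segment-specific entries never leak back into the source document or into sibling segments.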

      • splitAll

         List<TextSegment> splitAll(List<Document> documents)

        Splits a list of Documents into a list of TextSegment objects. This is a convenience method that calls the split method for each Document in the list.

        Parameters:
        documents - The list of Documents to be split.
        Returns:
        A list of TextSegment objects derived from the input Documents.
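
        One way such a convenience method could be written is sketched below. This is an assumption about the shape, not the library's source: Splitter is a hypothetical stand-in interface, and plain String values take the place of Document and TextSegment to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitAllSketch {

    // Hypothetical stand-in mirroring the documented method pair,
    // with String in place of Document and TextSegment.
    interface Splitter {

        List<String> split(String document);

        // Applies split to each document and concatenates the results,
        // preserving both document order and segment order.
        default List<String> splitAll(List<String> documents) {
            List<String> segments = new ArrayList<>();
            for (String document : documents) {
                segments.addAll(split(document));
            }
            return segments;
        }
    }

    public static void main(String[] args) {
        // Toy splitter: break after each period that is followed by whitespace.
        Splitter bySentence = doc -> Arrays.asList(doc.split("(?<=\\.)\\s+"));
        System.out.println(bySentence.splitAll(Arrays.asList("One. Two.", "Three.")));
    }
}
```

        With this shape, an implementer only has to supply split; splitAll comes for free and still can be overridden if batch processing needs different behavior.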