Class TextChunker


  • public class TextChunker
    extends Object
    Splits text into chunks while attempting to keep meaning intact. For plain text, it splits on new lines first, then on periods, and so on. For markdown, it splits on punctuation first, and so on.
    • Constructor Detail

      • TextChunker

        public TextChunker()
    • Method Detail

      • splitPlainTextLines

        public static List<String> splitPlainTextLines(String text,
                                                       int maxTokensPerLine)
        Split plain text into lines
        Parameters:
        text - Text to split
        maxTokensPerLine - Maximum number of tokens per line
        Returns:
        List of lines
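        The coarse-to-fine strategy described for plain text can be illustrated with a minimal sketch. This is not the library's implementation: the class name LineSplitSketch, the two-level separator list, and the estimate of roughly four characters per token are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the TextChunker source.
final class LineSplitSketch {

    // Separators tried in order, coarsest first: new lines, then sentence ends.
    private static final String[] SEPARATORS = {"\\n", "(?<=[.!?])\\s+"};

    // Assumption: about four characters per token. The real estimate may differ.
    static int estimateTokens(String s) {
        return s.length() / 4;
    }

    static List<String> splitLines(String text, int maxTokensPerLine) {
        List<String> out = new ArrayList<>();
        split(text, maxTokensPerLine, 0, out);
        return out;
    }

    private static void split(String piece, int maxTokens, int level, List<String> out) {
        String trimmed = piece.trim();
        if (trimmed.isEmpty()) {
            return;
        }
        // Emit the piece if it fits. In this sketch, a piece that is still too
        // long after the last separator is emitted as-is rather than cut further.
        if (estimateTokens(trimmed) <= maxTokens || level == SEPARATORS.length) {
            out.add(trimmed);
            return;
        }
        // Otherwise split on the current separator and retry each part
        // with the next, finer separator.
        for (String part : trimmed.split(SEPARATORS[level])) {
            split(part, maxTokens, level + 1, out);
        }
    }

    public static void main(String[] args) {
        String text = "First line here.\nSecond sentence one. Second sentence two.";
        splitLines(text, 5).forEach(System.out::println);
    }
}
```

        With the real class, the equivalent call would be TextChunker.splitPlainTextLines(text, maxTokensPerLine), which returns the resulting lines as a List<String>.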
      • splitMarkDownLines

        public static List<String> splitMarkDownLines(String text,
                                                      int maxTokensPerLine)
        Split markdown text into lines
        Parameters:
        text - Text to split
        maxTokensPerLine - Maximum number of tokens per line
        Returns:
        List of lines
      • splitPlainTextParagraphs

        public static List<String> splitPlainTextParagraphs(List<String> lines,
                                                            int maxTokensPerParagraph)
        Split plain text into paragraphs
        Parameters:
        lines - Lines of text
        maxTokensPerParagraph - Maximum number of tokens per paragraph
        Returns:
        List of paragraphs
      • splitMarkdownParagraphs

        public static List<String> splitMarkdownParagraphs(List<String> lines,
                                                           int maxTokensPerParagraph)
        Split markdown text into paragraphs
        Parameters:
        lines - Lines of text
        maxTokensPerParagraph - Maximum number of tokens per paragraph
        Returns:
        List of paragraphs
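        The paragraph methods take lines that have already been split and group them under a token budget. A minimal sketch of one way to do this is a greedy pass that keeps appending lines until the next one would exceed the limit. The class name ParagraphGroupSketch, the newline used to join lines, and the four-characters-per-token estimate are assumptions for illustration, not details confirmed by this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the TextChunker source.
final class ParagraphGroupSketch {

    // Assumption: about four characters per token. The real estimate may differ.
    static int estimateTokens(String s) {
        return s.length() / 4;
    }

    static List<String> groupIntoParagraphs(List<String> lines, int maxTokensPerParagraph) {
        List<String> paragraphs = new ArrayList<>();
        String current = "";
        for (String line : lines) {
            // Assumption: lines within a paragraph are joined with a newline.
            String candidate = current.isEmpty() ? line : current + "\n" + line;
            if (!current.isEmpty() && estimateTokens(candidate) > maxTokensPerParagraph) {
                // Adding this line would exceed the budget: close the current
                // paragraph and start a new one with this line.
                paragraphs.add(current);
                current = line;
            } else {
                current = candidate;
            }
        }
        if (!current.isEmpty()) {
            paragraphs.add(current);
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("aaaaaaaa", "bbbbbbbb", "cccccccc");
        groupIntoParagraphs(lines, 4).forEach(p -> System.out.println("[" + p + "]"));
    }
}
```

        With the real class, the typical pipeline would pass the result of splitPlainTextLines to splitPlainTextParagraphs (or the markdown pair, respectively), each with its own token limit.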