Package ai.djl.modality.nlp.preprocess
Contains utility classes for natural language pre-processing tasks.
Interface Summary

Interface             Description
TextProcessor         TextProcessor allows applying pre-processing to input tokens for natural language applications.
Tokenizer             The Tokenizer interface provides the ability to break down sentences into embeddable tokens.
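
The two roles described above can be sketched with plain JDK types: a tokenizer turns a sentence into a list of tokens, and each processor maps one token list to another, so processors can be chained. This is only an illustration under stated assumptions. The class PreprocessSketch, its methods, and the "<bos>"/"<eos>" markers are hypothetical stand-ins and are not the DJL interfaces or their actual signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

public class PreprocessSketch {

    // Tokenizer role (stand-in): break a sentence into tokens on whitespace.
    static List<String> tokenize(String sentence) {
        return new ArrayList<>(Arrays.asList(sentence.trim().split("\\s+")));
    }

    // Processor role (stand-in): lower-case every token.
    static final UnaryOperator<List<String>> LOWER_CASE =
            tokens -> tokens.stream()
                    .map(t -> t.toLowerCase(Locale.ROOT))
                    .collect(Collectors.toList());

    // Processor role (stand-in): wrap the tokens in placeholder start/end markers.
    static final UnaryOperator<List<String>> TERMINATE = tokens -> {
        List<String> out = new ArrayList<>(tokens);
        out.add(0, "<bos>");
        out.add("<eos>");
        return out;
    };

    // Tokenize once, then apply each processing step in the given order.
    static List<String> run(String sentence, List<UnaryOperator<List<String>>> steps) {
        List<String> tokens = tokenize(sentence);
        for (UnaryOperator<List<String>> step : steps) {
            tokens = step.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [<bos>, deep, java, library, <eos>]
        System.out.println(run("Deep Java Library", Arrays.asList(LOWER_CASE, TERMINATE)));
    }
}
```

Keeping every step as a list-in, list-out function means the order of steps is the only thing the caller has to decide. For example, lower-casing before adding the markers leaves the markers untouched.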
Class Summary

Class                 Description
HyphenNormalizer      Unicode normalization does not take care of "exotic" hyphens that we normally do not want in NLP input.
LambdaProcessor       A TextProcessor that applies a user-defined lambda function to input tokens.
LowerCaseConvertor    LowerCaseConvertor converts every character of the input tokens to its respective lower case character.
PunctuationSeparator  PunctuationSeparator separates punctuation into a separate token.
SimpleTokenizer       SimpleTokenizer is an implementation of the Tokenizer interface that converts sentences into tokens by splitting them by a given delimiter.
TextCleaner           Removes or replaces certain characters based on a condition.
TextTerminator        A TextProcessor that adds beginning-of-string and end-of-string tokens.
TextTruncator         A TextProcessor that truncates text to a maximum size.
UnicodeNormalizer     Applies Unicode normalization to input strings.
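
The HyphenNormalizer entry notes that Unicode normalization alone leaves "exotic" hyphens in place. The sketch below shows that gap using only the JDK's java.text.Normalizer: an en dash survives NFKC normalization unchanged, so a separate pass is needed to map it to an ASCII hyphen. The class HyphenSketch and its list of dash characters are assumptions made for illustration. The list is a small subset and is not the set that HyphenNormalizer actually handles.

```java
import java.text.Normalizer;

public class HyphenSketch {

    // A few Unicode dash code points; an illustrative subset, not an exhaustive list.
    private static final String DASHES = "\u2010\u2011\u2012\u2013\u2014\u2212";

    // Standard Unicode normalization (compatibility composition form).
    static String unicodeNormalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Replace each listed dash character with the ASCII hyphen-minus.
    static String normalizeHyphens(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(DASHES.indexOf(c) >= 0 ? '-' : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "state\u2013of\u2013the\u2013art"; // joined with en dashes
        String nfkc = unicodeNormalize(input);
        System.out.println(nfkc.equals(input));     // true: NFKC leaves the en dashes alone
        System.out.println(normalizeHyphens(nfkc)); // state-of-the-art
    }
}
```

Running the hyphen pass after Unicode normalization keeps the two concerns separate, which matches how the package lists them as two distinct classes.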