public final class StringTracker
extends java.lang.Object
| Modifier and Type | Field and Description |
|---|---|
| static org.apache.datasketches.ArrayOfStringsSerDe | ARRAY_OF_STRINGS_SER_DE |
| static int | MAX_FREQUENT_ITEM_SIZE |
| static java.util.function.Function<java.lang.String,java.util.List<java.lang.String>> | TOKENIZER |
| Constructor and Description |
|---|
| StringTracker() |
| Modifier and Type | Method and Description |
|---|---|
| static StringTracker | fromProtobuf(com.whylogs.core.message.StringsMessage message) |
| StringTracker | merge(StringTracker other): Merge this StringTracker object with another. |
| com.whylogs.core.message.StringsMessage.Builder | toProtobuf() |
| void | update(java.lang.String value): Track statistical properties of characters in a string. |
| void | update(java.lang.String value, java.lang.String charString): Track statistical properties of just the characters from a given character set. |
| void | update(java.lang.String value, java.lang.String charString, java.util.function.Function<java.lang.String,java.util.List<java.lang.String>> tokenizer): Track statistical properties of a string. |
public static java.util.function.Function<java.lang.String,java.util.List<java.lang.String>> TOKENIZER
public static final org.apache.datasketches.ArrayOfStringsSerDe ARRAY_OF_STRINGS_SER_DE
public static final int MAX_FREQUENT_ITEM_SIZE
public void update(java.lang.String value)
`value` is a Unicode string. It is tokenized, and each token is passed to CharPosTracker, which tracks the position and frequency of the Unicode code points in the token.
Overloads of this method allow the tokenizer and the tracked character set to be changed during updates. Unless overridden by one of the other `update` overloads, this method uses a tokenizer that breaks strings at spaces and tracks lowercase alphanumeric characters.
value - string
public void update(java.lang.String value,
java.lang.String charString)
`value` is tokenized, and the position and frequency of Unicode code points within tokens are tracked if they appear in `charString`. If set, `charString` is applied to subsequent calls to `update`, overriding the default character set.
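To illustrate the character-set filtering described above, here is a minimal sketch that counts code points in one token and groups everything outside `charString` under a single 'NITL' bucket. It is not the whylogs implementation: the class `CharSetCountSketch` and the method `countCodePoints` are hypothetical names, and only frequency is modelled, not position.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of whylogs.
final class CharSetCountSketch {
    // Bucket label for code points that are not in the tracked set.
    static final String NITL = "NITL";

    // Counts the code points of `token`. Code points that do not occur in
    // `charString` are counted together under the NITL key.
    static Map<String, Integer> countCodePoints(String token, String charString) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        token.codePoints().forEach(cp -> {
            String key = charString.indexOf(cp) >= 0
                ? new String(Character.toChars(cp))
                : NITL;
            counts.merge(key, 1, Integer::sum);
        });
        return counts;
    }

    public static void main(String[] args) {
        // Track only 'a', 'b' and 'c'; '-' and '1' fall into the NITL bucket.
        System.out.println(countCodePoints("ab-1a", "abc"));
    }
}
```

Iterating with `codePoints()` rather than `chars()` keeps supplementary characters intact, which matches the description of tracking Unicode code points rather than UTF-16 units.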
value - string - Unicode string to be tracked
charString - string - Set of characters that should be tracked. All others will be tracked as 'NITL'.
public void update(java.lang.String value,
java.lang.String charString,
java.util.function.Function<java.lang.String,java.util.List<java.lang.String>> tokenizer)
`value` is tokenized according to `tokenizer`. The position and frequency of Unicode code points within tokens are tracked if they appear in `charString`. If set, `charString` and/or `tokenizer` will be used for subsequent calls to `update`.
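The `tokenizer` argument is a `java.util.function.Function<String, List<String>>`. The following is a minimal sketch of a function with that shape. The class `TokenizerSketch` and the constant `WHITESPACE_TOKENIZER` are hypothetical names, and splitting on runs of whitespace is an assumption: this page does not show how the library's own `TOKENIZER` is implemented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper; not part of whylogs.
final class TokenizerSketch {
    // Splits the input on runs of whitespace and drops empty tokens.
    // Assumes a non-null input string.
    static final Function<String, List<String>> WHITESPACE_TOKENIZER =
        s -> Arrays.stream(s.trim().split("\\s+"))
                   .filter(t -> !t.isEmpty())
                   .collect(Collectors.toList());

    public static void main(String[] args) {
        System.out.println(WHITESPACE_TOKENIZER.apply("hello  big world"));
    }
}
```

A function of this type could be supplied as the third argument, for example `tracker.update(value, charString, TokenizerSketch.WHITESPACE_TOKENIZER)`, where `tracker` is a `StringTracker` instance.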
value - string
charString - string - Set of characters that should be tracked. All others will be tracked as 'NITL'.
tokenizer - function taking a string and returning a list of strings.
public StringTracker merge(StringTracker other)
other - the other StringTracker to merge
public com.whylogs.core.message.StringsMessage.Builder toProtobuf()
public static StringTracker fromProtobuf(com.whylogs.core.message.StringsMessage message)