public class Extractor extends Object
| Modifier and Type | Class and Description |
|---|---|
| static class | Extractor.Entity |
| Modifier and Type | Field and Description |
|---|---|
| protected boolean | extractURLWithoutProtocol |
| static int | MAX_TCO_SLUG_LENGTH: The maximum t.co path length that the Twitter backend supports. |
| static int | MAX_URL_LENGTH: The maximum url length that the Twitter backend supports. |
| Constructor and Description |
|---|
| Extractor(): Create a new extractor. |
| Modifier and Type | Method and Description |
|---|---|
| List<String> | extractCashtags(String text): Extract $cashtag references from Tweet text. |
| List<Extractor.Entity> | extractCashtagsWithIndices(String text): Extract $cashtag references from Tweet text. |
| List<Extractor.Entity> | extractEntitiesWithIndices(String text): Extract URLs, @mentions, lists and #hashtag from a given text/tweet. |
| List<String> | extractHashtags(String text): Extract #hashtag references from Tweet text. |
| List<Extractor.Entity> | extractHashtagsWithIndices(String text): Extract #hashtag references from Tweet text. |
| List<String> | extractMentionedScreennames(String text): Extract @username references from Tweet text. |
| List<Extractor.Entity> | extractMentionedScreennamesWithIndices(String text): Extract @username references from Tweet text. |
| List<Extractor.Entity> | extractMentionsOrListsWithIndices(String text): Extract @username and an optional list reference from Tweet text. |
| String | extractReplyScreenname(String text): Extract a @username reference from the beginning of Tweet text. |
| List<String> | extractURLs(String text): Extract URL references from Tweet text. |
| List<Extractor.Entity> | extractURLsWithIndices(String text): Extract URL references from Tweet text. |
| boolean | isExtractURLWithoutProtocol() |
| static boolean | isValidHostAndLength(int originalUrlLength, String protocol, String originalHost): Verifies that the host name adheres to RFC 3490 and 1035. Also verifies that the entire url (including protocol) doesn't exceed MAX_URL_LENGTH. |
| void | modifyIndicesFromUnicodeToUTF16(String text, List<Extractor.Entity> entities): Modify Unicode-based indices of the entities to UTF-16 based indices. |
| void | modifyIndicesFromUTF16ToUnicode(String text, List<Extractor.Entity> entities): Modify UTF-16-based indices of the entities to Unicode-based indices. |
| void | setExtractURLWithoutProtocol(boolean extractURLWithoutProtocol) |
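
The summary for isValidHostAndLength describes two checks: the host must satisfy RFC 3490 and RFC 1035, and the whole URL, protocol included, must fit within MAX_URL_LENGTH. The standalone sketch below shows one way such a check could be written with the JDK's java.net.IDN class. It is not the library's implementation. The class name, the method name, the ASSUMED_MAX_URL_LENGTH value, and the choice to count "https://" when no protocol is given are all assumptions made for illustration, since this page does not state them.

```java
import java.net.IDN;

/** Hypothetical sketch of a host and length check; not taken from the library. */
final class HostLengthSketch {
    // Placeholder limit: this page does not give the real MAX_URL_LENGTH value.
    static final int ASSUMED_MAX_URL_LENGTH = 4096;
    // Assumed protocol length to add when the URL was written without one.
    static final int ASSUMED_PROTOCOL_LENGTH = "https://".length();

    static boolean isPlausibleHostAndLength(int originalUrlLength, String protocol, String originalHost) {
        if (originalHost == null || originalHost.isEmpty()) {
            return false;
        }
        final String asciiHost;
        try {
            // RFC 3490 ToASCII conversion; it also rejects labels longer than 63 characters.
            asciiHost = IDN.toASCII(originalHost);
        } catch (IllegalArgumentException | IndexOutOfBoundsException e) {
            return false;
        }
        if (asciiHost.isEmpty()) {
            return false;
        }
        // The ASCII (punycode) form can be longer than the original host, so adjust the total.
        int urlLength = originalUrlLength + asciiHost.length() - originalHost.length();
        if (protocol == null) {
            urlLength += ASSUMED_PROTOCOL_LENGTH;
        }
        return urlLength <= ASSUMED_MAX_URL_LENGTH;
    }

    public static void main(String[] args) {
        System.out.println(isPlausibleHostAndLength(19, "https://", "example.com"));
        System.out.println(isPlausibleHostAndLength(19, "https://", null));
    }
}
```

Measuring the length after the IDN conversion matters for internationalized hosts: a URL that fits the limit as typed may exceed it once the host is expanded to its ASCII form.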
public static final int MAX_URL_LENGTH
public static final int MAX_TCO_SLUG_LENGTH
protected boolean extractURLWithoutProtocol
public List<Extractor.Entity> extractEntitiesWithIndices(String text)
text - text of tweet
public List<String> extractMentionedScreennames(String text)
text - of the tweet from which to extract usernames
public List<Extractor.Entity> extractMentionedScreennamesWithIndices(String text)
text - of the tweet from which to extract usernames
public List<Extractor.Entity> extractMentionsOrListsWithIndices(String text)
text - of the tweet from which to extract usernames
public String extractReplyScreenname(String text)
text - of the tweet from which to extract the replied to username
@Nonnull public List<String> extractURLs(@Nullable String text)
text - of the tweet from which to extract URLs
@Nonnull public List<Extractor.Entity> extractURLsWithIndices(@Nullable String text)
text - of the tweet from which to extract URLs
public static boolean isValidHostAndLength(int originalUrlLength, @Nullable String protocol, @Nullable String originalHost)
originalUrlLength - The length of the entire URL, including protocol if any
protocol - The protocol used
originalHost - The hostname to check validity of
public List<String> extractHashtags(String text)
text - of the tweet from which to extract hashtags
public List<Extractor.Entity> extractHashtagsWithIndices(String text)
text - of the tweet from which to extract hashtags
public List<String> extractCashtags(String text)
text - of the tweet from which to extract cashtags
public List<Extractor.Entity> extractCashtagsWithIndices(String text)
text - of the tweet from which to extract cashtags
public void setExtractURLWithoutProtocol(boolean extractURLWithoutProtocol)
public boolean isExtractURLWithoutProtocol()
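
The ...WithIndices methods listed above return Extractor.Entity values rather than plain strings, which suggests each result carries its position in the text. As a rough illustration of that idea only, the sketch below extracts hashtags together with their UTF-16 offsets. The Span class, the naiveHashtagsWithIndices method, and the regular expression are all hypothetical stand-ins: the pattern is deliberately naive (ASCII word characters, at least one letter) and should not be read as the rules the library applies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of index-aware extraction; not the library's patterns or types. */
final class HashtagIndexSketch {
    /** Stand-in for an entity: the tag text plus its [start, end) UTF-16 offsets. */
    static final class Span {
        final int start;
        final int end;
        final String value;

        Span(int start, int end, String value) {
            this.start = start;
            this.end = end;
            this.value = value;
        }

        @Override
        public String toString() {
            return value + "[" + start + "," + end + ")";
        }
    }

    // Naive rule: '#' not preceded by a word character, then word characters with at least one letter.
    private static final Pattern NAIVE_HASHTAG = Pattern.compile("(?<!\\w)#(\\w*[A-Za-z]\\w*)");

    static List<Span> naiveHashtagsWithIndices(String text) {
        List<Span> spans = new ArrayList<>();
        if (text == null) {
            return spans;
        }
        Matcher m = NAIVE_HASHTAG.matcher(text);
        while (m.find()) {
            // start points at '#', end is exclusive, value omits the '#'.
            spans.add(new Span(m.start(), m.end(), m.group(1)));
        }
        return spans;
    }

    public static void main(String[] args) {
        // "#123" is skipped because it contains no letter.
        System.out.println(naiveHashtagsWithIndices("Loving #java and #Unicode2018 #123"));
        // prints [java[7,12), Unicode2018[17,29)]
    }
}
```

Returning offsets alongside the value lets a caller link or highlight the exact span in the original text without searching for it again.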
public void modifyIndicesFromUnicodeToUTF16(String text, List<Extractor.Entity> entities)
In UTF-16 based indices, Unicode supplementary characters are counted as two characters.
This method requires that the list of entities be in ascending order by start index.
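
To make the difference between the two index schemes concrete, the sketch below converts a single Unicode (code point) index into a UTF-16 index using the JDK's String.offsetByCodePoints. The class and method names are hypothetical, and this per-index conversion only illustrates the counting rule; it is not how the library method is implemented.

```java
/** Hypothetical sketch: convert one code point index to a UTF-16 index. */
final class CodePointToUtf16Sketch {
    static int toUtf16Index(String text, int codePointIndex) {
        // Walks codePointIndex code points from the start; a supplementary
        // character advances the UTF-16 offset by two instead of one.
        return text.offsetByCodePoints(0, codePointIndex);
    }

    public static void main(String[] args) {
        // U+1F600 is a supplementary character, stored as a surrogate pair.
        String text = "\uD83D\uDE00 #hi";
        // "#hi" covers code points [2, 5) and UTF-16 offsets [3, 6).
        System.out.println(toUtf16Index(text, 2) + ".." + toUtf16Index(text, 5));
        // prints 3..6
    }
}
```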
text - original text
entities - entities with Unicode based indices
public void modifyIndicesFromUTF16ToUnicode(String text, List<Extractor.Entity> entities)
In Unicode-based indices, Unicode supplementary characters are counted as single characters.
This method requires that the list of entities be in ascending order by start index.
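
One plausible reason for the ascending-order requirement is that the conversion can then be done in a single forward pass over the text. The sketch below shows that approach with the JDK's String.codePointCount. The Span class and the toCodePointIndices method are hypothetical, and the sketch additionally assumes the spans do not overlap; the library's own implementation may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: convert UTF-16 spans to code point spans in one forward pass. */
final class Utf16ToCodePointSketch {
    /** Mutable stand-in for an entity with [start, end) indices. */
    static final class Span {
        int start;
        int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Expects spans sorted by ascending start and not overlapping (assumptions of this sketch). */
    static void toCodePointIndices(String text, List<Span> spans) {
        int utf16Pos = 0;      // UTF-16 offset scanned so far
        int codePointPos = 0;  // number of code points in text[0, utf16Pos)
        for (Span span : spans) {
            int utf16Start = span.start;
            int utf16End = span.end;
            // Count only the text between the previous start and this one.
            // An out-of-order span makes this call throw IndexOutOfBoundsException.
            codePointPos += text.codePointCount(utf16Pos, utf16Start);
            utf16Pos = utf16Start;
            span.start = codePointPos;
            span.end = codePointPos + text.codePointCount(utf16Start, utf16End);
        }
    }

    public static void main(String[] args) {
        String text = "\uD83D\uDE00 #hi @me"; // leading supplementary character
        List<Span> spans = new ArrayList<>();
        spans.add(new Span(3, 6));  // "#hi" in UTF-16 offsets
        spans.add(new Span(7, 10)); // "@me" in UTF-16 offsets
        toCodePointIndices(text, spans);
        for (Span s : spans) {
            System.out.println(s.start + ".." + s.end);
        }
        // prints 2..5 then 6..9
    }
}
```

Because each step counts only the gap since the previous span, the text is scanned once in total instead of once per entity.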
text - original text
entities - entities with UTF-16 based indices
Copyright © 2018. All rights reserved.