public class Extractor
extends java.lang.Object
Modifier and Type | Class and Description |
---|---|
static class | Extractor.Entity |
Modifier and Type | Field and Description |
---|---|
protected boolean | extractURLWithoutProtocol |
static int | MAX_TCO_SLUG_LENGTH - The maximum t.co path length that the Twitter backend supports. |
static int | MAX_URL_LENGTH - The maximum URL length that the Twitter backend supports. |
Constructor and Description |
---|
Extractor() - Create a new extractor. |
Modifier and Type | Method and Description |
---|---|
java.util.List<java.lang.String> | extractCashtags(java.lang.String text) - Extract $cashtag references from Tweet text. |
java.util.List<Extractor.Entity> | extractCashtagsWithIndices(java.lang.String text) - Extract $cashtag references from Tweet text. |
java.util.List<Extractor.Entity> | extractEntitiesWithIndices(java.lang.String text) - Extract URLs, @mentions, lists and #hashtags from a given text/tweet. |
java.util.List<java.lang.String> | extractHashtags(java.lang.String text) - Extract #hashtag references from Tweet text. |
java.util.List<Extractor.Entity> | extractHashtagsWithIndices(java.lang.String text) - Extract #hashtag references from Tweet text. |
java.util.List<java.lang.String> | extractMentionedScreennames(java.lang.String text) - Extract @username references from Tweet text. |
java.util.List<Extractor.Entity> | extractMentionedScreennamesWithIndices(java.lang.String text) - Extract @username references from Tweet text. |
java.util.List<Extractor.Entity> | extractMentionsOrListsWithIndices(java.lang.String text) - Extract @username and an optional list reference from Tweet text. |
java.lang.String | extractReplyScreenname(java.lang.String text) - Extract a @username reference from the beginning of Tweet text. |
java.util.List<java.lang.String> | extractURLs(java.lang.String text) - Extract URL references from Tweet text. |
java.util.List<Extractor.Entity> | extractURLsWithIndices(java.lang.String text) - Extract URL references from Tweet text. |
boolean | isExtractURLWithoutProtocol() |
static boolean | isValidHostAndLength(int originalUrlLength, java.lang.String protocol, java.lang.String originalHost) - Verifies that the host name adheres to RFC 3490 and RFC 1035, and that the entire URL (including protocol) doesn't exceed MAX_URL_LENGTH. |
void | modifyIndicesFromUnicodeToUTF16(java.lang.String text, java.util.List<Extractor.Entity> entities) - Modify Unicode-based indices of the entities to UTF-16-based indices. |
void | modifyIndicesFromUTF16ToUnicode(java.lang.String text, java.util.List<Extractor.Entity> entities) - Modify UTF-16-based indices of the entities to Unicode-based indices. |
void | setExtractURLWithoutProtocol(boolean extractURLWithoutProtocol) |
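Several methods in the table come in pairs: one returns plain strings, the other returns Extractor.Entity objects that also carry positions in the text. The following is a minimal, self-contained sketch of that relationship for hashtags. It does not use the library: `Span` is a hypothetical stand-in for Extractor.Entity, the regular expression is deliberately simplified (ASCII letters, digits and underscore only), and the choice to start the span at the `#` while storing the value without it is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashtagIndexSketch {
    /** Hypothetical stand-in for Extractor.Entity: a value plus its [start, end) span. */
    static final class Span {
        final int start;   // UTF-16 index of the '#' character
        final int end;     // UTF-16 index just past the last character of the tag
        final String value; // tag text without the leading '#'

        Span(int start, int end, String value) {
            this.start = start;
            this.end = end;
            this.value = value;
        }
    }

    // Simplified pattern for illustration only; NOT the pattern the library uses.
    private static final Pattern SIMPLE_HASHTAG = Pattern.compile("#([A-Za-z0-9_]+)");

    /** Analogue of a "...WithIndices" method: values together with their positions. */
    static List<Span> hashtagsWithIndices(String text) {
        List<Span> spans = new ArrayList<>();
        if (text == null) {
            return spans;
        }
        Matcher m = SIMPLE_HASHTAG.matcher(text);
        while (m.find()) {
            spans.add(new Span(m.start(), m.end(), m.group(1)));
        }
        return spans;
    }

    /** Analogue of the plain variant: the same matches, reduced to their values. */
    static List<String> hashtags(String text) {
        List<String> values = new ArrayList<>();
        for (Span s : hashtagsWithIndices(text)) {
            values.add(s.value);
        }
        return values;
    }

    public static void main(String[] args) {
        String text = "Learning #java and #regex today";
        for (Span s : hashtagsWithIndices(text)) {
            System.out.println(s.value + " [" + s.start + ", " + s.end + ")");
        }
        // prints:
        // java [9, 14)
        // regex [19, 25)
    }
}
```

The indices produced by `Matcher` are UTF-16 indices, which is why a separate conversion step is needed when a consumer expects code point positions.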
public static final int MAX_URL_LENGTH
public static final int MAX_TCO_SLUG_LENGTH
protected boolean extractURLWithoutProtocol
public java.util.List<Extractor.Entity> extractEntitiesWithIndices(java.lang.String text)
Parameters:
text - text of the tweet

public java.util.List<java.lang.String> extractMentionedScreennames(java.lang.String text)
Parameters:
text - text of the tweet from which to extract usernames

public java.util.List<Extractor.Entity> extractMentionedScreennamesWithIndices(java.lang.String text)
Parameters:
text - text of the tweet from which to extract usernames

public java.util.List<Extractor.Entity> extractMentionsOrListsWithIndices(java.lang.String text)
Parameters:
text - text of the tweet from which to extract usernames

public java.lang.String extractReplyScreenname(java.lang.String text)
Parameters:
text - text of the tweet from which to extract the replied-to username

public java.util.List<java.lang.String> extractURLs(java.lang.String text)
Parameters:
text - text of the tweet from which to extract URLs

public java.util.List<Extractor.Entity> extractURLsWithIndices(java.lang.String text)
Parameters:
text - text of the tweet from which to extract URLs

public static boolean isValidHostAndLength(int originalUrlLength, java.lang.String protocol, java.lang.String originalHost)
Parameters:
originalUrlLength - The length of the entire URL, including protocol if any
protocol - The protocol used
originalHost - The hostname to check validity of

public java.util.List<java.lang.String> extractHashtags(java.lang.String text)
Parameters:
text - text of the tweet from which to extract hashtags

public java.util.List<Extractor.Entity> extractHashtagsWithIndices(java.lang.String text)
Parameters:
text - text of the tweet from which to extract hashtags

public java.util.List<java.lang.String> extractCashtags(java.lang.String text)
Parameters:
text - text of the tweet from which to extract cashtags

public java.util.List<Extractor.Entity> extractCashtagsWithIndices(java.lang.String text)
Parameters:
text - text of the tweet from which to extract cashtags

public void setExtractURLWithoutProtocol(boolean extractURLWithoutProtocol)

public boolean isExtractURLWithoutProtocol()
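
The host and length check documented for isValidHostAndLength can be approximated with the JDK's `java.net.IDN` class. The sketch below is an illustration under stated assumptions, not the library's implementation: `ASSUMED_MAX_URL_LENGTH` and `ASSUMED_DEFAULT_PROTOCOL` are hypothetical placeholders (this page does not state the value of MAX_URL_LENGTH), and counting a default protocol when `protocol` is null is a guess at what "including protocol" means for protocol-less URLs.

```java
import java.net.IDN;

class HostLengthSketch {
    // Placeholder limit for illustration; the real limit is Extractor.MAX_URL_LENGTH.
    static final int ASSUMED_MAX_URL_LENGTH = 4096;
    // Assumed protocol that would be prepended to a URL written without one.
    static final String ASSUMED_DEFAULT_PROTOCOL = "https://";
    // RFC 1035 limits each label of a host name to 63 octets.
    static final int MAX_LABEL_LENGTH = 63;

    static boolean hostAndLengthLookValid(int originalUrlLength, String protocol, String originalHost) {
        if (originalHost == null || originalHost.isEmpty()) {
            return false;
        }
        final String asciiHost;
        try {
            // RFC 3490 ToASCII conversion; throws IllegalArgumentException for non-conforming input.
            asciiHost = IDN.toASCII(originalHost, IDN.ALLOW_UNASSIGNED);
        } catch (IllegalArgumentException e) {
            return false;
        }
        // Explicit label check; this sketch also rejects empty labels, including a trailing dot.
        for (String label : asciiHost.split("\\.", -1)) {
            if (label.isEmpty() || label.length() > MAX_LABEL_LENGTH) {
                return false;
            }
        }
        // Punycode may lengthen the host, so adjust the URL length by the difference.
        int length = originalUrlLength + (asciiHost.length() - originalHost.length());
        if (protocol == null) {
            length += ASSUMED_DEFAULT_PROTOCOL.length();
        }
        return length <= ASSUMED_MAX_URL_LENGTH;
    }

    public static void main(String[] args) {
        String url = "https://example.com/path";
        System.out.println(hostAndLengthLookValid(url.length(), "https://", "example.com")); // prints "true"
    }
}
```

Measuring the length after the ASCII conversion matters because an internationalized host can be considerably longer in its encoded form than in the text the user typed.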

public void modifyIndicesFromUnicodeToUTF16(java.lang.String text, java.util.List<Extractor.Entity> entities)
In UTF-16 based indices, Unicode supplementary characters are counted as two characters.
This method requires that the list of entities be in ascending order by start index.
Parameters:
text - original text
entities - entities with Unicode based indices

public void modifyIndicesFromUTF16ToUnicode(java.lang.String text, java.util.List<Extractor.Entity> entities)
In Unicode-based indices, Unicode supplementary characters are counted as single characters.
This method requires that the list of entities be in ascending order by start index.
Parameters:
text - original text
entities - entities with UTF-16 based indices
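
The difference between the two index systems can be shown with standard `java.lang.String` methods alone. This is a minimal sketch of the conversion for a single index, not the library's code; the class and method names are hypothetical.

```java
class IndexConversionSketch {
    /** Converts a Unicode (code point) index into the equivalent UTF-16 index. */
    static int codePointToUtf16(String text, int codePointIndex) {
        return text.offsetByCodePoints(0, codePointIndex);
    }

    /** Converts a UTF-16 index into the equivalent Unicode (code point) index. */
    static int utf16ToCodePoint(String text, int utf16Index) {
        return text.codePointCount(0, utf16Index);
    }

    public static void main(String[] args) {
        // U+1F600 is a supplementary character: one code point, two UTF-16 chars.
        String text = "\uD83D\uDE00 #java";

        // The hashtag spans code points [2, 7) ...
        int startUtf16 = codePointToUtf16(text, 2);
        int endUtf16 = codePointToUtf16(text, 7);
        System.out.println(startUtf16 + ", " + endUtf16); // prints "3, 8"

        // ... and converting back recovers the code point span.
        System.out.println(utf16ToCodePoint(text, startUtf16) + ", "
                + utf16ToCodePoint(text, endUtf16)); // prints "2, 7"
    }
}
```

Each call here rescans the text from the start, so the order of the indices does not matter. Requiring entities in ascending order by start index, as the methods above do, is what would allow a conversion to walk the text once and adjust every entity in a single pass.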