All Implemented Interfaces:
java.lang.Comparable
public final class WebPage implements Comparable<WebPage>
Every WebPage is a separate execution context.
Note: should a built-in Java String or a Utf8 be used to serialize strings?
See org.apache.gora.hbase.util.HBaseByteInterface#fromBytes.
In the serialization phase, a byte array created by s.getBytes(UTF8_CHARSET) is serialized; in the deserialization phase, every string is wrapped as a Utf8.
So either a built-in String or a Utf8 wrapper can be serialized, and a Utf8 is always returned.
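The round trip described above can be sketched with plain JDK classes. This is only an illustration under assumptions: the class and method names are hypothetical, and a String stands in for the Utf8 wrapper that the store is said to return.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of WebPage or Gora.
final class Utf8RoundTripSketch {
    // Serialization phase: the text is written as its UTF-8 bytes,
    // whether it arrived as a String or as another CharSequence.
    static byte[] serialize(CharSequence text) {
        return text.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Deserialization phase: the bytes are decoded again.
    // Assumption: a String stands in for the Utf8 wrapper here.
    static String deserialize(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] fromString = serialize("caf\u00e9");
        byte[] fromBuilder = serialize(new StringBuilder("caf\u00e9"));
        // Both inputs produce the same bytes, so either form can be stored.
        System.out.println(Arrays.equals(fromString, fromBuilder));
        System.out.println(deserialize(fromString).equals("caf\u00e9"));
    }
}
```

Because both input forms produce identical bytes, the choice between them should not affect what is read back.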
-
-
Field Summary
Fields
Modifier and Type, Field
public final static Logger LOG
public static AtomicInteger sequencer
public static WebPage NIL
private String url
private String reversedUrl
private VolatileConfig conf
private final Variables variables
private boolean isCached
private boolean isLoaded
private boolean isFetched
private volatile boolean isContentUpdated
private volatile ByteBuffer tmpContent
private String args
-
Method Summary
Modifier and Type, Method, Description
String getUrl() - page.location is the last working address, and page.url is the permanent internal address.
String getReversedUrl() - Getter for the field reversedUrl.
VolatileConfig getConf()
void setConf(@NotNull() VolatileConfig conf)
Variables getVariables() - Section: Common fields
boolean isCached()
boolean isLoaded()
boolean isFetched()
boolean isContentUpdated()
ByteBuffer getTmpContent() - Get the cached content.
void setTmpContent(ByteBuffer tmpContent) - Set the cached content, keep the page content unmodified.
String getArgs() - The load arguments vary from task to task, so the local version is the first choice, while the persisted version is used for historical checks only.
void setArgs(@NotNull() String args) - Set the local args variable and the persisted version, and also clear the load options.
static WebPage newWebPage(@NotNull() String url, @NotNull() VolatileConfig conf)
static WebPage newTestWebPage(@NotNull() String url)
static WebPage newWebPage(@NotNull() String url, @NotNull() VolatileConfig conf, @Nullable() String href)
static WebPage newInternalPage(@NotNull() String url)
static WebPage newInternalPage(@NotNull() String url, @NotNull() String title)
static WebPage newInternalPage(@NotNull() String url, @NotNull() String title, @NotNull() String content)
static WebPage newInternalPage(@NotNull() String url, int id, @NotNull() String title, @NotNull() String content)
static WebPage box(@NotNull() String url, @NotNull() String reversedUrl, @NotNull() GWebPage page, @NotNull() VolatileConfig conf) - Initialize a WebPage with the underlying GWebPage instance.
static WebPage box(@NotNull() String url, @NotNull() GWebPage page, @NotNull() VolatileConfig conf) - Initialize a WebPage with the underlying GWebPage instance.
static WebPage box(@NotNull() String url, @NotNull() GWebPage page, boolean urlReversed, @NotNull() VolatileConfig conf) - Initialize a WebPage with the underlying GWebPage instance.
static Utf8 wrapKey(@NotNull() Mark mark) - Section: Other
static Utf8 u8(@Nullable() String value) - What's the difference between String and Utf8?
String getKey()
String getHref() - Get the hypertext reference of this page.
void setHref(@Nullable() String href) - Set the hypertext reference of this page.
boolean isNil()
boolean isNotNil()
boolean isInternal()
boolean isNotInternal()
GWebPage unbox()
void unsafeSetGPage(@NotNull() GWebPage page)
void unsafeCloneGPage(WebPage page)
boolean hasVar(@NotNull() String name) - Check whether a page scope temporary variable with the given name exists.
Object getVar(@NotNull() String name) - Get a page scope temporary variable.
Object removeVar(@NotNull() String name) - Get and remove a page scope temporary variable.
void setVar(@NotNull() String name, @NotNull() Object value) - Set a page scope temporary variable.
void setCached(boolean cached)
void setLoaded(boolean loaded)
void setFetched(boolean fetched)
int getMaxRetries()
Metadata getMetadata()
CrawlMarks getMarks()
boolean hasMark(Mark mark)
String getConfiguredUrl()
int getFetchedLinkCount()
void setFetchedLinkCount(int count)
ZoneId getZoneId()
void setZoneId(@NotNull() ZoneId zoneId)
String getBatchId()
void setBatchId(String value)
void markSeed()
void unmarkSeed()
boolean isSeed()
int getDistance()
void setDistance(int newDistance)
void updateDistance(int newDistance)
FetchMode getFetchMode()
void setFetchMode(@NotNull() FetchMode mode) - Fetch mode is used to determine the protocol before fetching, so it must be set before the fetch.
BrowserType getLastBrowser()
void setLastBrowser(@NotNull() BrowserType browser)
HtmlIntegrity getHtmlIntegrity()
void setHtmlIntegrity(@NotNull() HtmlIntegrity integrity)
int getFetchPriority()
void setFetchPriority(int priority)
int sniffFetchPriority()
Instant getCreateTime()
void setCreateTime(@NotNull() Instant createTime)
Instant getGenerateTime()
void setGenerateTime(@NotNull() Instant generateTime)
Instant getModelSyncTime()
void setModelSyncTime(@Nullable() Instant modelSyncTime)
int getFetchCount()
void setFetchCount(int count)
void updateFetchCount()
CrawlStatus getCrawlStatus()
void setCrawlStatus(@NotNull() CrawlStatus crawlStatus)
void setCrawlStatus(int value)
String getBaseUrl() - The baseUrl is the same as the location. A baseUrl has the same semantics as in Jsoup.parse.
String getLocation() - WebPage.url is the permanent internal address; it might no longer be usable to access the target. WebPage.location (or WebPage.baseUrl) is the last working address; it might redirect to url, or it might have additional random parameters. WebPage.location may differ from url; it is generally normalized.
void setLocation(@NotNull() String value) - The url is the permanent internal address; it might no longer be usable to access the target.
Instant getFetchTime() - The latest fetch time.
void setFetchTime(@NotNull() Instant time) - The latest fetch time.
Instant getPrevFetchTime() - The previous fetch time, updated at the fetch stage.
void setPrevFetchTime(@NotNull() Instant time)
Instant getPrevCrawlTime1() - The previous crawl time, used for fat link crawl, which means both the page itself and out pages are fetched.
void setPrevCrawlTime1(@NotNull() Instant time) - The previous crawl time, used for fat link crawl, which means both the page itself and out pages are fetched.
Duration getFetchInterval()
void setFetchInterval(@NotNull() Duration duration)
void setFetchInterval(long interval)
void setFetchInterval(float interval)
ProtocolStatus getProtocolStatus()
void setProtocolStatus(@NotNull() ProtocolStatus protocolStatus)
ProtocolHeaders getHeaders() - Header information returned from the web server that served the fetched content. This includes keys such as TRANSFER_ENCODING, CONTENT_ENCODING, CONTENT_LANGUAGE, CONTENT_LENGTH, CONTENT_LOCATION, CONTENT_DISPOSITION, CONTENT_MD5, CONTENT_TYPE, LAST_MODIFIED and LOCATION.
String getReprUrl()
void setReprUrl(@NotNull() String value)
int getFetchRetries()
void setFetchRetries(int value)
Duration getLastTimeout()
Instant getModifiedTime()
void setModifiedTime(@NotNull() Instant value)
Instant getPrevModifiedTime()
void setPrevModifiedTime(@NotNull() Instant value)
Instant sniffModifiedTime()
String getFetchTimeHistory(@NotNull() String defaultValue)
void updateFetchTimeHistory(@NotNull() Instant fetchTime) - Section: Parsing
Instant getFirstFetchTime() - Get the first fetch time.
String getNamespace() - namespace: metadata, seed, www (reserved).
void setNamespace(String ns) - reserved
PageCategory getPageCategory()
void setPageCategory(@NotNull() PageCategory pageCategory) - category: index, detail, review, media, search, etc.
void setPageCategory(@NotNull() OpenPageCategory pageCategory)
String getEncoding()
void setEncoding(@Nullable() String encoding)
String getEncodingOrDefault(@NotNull() String defaultEncoding) - Get the content encoding. The content encoding is detected just before the content is parsed.
String getEncodingClues()
void setEncodingClues(@NotNull() String clues)
ByteBuffer getContent() - The entire raw document content, e.g. raw XHTML.
ByteBuffer getPersistContent() - Get the uncached content.
Array<byte> getContentAsBytes() - Get the content as bytes; the underlying buffer is duplicated.
String getContentAsString() - Get the page content as a string. TODO: Encoding is always UTF-8?
ByteArrayInputStream getContentAsInputStream() - Get the page content as an input stream.
InputSource getContentAsSaxInputSource() - Get the page content as a SAX input source.
void setContent(@Nullable() String value) - Set the page content.
void setContent(@Nullable() Array<byte> value) - Set the page content.
void setContent(@Nullable() ByteBuffer value) - Set the page content.
void clearPersistContent()
long getContentLength() - TODO: check consistency with HttpHeaders.CONTENT_LENGTH
long getPersistContentLength()
long getLastContentBytes()
long getAveContentBytes()
String getContentType()
void setContentType(String value)
ByteBuffer getPrevSignature()
void setPrevSignature(@Nullable() ByteBuffer value)
String getPrevSignatureAsString()
String getProxy()
void setProxy(@Nullable() String proxy)
ActiveDomStatus getActiveDomStatus()
void setActiveDomStatus(ActiveDomStatus s)
Map<String, ActiveDomStat> getActiveDomStats()
void setActiveDomStats(@NotNull() Map<String, ActiveDomStat> stats)
ActiveDomUrls getActiveDomUrls()
void setActiveDomUrls(@NotNull() ActiveDomUrls urls)
ByteBuffer getSignature() - An implementation of a WebPage's signature from which it can be identified and referenced at any point in time. This is essentially the WebPage's fingerprint, representing its state at any point in time.
void setSignature(Array<byte> value)
String getSignatureAsString()
String getPageTitle()
void setPageTitle(String pageTitle)
String getContentTitle()
void setContentTitle(String contentTitle)
String sniffTitle()
String getPageText()
void setPageText(String value)
String getContentText()
void setContentText(String textContent)
int getContentTextLen()
void setTextCascaded(String text)
ParseStatus getParseStatus()
void setParseStatus(ParseStatus parseStatus)
Map<CharSequence, GHypeLink> getLiveLinks()
Collection<String> getSimpleLiveLinks()
void setLiveLinks(Iterable<HyperlinkPersistable> liveLinks) - TODO: Remove redundant url to reduce space
void setLiveLinks(Map<CharSequence, GHypeLink> links)
void addLiveLink(HyperlinkPersistable hyperLink)
Map<CharSequence, CharSequence> getVividLinks()
Collection<String> getSimpleVividLinks()
void setVividLinks(Map<CharSequence, CharSequence> links)
List<CharSequence> getDeadLinks()
void setDeadLinks(List<CharSequence> deadLinks)
List<CharSequence> getLinks()
void setLinks(List<CharSequence> links)
void addHyperlinks(Iterable<HyperlinkPersistable> hypeLinks) - Record all links that appear in a page. The links are kept in FIFO order: each time we fetch and parse a page, we push newly discovered links to the queue; if the queue is full, we drop some old ones, which usually no longer appear in the page.
void addLinks(Iterable<CharSequence> hypeLinks)
int getImpreciseLinkCount()
void setImpreciseLinkCount(int count)
void increaseImpreciseLinkCount(int count)
Map<CharSequence, CharSequence> getInlinks()
CharSequence getAnchor()
void setAnchor(CharSequence anchor) - Anchor can be used to sniff the article title.
Array<String> getInlinkAnchors()
void setInlinkAnchors(Collection<CharSequence> anchors)
int getAnchorOrder()
void setAnchorOrder(int order)
Instant getContentPublishTime()
void setContentPublishTime(Instant publishTime)
boolean updateContentPublishTime(Instant newPublishTime)
Instant getPrevContentPublishTime()
void setPrevContentPublishTime(Instant publishTime)
Instant getRefContentPublishTime()
void setRefContentPublishTime(Instant publishTime)
Instant getContentModifiedTime()
void setContentModifiedTime(Instant modifiedTime)
Instant getPrevContentModifiedTime()
void setPrevContentModifiedTime(Instant modifiedTime)
boolean updateContentModifiedTime(Instant newModifiedTime)
Instant getPrevRefContentPublishTime()
void setPrevRefContentPublishTime(Instant publishTime)
boolean updateRefContentPublishTime(Instant newRefPublishTime)
String getReferrer()
void setReferrer(String referrer)
PageModel getPageModel() - Section: Page Model
float getScore() - Section: Scoring
void setScore(float value)
float getContentScore()
void setContentScore(float score)
String getSortScore()
void setSortScore(String score)
float getCash()
void setCash(float cash)
PageCounters getPageCounters()
String getIndexTimeHistory(String defaultValue) - Section: Index
void putIndexTimeHistory(Instant indexTime)
Instant getFirstIndexTime(Instant defaultValue)
int hashCode()
int compareTo(@NotNull() WebPage o)
boolean equals(Object other)
String toString()
-
Method Detail
-
getUrl
@NotNull() String getUrl()
page.location is the last working address, and page.url is the permanent internal address
-
getReversedUrl
@NotNull() String getReversedUrl()
Getter for the field reversedUrl.
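The exact format of the reversed url is not stated here. As a rough sketch only, the following assumes a host-reversing convention of the kind often used for row keys in column stores, so that pages of one domain sort together. The class name, method name, and output layout are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; the layout "reversed.host:scheme[:port]/path?query"
// is an assumption, not the documented WebPage format.
final class ReversedUrlSketch {
    static String reverse(String url) {
        // URI.create throws IllegalArgumentException on malformed input.
        URI uri = URI.create(url);
        // Assumes an absolute URL with a host; getHost() is null otherwise.
        List<String> labels = Arrays.asList(uri.getHost().split("\\."));
        Collections.reverse(labels);
        StringBuilder sb = new StringBuilder(String.join(".", labels));
        sb.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            sb.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverse("http://www.example.com/a/b?x=1"));
    }
}
```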
-
getConf
@NotNull() VolatileConfig getConf()
-
setConf
void setConf(@NotNull() VolatileConfig conf)
-
getVariables
@NotNull() Variables getVariables()
Section: Common fields
-
isCached
boolean isCached()
-
isLoaded
boolean isLoaded()
-
isFetched
boolean isFetched()
-
isContentUpdated
boolean isContentUpdated()
-
getTmpContent
@Nullable() ByteBuffer getTmpContent()
Get the cached content
-
setTmpContent
void setTmpContent(ByteBuffer tmpContent)
Set the cached content, keep the page content unmodified
-
getArgs
@NotNull() String getArgs()
The load arguments vary from task to task, so the local version is the first choice, while the persisted version is used for historical checks only.
-
setArgs
void setArgs(@NotNull() String args)
Set the local args variable and the persisted version, and also clear the load options.
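The local-first behaviour of getArgs and setArgs can be pictured with a small stand-in class. This is a sketch under assumptions: the class, its fields, and the extra accessors are hypothetical and only mirror the description given here, not the actual WebPage implementation.

```java
// Hypothetical stand-in for the args handling described here.
final class LoadArgsSketch {
    private String localArgs;     // set for the current task, may be null
    private String persistedArgs; // stands in for the stored version
    private Object loadOptions;   // hypothetical cached parse of the args

    LoadArgsSketch(String persistedArgs) {
        this.persistedArgs = persistedArgs;
        this.loadOptions = new Object();
    }

    // The local version is the first choice; the persisted one is a fallback.
    String getArgs() {
        return localArgs != null ? localArgs : persistedArgs;
    }

    // Updates both versions and drops the cached load options.
    void setArgs(String args) {
        this.localArgs = args;
        this.persistedArgs = args;
        this.loadOptions = null;
    }

    String getPersistedArgs() {
        return persistedArgs;
    }

    boolean hasLoadOptions() {
        return loadOptions != null;
    }

    public static void main(String[] args) {
        LoadArgsSketch page = new LoadArgsSketch("old args");
        page.setArgs("new args");
        System.out.println(page.getArgs());
    }
}
```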
-
newWebPage
@NotNull() static WebPage newWebPage(@NotNull() String url, @NotNull() VolatileConfig conf)
newWebPage.
- Parameters:
url - a java.lang.String object.
conf - an ai.platon.pulsar.common.config.VolatileConfig object.
-
newTestWebPage
@NotNull() static WebPage newTestWebPage(@NotNull() String url)
newTestWebPage.
- Parameters:
url - a java.lang.String object.
-
newWebPage
@NotNull() static WebPage newWebPage(@NotNull() String url, @NotNull() VolatileConfig conf, @Nullable() String href)
newWebPage.
- Parameters:
url - a java.lang.String object.
conf - an ai.platon.pulsar.common.config.VolatileConfig object.
-
newInternalPage
@NotNull() static WebPage newInternalPage(@NotNull() String url)
newInternalPage.
- Parameters:
url - a java.lang.String object.
-
newInternalPage
@NotNull() static WebPage newInternalPage(@NotNull() String url, @NotNull() String title)
newInternalPage.
- Parameters:
url - a java.lang.String object.
title - a java.lang.String object.
-
newInternalPage
@NotNull() static WebPage newInternalPage(@NotNull() String url, @NotNull() String title, @NotNull() String content)
newInternalPage.
- Parameters:
url - a java.lang.String object.
title - a java.lang.String object.
content - a java.lang.String object.
-
newInternalPage
@NotNull() static WebPage newInternalPage(@NotNull() String url, int id, @NotNull() String title, @NotNull() String content)
newInternalPage.
- Parameters:
url - a java.lang.String object.
id - an int.
title - a java.lang.String object.
content - a java.lang.String object.
-
box
@NotNull() static WebPage box(@NotNull() String url, @NotNull() String reversedUrl, @NotNull() GWebPage page, @NotNull() VolatileConfig conf)
Initialize a WebPage with the underlying GWebPage instance.
- Parameters:
url - a java.lang.String object.
reversedUrl - a java.lang.String object.
page - an ai.platon.pulsar.persist.gora.generated.GWebPage object.
-
box
@NotNull() static WebPage box(@NotNull() String url, @NotNull() GWebPage page, @NotNull() VolatileConfig conf)
Initialize a WebPage with the underlying GWebPage instance.
- Parameters:
url - a java.lang.String object.
page - an ai.platon.pulsar.persist.gora.generated.GWebPage object.
-
box
@NotNull() static WebPage box(@NotNull() String url, @NotNull() GWebPage page, boolean urlReversed, @NotNull() VolatileConfig conf)
Initialize a WebPage with the underlying GWebPage instance.
- Parameters:
url - a java.lang.String object.
page - an ai.platon.pulsar.persist.gora.generated.GWebPage object.
urlReversed - a boolean.
-
wrapKey
@NotNull() static Utf8 wrapKey(@NotNull() Mark mark)
Section: Other
- Parameters:
mark - an ai.platon.pulsar.persist.metadata.Mark object.
-
u8
@Nullable() static Utf8 u8(@Nullable() String value)
What's the difference between String and Utf8?
- Parameters:
value - a java.lang.String object.
-
getHref
@Nullable() String getHref()
Get the hypertext reference of this page. It defines the address of the document as linked this time.
TODO: use a separate field to hold href
-
setHref
void setHref(@Nullable() String href)
Set the hypertext reference of this page. It defines the address of the document as linked this time.
- Parameters:
href - The hypertext reference
-
isNil
boolean isNil()
isNil.
-
isNotNil
boolean isNotNil()
isNotNil.
-
isInternal
boolean isInternal()
isInternal.
-
isNotInternal
boolean isNotInternal()
isNotInternal.
-
unsafeSetGPage
void unsafeSetGPage(@NotNull() GWebPage page)
-
unsafeCloneGPage
void unsafeCloneGPage(WebPage page)
-
hasVar
boolean hasVar(@NotNull() String name)
Check whether a page scope temporary variable with the given name exists.
- Parameters:
name - The variable name to check
-
getVar
Object getVar(@NotNull() String name)
Get a page scope temporary variable
- Parameters:
name - a String object.
-
removeVar
Object removeVar(@NotNull() String name)
Get and remove a page scope temporary variable.
- Parameters:
name - a java.lang.String object.
-
setVar
void setVar(@NotNull() String name, @NotNull() Object value)
Set a page scope temporary variable
- Parameters:
name - The variable name.
value - The variable value.
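The four page scope variable methods (hasVar, getVar, removeVar, setVar) behave like operations on a name-to-value map. The sketch below assumes a plain HashMap as backing store; the class is hypothetical and does not reproduce the Variables type used by WebPage.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical map-backed stand-in for page scope temporary variables.
final class PageVariablesSketch {
    private final Map<String, Object> variables = new HashMap<>();

    boolean hasVar(String name) {
        return variables.containsKey(name);
    }

    // Returns null when no variable with this name exists.
    Object getVar(String name) {
        return variables.get(name);
    }

    // Returns the removed value, or null when nothing was stored.
    Object removeVar(String name) {
        return variables.remove(name);
    }

    void setVar(String name, Object value) {
        variables.put(name, value);
    }

    public static void main(String[] args) {
        PageVariablesSketch vars = new PageVariablesSketch();
        vars.setVar("attempt", 2);
        System.out.println(vars.hasVar("attempt"));
        System.out.println(vars.removeVar("attempt"));
        System.out.println(vars.hasVar("attempt"));
    }
}
```

Such variables live only as long as the in-memory page object, which matches the "temporary" wording in the descriptions.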
-
setCached
void setCached(boolean cached)
-
setLoaded
void setLoaded(boolean loaded)
-
setFetched
void setFetched(boolean fetched)
-
getMaxRetries
int getMaxRetries()
-
getMetadata
Metadata getMetadata()
-
getMarks
CrawlMarks getMarks()
-
getConfiguredUrl
@NotNull() String getConfiguredUrl()
getConfiguredUrl.
-
getFetchedLinkCount
int getFetchedLinkCount()
-
setFetchedLinkCount
void setFetchedLinkCount(int count)
-
getBatchId
String getBatchId()
-
setBatchId
void setBatchId(String value)
-
markSeed
void markSeed()
-
unmarkSeed
void unmarkSeed()
-
isSeed
boolean isSeed()
-
getDistance
int getDistance()
-
setDistance
void setDistance(int newDistance)
-
updateDistance
void updateDistance(int newDistance)
-
getFetchMode
@NotNull() FetchMode getFetchMode()
-
setFetchMode
void setFetchMode(@NotNull() FetchMode mode)
Fetch mode is used to determine the protocol before fetching, so it must be set before the fetch.
-
getLastBrowser
@NotNull() BrowserType getLastBrowser()
-
setLastBrowser
void setLastBrowser(@NotNull() BrowserType browser)
-
getHtmlIntegrity
@NotNull() HtmlIntegrity getHtmlIntegrity()
-
setHtmlIntegrity
void setHtmlIntegrity(@NotNull() HtmlIntegrity integrity)
-
getFetchPriority
int getFetchPriority()
-
setFetchPriority
void setFetchPriority(int priority)
-
sniffFetchPriority
int sniffFetchPriority()
-
getCreateTime
@NotNull() Instant getCreateTime()
-
setCreateTime
void setCreateTime(@NotNull() Instant createTime)
-
getGenerateTime
@NotNull() Instant getGenerateTime()
-
setGenerateTime
void setGenerateTime(@NotNull() Instant generateTime)
-
getModelSyncTime
@Nullable() Instant getModelSyncTime()
-
setModelSyncTime
void setModelSyncTime(@Nullable() Instant modelSyncTime)
-
getFetchCount
int getFetchCount()
-
setFetchCount
void setFetchCount(int count)
-
updateFetchCount
void updateFetchCount()
-
getCrawlStatus
@NotNull() CrawlStatus getCrawlStatus()
-
setCrawlStatus
void setCrawlStatus(@NotNull() CrawlStatus crawlStatus)
-
setCrawlStatus
void setCrawlStatus(int value)
-
getBaseUrl
String getBaseUrl()
The baseUrl is the same as the location.
A baseUrl has the same semantics as in Jsoup.parse.
-
getLocation
String getLocation()
WebPage.url is the permanent internal address; it might no longer be usable to access the target.
WebPage.location (or WebPage.baseUrl) is the last working address; it might redirect to url, or it might have additional random parameters.
WebPage.location may differ from url; it is generally normalized.
-
setLocation
void setLocation(@NotNull() String value)
The url is the permanent internal address; it might no longer be usable to access the target.
Location is the last working address; it might redirect to url, or it might have additional random parameters.
Location may differ from url; it is generally normalized.
- Parameters:
value - The location.
-
getFetchTime
@NotNull() Instant getFetchTime()
The latest fetch time
-
setFetchTime
void setFetchTime(@NotNull() Instant time)
The latest fetch time
- Parameters:
time - The latest fetch time
-
getPrevFetchTime
@NotNull() Instant getPrevFetchTime()
The previous fetch time, updated at the fetch stage
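One way to read "updated at the fetch stage" is that each fetch moves the latest fetch time into the previous slot before recording the new one. The sketch below shows that reading only; the class, the onFetched method, and the Instant.EPOCH defaults are assumptions, not taken from WebPage.

```java
import java.time.Instant;

// Hypothetical sketch of rotating fetch times at the fetch stage.
final class FetchTimeSketch {
    // Assumption: EPOCH marks "never fetched".
    private Instant prevFetchTime = Instant.EPOCH;
    private Instant fetchTime = Instant.EPOCH;

    // Called once per fetch: the old latest time becomes the previous time.
    void onFetched(Instant now) {
        prevFetchTime = fetchTime;
        fetchTime = now;
    }

    Instant getFetchTime() {
        return fetchTime;
    }

    Instant getPrevFetchTime() {
        return prevFetchTime;
    }

    public static void main(String[] args) {
        FetchTimeSketch page = new FetchTimeSketch();
        page.onFetched(Instant.ofEpochSecond(100));
        page.onFetched(Instant.ofEpochSecond(200));
        System.out.println(page.getPrevFetchTime() + " -> " + page.getFetchTime());
    }
}
```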
-
setPrevFetchTime
void setPrevFetchTime(@NotNull() Instant time)
-
getPrevCrawlTime1
@NotNull() Instant getPrevCrawlTime1()
The previous crawl time, used for fat link crawl, which means both the page itself and out pages are fetched
-
setPrevCrawlTime1
void setPrevCrawlTime1(@NotNull() Instant time)
The previous crawl time, used for fat link crawl, which means both the page itself and out pages are fetched
-
getFetchInterval
@NotNull() Duration getFetchInterval()
-
setFetchInterval
void setFetchInterval(@NotNull() Duration duration)
-
setFetchInterval
void setFetchInterval(long interval)
-
setFetchInterval
void setFetchInterval(float interval)
-
getProtocolStatus
@NotNull() ProtocolStatus getProtocolStatus()
-
setProtocolStatus
void setProtocolStatus(@NotNull() ProtocolStatus protocolStatus)
-
getHeaders
@NotNull() ProtocolHeaders getHeaders()
Header information returned from the web server that served the fetched content. This includes keys such as TRANSFER_ENCODING, CONTENT_ENCODING, CONTENT_LANGUAGE, CONTENT_LENGTH, CONTENT_LOCATION, CONTENT_DISPOSITION, CONTENT_MD5, CONTENT_TYPE, LAST_MODIFIED and LOCATION.
-
getReprUrl
@NotNull() String getReprUrl()
-
setReprUrl
void setReprUrl(@NotNull() String value)
-
getFetchRetries
int getFetchRetries()
-
setFetchRetries
void setFetchRetries(int value)
-
getLastTimeout
@NotNull() Duration getLastTimeout()
-
getModifiedTime
@NotNull() Instant getModifiedTime()
-
setModifiedTime
void setModifiedTime(@NotNull() Instant value)
-
getPrevModifiedTime
@NotNull() Instant getPrevModifiedTime()
-
setPrevModifiedTime
void setPrevModifiedTime(@NotNull() Instant value)
-
sniffModifiedTime
@NotNull() Instant sniffModifiedTime()
-
getFetchTimeHistory
@NotNull() String getFetchTimeHistory(@NotNull() String defaultValue)
-
updateFetchTimeHistory
void updateFetchTimeHistory(@NotNull() Instant fetchTime)
Section: Parsing
-
getFirstFetchTime
@Nullable() Instant getFirstFetchTime()
Get the first fetch time
-
getNamespace
@Nullable() String getNamespace()
namespace: metadata, seed, www
reserved
-
setNamespace
void setNamespace(String ns)
reserved
- Parameters:
ns - a java.lang.String object.
-
getPageCategory
@NotNull() PageCategory getPageCategory()
getPageCategory.
-
setPageCategory
void setPageCategory(@NotNull() PageCategory pageCategory)
category: index, detail, review, media, search, etc.
- Parameters:
pageCategory - an ai.platon.pulsar.persist.metadata.PageCategory object.
-
setPageCategory
void setPageCategory(@NotNull() OpenPageCategory pageCategory)
-
getEncoding
@Nullable() String getEncoding()
getEncoding.
-
setEncoding
void setEncoding(@Nullable() String encoding)
setEncoding.
- Parameters:
encoding - a java.lang.String object.
-
getEncodingOrDefault
@NotNull() String getEncodingOrDefault(@NotNull() String defaultEncoding)
Get the content encoding. The content encoding is detected just before the content is parsed.
- Parameters:
defaultEncoding - a java.lang.String object.
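Since the encoding is only detected shortly before parsing, a caller may see no encoding yet and needs a fallback. A minimal sketch of that fallback follows; the helper is hypothetical, and treating an empty string like a missing value is an assumption.

```java
// Hypothetical helper mirroring the "encoding or default" idea.
final class EncodingSketch {
    // Returns the detected encoding when present, else the given default.
    // Assumption: an empty string counts as "not detected".
    static String encodingOrDefault(String detected, String defaultEncoding) {
        return (detected != null && !detected.isEmpty()) ? detected : defaultEncoding;
    }

    public static void main(String[] args) {
        System.out.println(encodingOrDefault(null, "UTF-8"));
        System.out.println(encodingOrDefault("GBK", "UTF-8"));
    }
}
```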
-
getEncodingClues
@NotNull() String getEncodingClues()
getEncodingClues.
-
setEncodingClues
void setEncodingClues(@NotNull() String clues)
setEncodingClues.
- Parameters:
clues - a java.lang.String object.
-
getContent
@Nullable() ByteBuffer getContent()
The entire raw document content e.g. raw XHTML
-
getPersistContent
@Nullable() ByteBuffer getPersistContent()
Get the uncached content
-
getContentAsBytes
@NotNull() Array<byte> getContentAsBytes()
Get the content as bytes; the underlying buffer is duplicated
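Duplicating the buffer matters because reading from a ByteBuffer advances its position. A sketch of the idea using ByteBuffer.duplicate(), which shares the bytes but keeps an independent position and limit; the helper class is hypothetical, and returning an empty array for null content is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing why the buffer is duplicated before reading.
final class ContentBytesSketch {
    static byte[] toBytes(ByteBuffer content) {
        if (content == null) {
            return new byte[0]; // assumption: no content yields an empty array
        }
        // The duplicate shares the bytes but has its own position and limit,
        // so reading from it leaves the original buffer untouched.
        ByteBuffer view = content.duplicate();
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        ByteBuffer content = ByteBuffer.wrap("<html/>".getBytes(StandardCharsets.UTF_8));
        byte[] bytes = toBytes(content);
        System.out.println(bytes.length + " bytes read, position still " + content.position());
    }
}
```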
-
getContentAsString
@NotNull() String getContentAsString()
TODO: Encoding is always UTF-8?
Get the page content as a string
-
getContentAsInputStream
@NotNull() ByteArrayInputStream getContentAsInputStream()
Get the page content as input stream
-
getContentAsSaxInputSource
@NotNull() InputSource getContentAsSaxInputSource()
Get the page content as a SAX input source
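Both stream-style accessors can be pictured as thin wrappers over the content bytes. The sketch below uses only JDK classes (ByteArrayInputStream and org.xml.sax.InputSource); the helper class and the optional encoding parameter are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;

// Hypothetical helper wrapping content bytes for stream and SAX consumers.
final class ContentStreamSketch {
    static ByteArrayInputStream asInputStream(byte[] content) {
        return new ByteArrayInputStream(content);
    }

    // Assumption: the encoding is passed in explicitly and may be null.
    static InputSource asSaxInputSource(byte[] content, String encoding) {
        InputSource source = new InputSource(asInputStream(content));
        if (encoding != null) {
            source.setEncoding(encoding);
        }
        return source;
    }

    public static void main(String[] args) {
        byte[] content = "<p>hi</p>".getBytes(StandardCharsets.UTF_8);
        InputSource source = asSaxInputSource(content, "UTF-8");
        System.out.println(source.getEncoding());
    }
}
```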
-
setContent
void setContent(@Nullable() String value)
Set the page content
-
setContent
void setContent(@Nullable() Array<byte> value)
Set the page content
-
setContent
void setContent(@Nullable() ByteBuffer value)
Set the page content
- Parameters:
value - a ByteBuffer.
-
clearPersistContent
void clearPersistContent()
-
getContentLength
long getContentLength()
TODO: check consistency with HttpHeaders.CONTENT_LENGTH
-
getPersistContentLength
long getPersistContentLength()
-
getLastContentBytes
long getLastContentBytes()
-
getAveContentBytes
long getAveContentBytes()
getAveContentBytes.
-
getContentType
@NotNull() String getContentType()
getContentType.
-
setContentType
void setContentType(String value)
setContentType.
- Parameters:
value - a java.lang.String object.
-
getPrevSignature
@Nullable() ByteBuffer getPrevSignature()
getPrevSignature.
-
setPrevSignature
void setPrevSignature(@Nullable() ByteBuffer value)
setPrevSignature.
- Parameters:
value - a java.nio.ByteBuffer object.
-
getPrevSignatureAsString
@NotNull() String getPrevSignatureAsString()
getPrevSignatureAsString.
-
setProxy
void setProxy(@Nullable() String proxy)
setProxy.
- Parameters:
proxy - a java.lang.String object.
-
getActiveDomStatus
@Nullable() ActiveDomStatus getActiveDomStatus()
-
setActiveDomStatus
void setActiveDomStatus(ActiveDomStatus s)
-
getActiveDomStats
@NotNull() Map<String, ActiveDomStat> getActiveDomStats()
-
setActiveDomStats
void setActiveDomStats(@NotNull() Map<String, ActiveDomStat> stats)
-
getActiveDomUrls
@NotNull() ActiveDomUrls getActiveDomUrls()
-
setActiveDomUrls
void setActiveDomUrls(@NotNull() ActiveDomUrls urls)
setActiveDomUrls.
- Parameters:
urls - an ai.platon.pulsar.persist.model.ActiveDomUrls object.
-
getSignature
@Nullable() ByteBuffer getSignature()
An implementation of a WebPage's signature from which it can be identified and referenced at any point in time. This is essentially the WebPage's fingerprint, representing its state at any point in time.
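A content fingerprint of this kind is commonly computed as a message digest over the raw content, so that unchanged content yields an unchanged signature. The sketch below assumes MD5 purely for illustration; the digest actually used for WebPage signatures is not stated here, and the helper class is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical fingerprint helper; MD5 is an assumed algorithm.
final class SignatureSketch {
    static byte[] sign(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required of every Java platform implementation.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Renders the signature as lowercase hex, two characters per byte.
    static String toHex(byte[] signature) {
        StringBuilder sb = new StringBuilder(signature.length * 2);
        for (byte b : signature) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] content = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(sign(content)));
    }
}
```

Comparing the current signature with the previous one is then enough to tell whether the content changed between two fetches.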
-
setSignature
void setSignature(Array<byte> value)
setSignature.
- Parameters:
value - a byte array.
-
getSignatureAsString
@NotNull() String getSignatureAsString()
getSignatureAsString.
-
getPageTitle
@NotNull() String getPageTitle()
-
setPageTitle
void setPageTitle(String pageTitle)
-
getContentTitle
@NotNull() String getContentTitle()
-
setContentTitle
void setContentTitle(String contentTitle)
-
sniffTitle
@NotNull() String sniffTitle()
-
getPageText
@NotNull() String getPageText()
-
setPageText
void setPageText(String value)
-
getContentText
@NotNull() String getContentText()
-
setContentText
void setContentText(String textContent)
-
getContentTextLen
int getContentTextLen()
-
setTextCascaded
void setTextCascaded(String text)
-
getParseStatus
@NotNull() ParseStatus getParseStatus()
-
setParseStatus
void setParseStatus(ParseStatus parseStatus)
-
getLiveLinks
Map<CharSequence, GHypeLink> getLiveLinks()
-
getSimpleLiveLinks
Collection<String> getSimpleLiveLinks()
-
setLiveLinks
void setLiveLinks(Iterable<HyperlinkPersistable> liveLinks)
TODO: Remove redundant url to reduce space
- Parameters:
liveLinks - a java.lang.Iterable object.
-
setLiveLinks
void setLiveLinks(Map<CharSequence, GHypeLink> links)
-
addLiveLink
void addLiveLink(HyperlinkPersistable hyperLink)
-
getVividLinks
Map<CharSequence, CharSequence> getVividLinks()
-
getSimpleVividLinks
Collection<String> getSimpleVividLinks()
-
setVividLinks
void setVividLinks(Map<CharSequence, CharSequence> links)
-
getDeadLinks
List<CharSequence> getDeadLinks()
-
setDeadLinks
void setDeadLinks(List<CharSequence> deadLinks)
-
getLinks
List<CharSequence> getLinks()
-
setLinks
void setLinks(List<CharSequence> links)
-
addHyperlinks
void addHyperlinks(Iterable<HyperlinkPersistable> hypeLinks)
Record all links that appear in a page. The links are kept in FIFO order: each time we fetch and parse a page, we push newly discovered links to the queue; if the queue is full, we drop some old ones, which usually no longer appear in the page.
TODO: compress links
TODO: HBase seems not to modify any nested array
- Parameters:
hypeLinks - a java.lang.Iterable object.
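The FIFO behaviour described for addHyperlinks can be sketched as a bounded queue that evicts its oldest entry when full. The class, the capacity parameter, and the skipping of already-known links are assumptions made for illustration, not the actual WebPage logic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical bounded FIFO of links; oldest entries are dropped first.
final class LinkQueueSketch {
    private final int capacity;
    private final Deque<String> links = new ArrayDeque<>();

    LinkQueueSketch(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    void addLinks(Iterable<String> discovered) {
        for (String link : discovered) {
            if (links.contains(link)) {
                continue; // assumption: only newly discovered links are pushed
            }
            if (links.size() >= capacity) {
                links.removeFirst(); // queue is full: drop the oldest link
            }
            links.addLast(link);
        }
    }

    // Returns a copy, oldest link first.
    List<String> getLinks() {
        return new ArrayList<>(links);
    }

    public static void main(String[] args) {
        LinkQueueSketch queue = new LinkQueueSketch(3);
        queue.addLinks(Arrays.asList("a", "b", "c"));
        queue.addLinks(Arrays.asList("c", "d"));
        System.out.println(queue.getLinks());
    }
}
```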
-
addLinks
void addLinks(Iterable<CharSequence> hypeLinks)
addLinks.
- Parameters:
hypeLinks - a java.lang.Iterable object.
-
getImpreciseLinkCount
int getImpreciseLinkCount()
getImpreciseLinkCount.
-
setImpreciseLinkCount
void setImpreciseLinkCount(int count)
setImpreciseLinkCount.
- Parameters:
count - an int.
-
increaseImpreciseLinkCount
void increaseImpreciseLinkCount(int count)
increaseImpreciseLinkCount.
- Parameters:
count - an int.
-
getInlinks
Map<CharSequence, CharSequence> getInlinks()
getInlinks.
-
getAnchor
@NotNull() CharSequence getAnchor()
getAnchor.
-
setAnchor
void setAnchor(CharSequence anchor)
Anchor can be used to sniff the article title
- Parameters:
anchor - a java.lang.CharSequence object.
-
getInlinkAnchors
Array<String> getInlinkAnchors()
-
setInlinkAnchors
void setInlinkAnchors(Collection<CharSequence> anchors)
-
getAnchorOrder
int getAnchorOrder()
-
setAnchorOrder
void setAnchorOrder(int order)
-
getContentPublishTime
Instant getContentPublishTime()
-
setContentPublishTime
void setContentPublishTime(Instant publishTime)
-
updateContentPublishTime
boolean updateContentPublishTime(Instant newPublishTime)
-
getPrevContentPublishTime
Instant getPrevContentPublishTime()
-
setPrevContentPublishTime
void setPrevContentPublishTime(Instant publishTime)
-
getRefContentPublishTime
Instant getRefContentPublishTime()
-
setRefContentPublishTime
void setRefContentPublishTime(Instant publishTime)
-
getContentModifiedTime
Instant getContentModifiedTime()
-
setContentModifiedTime
void setContentModifiedTime(Instant modifiedTime)
-
getPrevContentModifiedTime
Instant getPrevContentModifiedTime()
-
setPrevContentModifiedTime
void setPrevContentModifiedTime(Instant modifiedTime)
-
updateContentModifiedTime
boolean updateContentModifiedTime(Instant newModifiedTime)
-
getPrevRefContentPublishTime
Instant getPrevRefContentPublishTime()
-
setPrevRefContentPublishTime
void setPrevRefContentPublishTime(Instant publishTime)
-
updateRefContentPublishTime
boolean updateRefContentPublishTime(Instant newRefPublishTime)
-
getReferrer
@NotNull() String getReferrer()
getReferrer.
-
setReferrer
void setReferrer(String referrer)
setReferrer.
- Parameters:
referrer - a java.lang.String object.
-
getPageModel
@NotNull() PageModel getPageModel()
Section: Page Model
-
getScore
float getScore()
Section: Scoring
-
setScore
void setScore(float value)
setScore.
- Parameters:
value - a float.
-
getContentScore
float getContentScore()
getContentScore.
-
setContentScore
void setContentScore(float score)
setContentScore.
- Parameters:
score - a float.
-
getSortScore
@NotNull() String getSortScore()
getSortScore.
-
setSortScore
void setSortScore(String score)
setSortScore.
- Parameters:
score - a java.lang.String object.
-
getCash
float getCash()
getCash.
-
setCash
void setCash(float cash)
setCash.
- Parameters:
cash - a float.
-
getPageCounters
@NotNull() PageCounters getPageCounters()
getPageCounters.
-
getIndexTimeHistory
String getIndexTimeHistory(String defaultValue)
Section: Index
- Parameters:
defaultValue - a java.lang.String object.
-
putIndexTimeHistory
void putIndexTimeHistory(Instant indexTime)
putIndexTimeHistory.
- Parameters:
indexTime - a java.time.Instant object.
-
getFirstIndexTime
Instant getFirstIndexTime(Instant defaultValue)
getFirstIndexTime.
- Parameters:
defaultValue - a java.time.Instant object.
-
hashCode
int hashCode()
-