Package org.htmlunit.cyberneko
Class HTMLScanner
- java.lang.Object
-
- org.htmlunit.cyberneko.HTMLScanner
-
- All Implemented Interfaces:
HTMLComponent, XMLComponent, XMLDocumentSource, XMLLocator, Locator2, Locator
public class HTMLScanner extends Object implements XMLDocumentSource, XMLLocator, HTMLComponent
A simple HTML scanner. This scanner makes no attempt to balance tags or fix other problems in the source document; it just scans what it can and generates XNI document "events", ignoring errors of all kinds.
This component recognizes the following features:
- http://cyberneko.org/html/features/augmentations
- http://cyberneko.org/html/features/report-errors
- http://cyberneko.org/html/features/scanner/script/strip-cdata-delims
- http://cyberneko.org/html/features/scanner/script/strip-comment-delims
- http://cyberneko.org/html/features/scanner/style/strip-cdata-delims
- http://cyberneko.org/html/features/scanner/style/strip-comment-delims
- http://cyberneko.org/html/features/scanner/ignore-specified-charset
- http://cyberneko.org/html/features/scanner/cdata-sections
- http://cyberneko.org/html/features/scanner/cdata-early-closing
- http://cyberneko.org/html/features/override-doctype
- http://cyberneko.org/html/features/insert-doctype
- http://cyberneko.org/html/features/parse-noscript-content
- http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe
- http://cyberneko.org/html/features/scanner/allow-selfclosing-tags
- http://cyberneko.org/html/features/scanner/normalize-attrs
- http://cyberneko.org/html/features/scanner/plain-attr-values
This component recognizes the following properties:
- http://cyberneko.org/html/properties/names/elems
- http://cyberneko.org/html/properties/names/attrs
- http://cyberneko.org/html/properties/default-encoding
- http://cyberneko.org/html/properties/error-reporter
- http://cyberneko.org/html/properties/encoding-translator
- http://cyberneko.org/html/properties/doctype/pubid
- http://cyberneko.org/html/properties/doctype/sysid
- Author:
- Andy Clark, Marc Guillemot, Ahmed Ashour, Ronald Brill, René Schwietzke
- See Also:
HTMLElements
-
-
Nested Class Summary
Nested Classes (modifier and type, class: description)
- class HTMLScanner.ContentScanner: The primary HTML document scanner.
- class HTMLScanner.PlainTextScanner: Special scanner used for PLAINTEXT
- static interface HTMLScanner.Scanner: Basic scanner interface.
- class HTMLScanner.ScriptScanner: Special scanner used for PLAINTEXT
- class HTMLScanner.SpecialScanner: Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references.
-
Field Summary
Fields (modifier and type, field: description)
- static String ALLOW_SELFCLOSING_IFRAME: Allows self closing <iframe/> tag
- static String ALLOW_SELFCLOSING_SCRIPT: Allows self closing <script/> tag
- static String ALLOW_SELFCLOSING_TAGS: Allows self closing tags e.g. <div/> (XHTML)
- static String AUGMENTATIONS: Include infoset augmentations.
- static String CDATA_EARLY_CLOSING: '>' closes the cdata section (see html spec)
- static String CDATA_SECTIONS: Scan CDATA sections.
- protected static boolean DEBUG_CALLBACKS: Set to true to debug callbacks.
- protected static int DEFAULT_BUFFER_SIZE
- static String DEFAULT_ENCODING: Default encoding.
- static String DOCTYPE_PUBID: Doctype declaration public identifier.
- static String DOCTYPE_SYSID: Doctype declaration system identifier.
- static String ENCODING_TRANSLATOR: Encoding translator.
- static String ERROR_REPORTER: Error reporter.
- protected int fBeginCharacterOffset: Beginning character offset in the file.
- protected int fBeginColumnNumber: Beginning column number.
- protected int fBeginLineNumber: Beginning line number.
- protected PlaybackInputStream fByteStream: The playback byte stream.
- protected HTMLScanner.Scanner fContentScanner: Content scanner.
- protected MiniStack<org.htmlunit.cyberneko.HTMLScanner.CurrentEntity> fCurrentEntityStack: The current entity stack.
- protected String fDefaultIANAEncoding: Default encoding.
- protected String fDoctypePubid: Doctype declaration public identifier.
- protected String fDoctypeSysid: Doctype declaration system identifier.
- protected XMLDocumentHandler fDocumentHandler: The document handler.
- protected int fElementCount: Element count.
- protected int fElementDepth: Element depth.
- protected EncodingTranslator fEncodingTranslator: Encoding translator.
- protected HTMLErrorReporter fErrorReporter: Error reporter.
- protected String fIANAEncoding: Auto-detected IANA encoding.
- protected String fJavaEncoding: Auto-detected Java encoding.
- protected short fNamesAttrs: Modify HTML attribute names.
- protected short fNamesElems: Modify HTML element names.
- protected HTMLScanner.Scanner fScanner: The current scanner.
- protected short fScannerState: The current scanner state.
- protected HTMLScanner.ScriptScanner fScriptScanner: Special scanner used for script tags.
- protected HTMLScanner.SpecialScanner fSpecialScanner: Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references.
- protected XMLString fStringBuffer: String buffer.
- static String HTML_4_01_FRAMESET_PUBID: HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN").
- static String HTML_4_01_FRAMESET_SYSID: HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd").
- static String HTML_4_01_STRICT_PUBID: HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN").
- static String HTML_4_01_STRICT_SYSID: HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd").
- static String HTML_4_01_TRANSITIONAL_PUBID: HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN").
- static String HTML_4_01_TRANSITIONAL_SYSID: HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd").
- static String IGNORE_SPECIFIED_CHARSET: Ignore specified charset found in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag or in the <?xml … encoding='…'> processing instruction.
- static String INSERT_DOCTYPE: Insert document type declaration.
- static String NAMES_ATTRS: Modify HTML attribute names: { "upper", "lower", "default" }.
- static String NAMES_ELEMS: Modify HTML element names: { "upper", "lower", "default" }.
- protected static short NAMES_LOWERCASE: Lowercase HTML names.
- protected static short NAMES_NO_CHANGE: Don't modify HTML names.
- protected static short NAMES_UPPERCASE: Uppercase HTML names.
- static String NORMALIZE_ATTRIBUTES: Normalize attribute values.
- static String OVERRIDE_DOCTYPE: Override doctype declaration public and system identifiers.
- static String PARSE_NOSCRIPT_CONTENT: Parse <noscript>...</noscript> content
- static String PLAIN_ATTRIBUTE_VALUES: Store the plain attribute values also.
- static String REPORT_ERRORS: Report errors.
- static String SCRIPT_STRIP_CDATA_DELIMS: Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from SCRIPT tag contents.
- static String SCRIPT_STRIP_COMMENT_DELIMS: Strip HTML comment delimiters ("<!--" and "-->") from SCRIPT tag contents.
- protected static short STATE_CONTENT: State: content.
- protected static short STATE_END_DOCUMENT: State: end document.
- protected static short STATE_MARKUP_BRACKET: State: markup bracket.
- protected static short STATE_START_DOCUMENT: State: start document.
- static String STYLE_STRIP_CDATA_DELIMS: Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from STYLE tag contents.
- static String STYLE_STRIP_COMMENT_DELIMS: Strip HTML comment delimiters ("<!--" and "-->") from STYLE tag contents.
-
Method Summary
Methods (modifier and type, method: description)
- void cleanup(boolean closeall): Cleans up used resources.
- void evaluateInputSource(XMLInputSource inputSource): Immediately evaluates an input source and adds the new content (e.g. the output written by an embedded script).
- protected static String fixURI(String str): Fixes a platform dependent filename to standard URI form.
- String getBaseSystemId(): Returns the base system identifier.
- int getCharacterOffset(): Returns the character offset.
- int getColumnNumber(): Returns the current column number.
- XMLDocumentHandler getDocumentHandler(): Returns the document handler.
- String getEncoding(): Returns the encoding.
- Boolean getFeatureDefault(String featureId): Returns the default state for a feature.
- int getLineNumber(): Returns the current line number.
- String getLiteralSystemId(): Returns the literal system identifier.
- protected static short getNamesValue(String value)
- Object getPropertyDefault(String propertyId): Returns the default state for a property.
- String getPublicId(): Returns the public identifier.
- String[] getRecognizedFeatures(): Returns recognized features.
- String[] getRecognizedProperties(): Returns recognized properties.
- String getSystemId(): Returns the expanded system identifier.
- protected static String getValue(XMLAttributes attrs, String aname)
- String getXMLVersion(): Returns the XML version.
- protected Augmentations locationAugs(org.htmlunit.cyberneko.HTMLScanner.CurrentEntity currentEntity)
- protected static String modifyName(String name, short mode)
- protected String nextContent(int len): Reads the next characters WITHOUT impacting the buffer content up to current offset.
- void pushInputSource(XMLInputSource inputSource): Pushes an input source onto the current entity stack.
- protected int readPreservingBufferContent()
- void reset(XMLComponentManager manager): Resets the component.
- protected void scanDoctype()
- boolean scanDocument(boolean complete): Scans a document.
- protected int scanEntityRef(XMLString str, XMLString plainValue, boolean content)
- protected String scanLiteral()
- protected String scanName(boolean strict)
- protected String scanTagName()
- void setDocumentHandler(XMLDocumentHandler handler): Sets the document handler.
- void setFeature(String featureId, boolean state): Sets a feature.
- void setInputSource(XMLInputSource source): Sets the input source.
- void setProperty(String propertyId, Object value): Sets a property.
- protected void setScanner(HTMLScanner.Scanner scanner)
- protected void setScannerState(short state)
- protected boolean skip(String s)
- protected boolean skipMarkup(boolean balance)
- protected int skipNewlines()
- protected boolean skipSpaces()
- protected Augmentations synthesizedAugs()
- static String systemId(String systemId, String baseSystemId): Expands a system id and returns the system id as a URI, if it can be expanded.
-
-
-
Field Detail
-
HTML_4_01_STRICT_PUBID
public static final String HTML_4_01_STRICT_PUBID
HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN").
- See Also:
- Constant Field Values
-
HTML_4_01_STRICT_SYSID
public static final String HTML_4_01_STRICT_SYSID
HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd").
- See Also:
- Constant Field Values
-
HTML_4_01_TRANSITIONAL_PUBID
public static final String HTML_4_01_TRANSITIONAL_PUBID
HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN").
- See Also:
- Constant Field Values
-
HTML_4_01_TRANSITIONAL_SYSID
public static final String HTML_4_01_TRANSITIONAL_SYSID
HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd").
- See Also:
- Constant Field Values
-
HTML_4_01_FRAMESET_PUBID
public static final String HTML_4_01_FRAMESET_PUBID
HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN").
- See Also:
- Constant Field Values
-
HTML_4_01_FRAMESET_SYSID
public static final String HTML_4_01_FRAMESET_SYSID
HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd").
- See Also:
- Constant Field Values
-
AUGMENTATIONS
public static final String AUGMENTATIONS
Include infoset augmentations.
- See Also:
- Constant Field Values
-
REPORT_ERRORS
public static final String REPORT_ERRORS
Report errors.
- See Also:
- Constant Field Values
-
SCRIPT_STRIP_COMMENT_DELIMS
public static final String SCRIPT_STRIP_COMMENT_DELIMS
Strip HTML comment delimiters ("<!--" and "-->") from SCRIPT tag contents.
- See Also:
- Constant Field Values
-
SCRIPT_STRIP_CDATA_DELIMS
public static final String SCRIPT_STRIP_CDATA_DELIMS
Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from SCRIPT tag contents.
- See Also:
- Constant Field Values
-
STYLE_STRIP_COMMENT_DELIMS
public static final String STYLE_STRIP_COMMENT_DELIMS
Strip HTML comment delimiters ("<!--" and "-->") from STYLE tag contents.
- See Also:
- Constant Field Values
-
STYLE_STRIP_CDATA_DELIMS
public static final String STYLE_STRIP_CDATA_DELIMS
Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from STYLE tag contents.
- See Also:
- Constant Field Values
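The four strip-delimiter features can be pictured with a small string helper. This is only a sketch: `DelimiterStripSketch` and its methods are hypothetical names, and the whitespace handling (trimming before matching) is an assumption; the scanner itself works on the input stream rather than on a finished string.

```java
/** Hypothetical illustration of delimiter stripping for SCRIPT/STYLE content. */
final class DelimiterStripSketch {

    /** Removes one leading 'open' and one trailing 'close' delimiter, ignoring outer whitespace. */
    static String strip(String content, String open, String close) {
        String trimmed = content.trim();
        if (trimmed.startsWith(open) && trimmed.endsWith(close)
                && trimmed.length() >= open.length() + close.length()) {
            return trimmed.substring(open.length(), trimmed.length() - close.length());
        }
        return content; // no complete delimiter pair: leave the content untouched
    }

    /** Strips HTML comment delimiters. */
    static String stripComment(String content) {
        return strip(content, "<!--", "-->");
    }

    /** Strips XHTML CDATA delimiters. */
    static String stripCdata(String content) {
        return strip(content, "<![CDATA[", "]]>");
    }

    public static void main(String[] args) {
        System.out.println(stripComment("<!-- alert(1); -->"));
        System.out.println(stripCdata("<![CDATA[a<b]]>"));
    }
}
```

Content without a complete delimiter pair is returned unchanged, which keeps the helper safe to apply unconditionally.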
-
IGNORE_SPECIFIED_CHARSET
public static final String IGNORE_SPECIFIED_CHARSET
Ignore specified charset found in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag or in the <?xml … encoding='…'> processing instruction.
- See Also:
- Constant Field Values
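As a rough picture of what ignoring the specified charset means, the sketch below extracts the charset from a Content-Type value and then chooses between it and an already detected encoding. `MetaCharsetSketch`, `charsetOf`, and `chooseEncoding` are hypothetical names, and the regular expression is an assumption, not the scanner's own parsing logic.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of honoring or ignoring a declared charset. */
final class MetaCharsetSketch {

    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or null if none is present. */
    static String charsetOf(String contentType) {
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** Uses the declared charset unless it is absent or the caller asked to ignore it. */
    static String chooseEncoding(String contentType, String detected, boolean ignoreSpecified) {
        String declared = charsetOf(contentType);
        return (ignoreSpecified || declared == null) ? detected : declared;
    }

    public static void main(String[] args) {
        System.out.println(chooseEncoding("text/html;charset=ISO-8859-1", "UTF-8", false));
        System.out.println(chooseEncoding("text/html;charset=ISO-8859-1", "UTF-8", true));
    }
}
```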
-
CDATA_SECTIONS
public static final String CDATA_SECTIONS
Scan CDATA sections.
- See Also:
- Constant Field Values
-
CDATA_EARLY_CLOSING
public static final String CDATA_EARLY_CLOSING
'>' closes the CDATA section (see the HTML spec).
- See Also:
- Constant Field Values
-
OVERRIDE_DOCTYPE
public static final String OVERRIDE_DOCTYPE
Override doctype declaration public and system identifiers.
- See Also:
- Constant Field Values
-
INSERT_DOCTYPE
public static final String INSERT_DOCTYPE
Insert document type declaration.
- See Also:
- Constant Field Values
-
PARSE_NOSCRIPT_CONTENT
public static final String PARSE_NOSCRIPT_CONTENT
Parse <noscript>...</noscript> content.
- See Also:
- Constant Field Values
-
ALLOW_SELFCLOSING_IFRAME
public static final String ALLOW_SELFCLOSING_IFRAME
Allows self-closing <iframe/> tag.
- See Also:
- Constant Field Values
-
ALLOW_SELFCLOSING_SCRIPT
public static final String ALLOW_SELFCLOSING_SCRIPT
Allows self-closing <script/> tag.
- See Also:
- Constant Field Values
-
ALLOW_SELFCLOSING_TAGS
public static final String ALLOW_SELFCLOSING_TAGS
Allows self-closing tags, e.g. <div/> (XHTML).
- See Also:
- Constant Field Values
-
NORMALIZE_ATTRIBUTES
public static final String NORMALIZE_ATTRIBUTES
Normalize attribute values.
- See Also:
- Constant Field Values
-
PLAIN_ATTRIBUTE_VALUES
public static final String PLAIN_ATTRIBUTE_VALUES
Store the plain attribute values also.
- See Also:
- Constant Field Values
-
NAMES_ELEMS
public static final String NAMES_ELEMS
Modify HTML element names: { "upper", "lower", "default" }.
- See Also:
- Constant Field Values
-
NAMES_ATTRS
public static final String NAMES_ATTRS
Modify HTML attribute names: { "upper", "lower", "default" }.
- See Also:
- Constant Field Values
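The three values accepted by the names properties can be pictured as a small case-folding switch. The sketch below is illustrative only: `NameModeSketch`, `NameMode`, `parse`, and `apply` are hypothetical names, and falling back to "no change" for any value other than "upper" or "lower" is an assumption rather than confirmed HTMLScanner behavior.

```java
import java.util.Locale;

/** Hypothetical illustration of the "upper" / "lower" / "default" name modes. */
final class NameModeSketch {

    enum NameMode { NO_CHANGE, UPPERCASE, LOWERCASE }

    /** Maps a property value to a mode; other values fall back to NO_CHANGE (assumption). */
    static NameMode parse(String value) {
        if ("upper".equals(value)) {
            return NameMode.UPPERCASE;
        }
        if ("lower".equals(value)) {
            return NameMode.LOWERCASE;
        }
        return NameMode.NO_CHANGE;
    }

    /** Applies the mode to an element or attribute name. */
    static String apply(String name, NameMode mode) {
        switch (mode) {
            case UPPERCASE:
                return name.toUpperCase(Locale.ROOT);
            case LOWERCASE:
                return name.toLowerCase(Locale.ROOT);
            default:
                return name;
        }
    }

    public static void main(String[] args) {
        System.out.println(apply("Div", parse("upper")));   // prints DIV
        System.out.println(apply("Div", parse("lower")));   // prints div
        System.out.println(apply("Div", parse("default"))); // prints Div
    }
}
```

Locale.ROOT is used so the folding does not depend on the default locale of the JVM.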
-
DEFAULT_ENCODING
public static final String DEFAULT_ENCODING
Default encoding.
- See Also:
- Constant Field Values
-
ERROR_REPORTER
public static final String ERROR_REPORTER
Error reporter.
- See Also:
- Constant Field Values
-
ENCODING_TRANSLATOR
public static final String ENCODING_TRANSLATOR
Encoding translator.
- See Also:
- Constant Field Values
-
DOCTYPE_PUBID
public static final String DOCTYPE_PUBID
Doctype declaration public identifier.
- See Also:
- Constant Field Values
-
DOCTYPE_SYSID
public static final String DOCTYPE_SYSID
Doctype declaration system identifier.
- See Also:
- Constant Field Values
-
STATE_CONTENT
protected static final short STATE_CONTENT
State: content.
- See Also:
- Constant Field Values
-
STATE_MARKUP_BRACKET
protected static final short STATE_MARKUP_BRACKET
State: markup bracket.
- See Also:
- Constant Field Values
-
STATE_START_DOCUMENT
protected static final short STATE_START_DOCUMENT
State: start document.
- See Also:
- Constant Field Values
-
STATE_END_DOCUMENT
protected static final short STATE_END_DOCUMENT
State: end document.
- See Also:
- Constant Field Values
-
NAMES_NO_CHANGE
protected static final short NAMES_NO_CHANGE
Don't modify HTML names.
- See Also:
- Constant Field Values
-
NAMES_UPPERCASE
protected static final short NAMES_UPPERCASE
Uppercase HTML names.
- See Also:
- Constant Field Values
-
NAMES_LOWERCASE
protected static final short NAMES_LOWERCASE
Lowercase HTML names.
- See Also:
- Constant Field Values
-
DEFAULT_BUFFER_SIZE
protected static final int DEFAULT_BUFFER_SIZE
- See Also:
- Constant Field Values
-
DEBUG_CALLBACKS
protected static final boolean DEBUG_CALLBACKS
Set to true to debug callbacks.
- See Also:
- Constant Field Values
-
fNamesElems
protected short fNamesElems
Modify HTML element names.
-
fNamesAttrs
protected short fNamesAttrs
Modify HTML attribute names.
-
fDefaultIANAEncoding
protected String fDefaultIANAEncoding
Default encoding.
-
fErrorReporter
protected HTMLErrorReporter fErrorReporter
Error reporter.
-
fEncodingTranslator
protected EncodingTranslator fEncodingTranslator
Encoding translator.
-
fDoctypePubid
protected String fDoctypePubid
Doctype declaration public identifier.
-
fDoctypeSysid
protected String fDoctypeSysid
Doctype declaration system identifier.
-
fBeginLineNumber
protected int fBeginLineNumber
Beginning line number.
-
fBeginColumnNumber
protected int fBeginColumnNumber
Beginning column number.
-
fBeginCharacterOffset
protected int fBeginCharacterOffset
Beginning character offset in the file.
-
fByteStream
protected PlaybackInputStream fByteStream
The playback byte stream.
-
fCurrentEntityStack
protected final MiniStack<org.htmlunit.cyberneko.HTMLScanner.CurrentEntity> fCurrentEntityStack
The current entity stack.
-
fScanner
protected HTMLScanner.Scanner fScanner
The current scanner.
-
fScannerState
protected short fScannerState
The current scanner state.
-
fDocumentHandler
protected XMLDocumentHandler fDocumentHandler
The document handler.
-
fIANAEncoding
protected String fIANAEncoding
Auto-detected IANA encoding.
-
fJavaEncoding
protected String fJavaEncoding
Auto-detected Java encoding.
-
fElementCount
protected int fElementCount
Element count.
-
fElementDepth
protected int fElementDepth
Element depth.
-
fContentScanner
protected HTMLScanner.Scanner fContentScanner
Content scanner.
-
fSpecialScanner
protected final HTMLScanner.SpecialScanner fSpecialScanner
Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references. For example: <SCRIPT> and <COMMENT>.
-
fScriptScanner
protected final HTMLScanner.ScriptScanner fScriptScanner
Special scanner used for script tags.
-
fStringBuffer
protected final XMLString fStringBuffer
String buffer.
-
-
Method Detail
-
pushInputSource
public void pushInputSource(XMLInputSource inputSource)
Pushes an input source onto the current entity stack. This enables the scanner to transparently scan new content (e.g. the output written by an embedded script). At the end of the current entity, the scanner returns to where it left off at the time this entity source was pushed.
Note: This functionality is experimental at this time and is subject to change in future releases of NekoHTML.
- Parameters:
inputSource - The new input source to start scanning.
- See Also:
evaluateInputSource(XMLInputSource)
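The "return where it left off" behavior can be sketched with a stack of readers. This is not the scanner's implementation: `EntityStackSketch` is a hypothetical class that only shows the idea of reading from the most recently pushed source and resuming the one underneath once it is exhausted.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical illustration of a stack of input sources. */
final class EntityStackSketch {

    private final Deque<Reader> stack = new ArrayDeque<>();

    /** Pushes a new source; reading continues from it until it is exhausted. */
    void push(Reader source) {
        stack.push(source);
    }

    /** Reads one character, popping exhausted sources; returns -1 when every source is done. */
    int read() {
        try {
            while (!stack.isEmpty()) {
                int c = stack.peek().read();
                if (c != -1) {
                    return c;
                }
                stack.pop().close(); // resume the source underneath
            }
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads everything that remains across all stacked sources. */
    String readAll() {
        StringBuilder out = new StringBuilder();
        for (int c = read(); c != -1; c = read()) {
            out.append((char) c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        EntityStackSketch in = new EntityStackSketch();
        in.push(new StringReader("<p>ab</p>"));
        StringBuilder seen = new StringBuilder();
        for (int i = 0; i < 3; i++) {
            seen.append((char) in.read()); // consume "<p>"
        }
        in.push(new StringReader("X")); // e.g. text written by an embedded script
        seen.append(in.readAll());
        System.out.println(seen); // prints <p>Xab</p>
    }
}
```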
-
evaluateInputSource
public void evaluateInputSource(XMLInputSource inputSource)
Immediately evaluates an input source and adds the new content (e.g. the output written by an embedded script).
- Parameters:
inputSource - The new input source to start evaluating.
- See Also:
pushInputSource(XMLInputSource)
-
cleanup
public void cleanup(boolean closeall)
Cleans up used resources. For example, if scanning is terminated early, then this method ensures all remaining open streams are closed.
- Parameters:
closeall - Close all streams, including the original. This is used in cases when the application has opened the original document stream and should be responsible for closing it.
-
getEncoding
public String getEncoding()
Returns the encoding.
- Specified by:
getEncoding in interface Locator2
-
getPublicId
public String getPublicId()
Returns the public identifier.
- Specified by:
getPublicId in interface Locator
-
getBaseSystemId
public String getBaseSystemId()
Returns the base system identifier.
- Specified by:
getBaseSystemId in interface XMLLocator
- Returns:
- the base system identifier.
-
getLiteralSystemId
public String getLiteralSystemId()
Returns the literal system identifier.
- Specified by:
getLiteralSystemId in interface XMLLocator
- Returns:
- the literal system identifier.
-
getSystemId
public String getSystemId()
Returns the expanded system identifier.
- Specified by:
getSystemId in interface Locator
-
getLineNumber
public int getLineNumber()
Returns the current line number.
- Specified by:
getLineNumber in interface Locator
-
getColumnNumber
public int getColumnNumber()
Returns the current column number.
- Specified by:
getColumnNumber in interface Locator
-
getXMLVersion
public String getXMLVersion()
Returns the XML version.
- Specified by:
getXMLVersion in interface Locator2
-
getCharacterOffset
public int getCharacterOffset()
Returns the character offset.
- Specified by:
getCharacterOffset in interface XMLLocator
- Returns:
- the character offset, or -1 if no character offset is available.
-
getFeatureDefault
public Boolean getFeatureDefault(String featureId)
Returns the default state for a feature.
- Specified by:
getFeatureDefault in interface HTMLComponent
- Specified by:
getFeatureDefault in interface XMLComponent
- Parameters:
featureId - The feature identifier.
- Returns:
- the default state for a feature, or null if this component does not want to report a default value for this feature.
-
getPropertyDefault
public Object getPropertyDefault(String propertyId)
Returns the default state for a property.
- Specified by:
getPropertyDefault in interface HTMLComponent
- Specified by:
getPropertyDefault in interface XMLComponent
- Parameters:
propertyId - The property identifier.
- Returns:
- the default state for a property, or null if this component does not want to report a default value for this property.
-
getRecognizedFeatures
public String[] getRecognizedFeatures()
Returns recognized features.
- Specified by:
getRecognizedFeatures in interface XMLComponent
- Returns:
- an array of feature identifiers that are recognized by this component. This method may return null if no features are recognized by this component.
-
getRecognizedProperties
public String[] getRecognizedProperties()
Returns recognized properties.
- Specified by:
getRecognizedProperties in interface XMLComponent
- Returns:
- an array of property identifiers that are recognized by this component. This method may return null if no properties are recognized by this component.
-
reset
public void reset(XMLComponentManager manager) throws XMLConfigurationException
Resets the component.
- Specified by:
reset in interface XMLComponent
- Parameters:
manager - The component manager.
- Throws:
XMLConfigurationException
-
setFeature
public void setFeature(String featureId, boolean state)
Sets a feature.
- Specified by:
setFeature in interface XMLComponent
- Parameters:
featureId - The feature identifier.
state - The state of the feature.
-
setProperty
public void setProperty(String propertyId, Object value) throws XMLConfigurationException
Sets a property.
- Specified by:
setProperty in interface XMLComponent
- Parameters:
propertyId - The property identifier.
value - The value of the property.
- Throws:
XMLConfigurationException - Thrown for configuration error. In general, components should only throw this exception if it is really a critical error.
-
setInputSource
public void setInputSource(XMLInputSource source) throws IOException
Sets the input source.
- Parameters:
source - The input source.
- Throws:
IOException - Thrown on I/O error.
-
scanDocument
public boolean scanDocument(boolean complete) throws XNIException, IOException
Scans a document.
- Parameters:
complete - True if the scanner should scan the document completely, pushing all events to the registered document handler. A value of false indicates that the scanner should only scan the next portion of the document and return. A scanner instance is permitted to completely scan a document if it does not support this "pull" scanning model.
- Returns:
- True if there is more to scan, false otherwise.
- Throws:
IOException - Thrown on I/O error.
XNIException - on error.
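The difference between complete and "pull" scanning can be pictured with a toy scanner. `PullScanSketch` is hypothetical and does not use HTMLScanner; in particular, delivering exactly one event per pull call is an assumption, since the description above only promises "the next portion" of the document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical toy scanner showing the complete and pull contracts. */
final class PullScanSketch {

    private final Iterator<String> pending;
    private final List<String> delivered = new ArrayList<>();

    PullScanSketch(List<String> events) {
        this.pending = events.iterator();
    }

    /**
     * With complete == true, delivers every remaining event; otherwise delivers
     * only the next one. Returns true if there is more to scan.
     */
    boolean scanDocument(boolean complete) {
        do {
            if (!pending.hasNext()) {
                return false;
            }
            delivered.add(pending.next());
        } while (complete);
        return pending.hasNext();
    }

    /** Events handed out so far, standing in for calls on a document handler. */
    List<String> delivered() {
        return delivered;
    }

    public static void main(String[] args) {
        PullScanSketch scanner = new PullScanSketch(Arrays.asList("startElement", "characters", "endElement"));
        while (scanner.scanDocument(false)) {
            // the caller can interleave its own work between portions here
        }
        System.out.println(scanner.delivered());
    }
}
```

The loop form `while (scanDocument(false)) { ... }` is the usual way to drive a pull-style contract of this shape: the boolean result doubles as the loop condition.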
-
setDocumentHandler
public void setDocumentHandler(XMLDocumentHandler handler)
Sets the document handler.
- Specified by:
setDocumentHandler in interface XMLDocumentSource
- Parameters:
handler - the new handler
-
getDocumentHandler
public XMLDocumentHandler getDocumentHandler()
Returns the document handler.
- Specified by:
getDocumentHandler in interface XMLDocumentSource
- Returns:
- the document handler
-
getValue
protected static String getValue(XMLAttributes attrs, String aname)
-
systemId
public static String systemId(String systemId, String baseSystemId)
Expands a system id and returns the system id as a URI, if it can be expanded. A return value of null means that the identifier is already expanded. An exception thrown indicates a failure to expand the id.
- Parameters:
systemId - The systemId to be expanded.
baseSystemId - The base system identifier.
- Returns:
- Returns the URI string representing the expanded system identifier. A null value indicates that the given system identifier is already expanded.
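The documented contract (a URI string when expansion happens, null when the identifier is already expanded) can be imitated with java.net.URI. This is a sketch, not the library's code: `SystemIdSketch.expand` is a hypothetical helper, it treats "absolute" as "already expanded", and it assumes a non-null, well-formed base.

```java
import java.net.URI;

/** Hypothetical illustration of expanding a system identifier against a base. */
final class SystemIdSketch {

    /**
     * Resolves systemId against baseSystemId. Returns null when systemId is
     * already absolute, imitating the documented "already expanded" result.
     * Throws IllegalArgumentException if either argument is not a valid URI.
     */
    static String expand(String systemId, String baseSystemId) {
        URI id = URI.create(systemId);
        if (id.isAbsolute()) {
            return null;
        }
        return URI.create(baseSystemId).resolve(id).toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("b.dtd", "http://example.com/a/index.html"));
        System.out.println(expand("http://example.com/x.dtd", "http://example.com/a/index.html"));
    }
}
```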
-
fixURI
protected static String fixURI(String str)
Fixes a platform dependent filename to standard URI form.
- Parameters:
str - The string to fix.
- Returns:
- Returns the fixed URI string.
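One plausible reading of "standard URI form" is forward slashes plus a leading slash before a drive letter. The sketch below follows that reading only; `FixUriSketch.fixUri` is hypothetical, and it hard-codes the backslash instead of the platform separator so that it behaves the same on every platform.

```java
/** Hypothetical illustration of turning a platform filename into URI-style form. */
final class FixUriSketch {

    static String fixUri(String str) {
        // Use forward slashes regardless of the platform the name came from.
        String s = str.replace('\\', '/');
        // A drive-letter path such as "C:/dir/file" gets a leading slash.
        if (s.length() >= 2 && s.charAt(1) == ':' && Character.isLetter(s.charAt(0))) {
            s = "/" + s;
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(fixUri("C:\\docs\\page.html")); // prints /C:/docs/page.html
        System.out.println(fixUri("/tmp/page.html"));       // prints /tmp/page.html
    }
}
```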
-
getNamesValue
protected static short getNamesValue(String value)
-
setScanner
protected void setScanner(HTMLScanner.Scanner scanner)
-
setScannerState
protected void setScannerState(short state)
-
scanDoctype
protected void scanDoctype() throws IOException
- Throws:
IOException
-
scanLiteral
protected String scanLiteral() throws IOException
- Throws:
IOException
-
scanName
protected String scanName(boolean strict) throws IOException
- Throws:
IOException
-
scanTagName
protected String scanTagName() throws IOException
- Throws:
IOException
-
scanEntityRef
protected int scanEntityRef(XMLString str, XMLString plainValue, boolean content) throws IOException
- Throws:
IOException
-
skip
protected boolean skip(String s) throws IOException
- Throws:
IOException
-
skipMarkup
protected boolean skipMarkup(boolean balance) throws IOException
- Throws:
IOException
-
skipSpaces
protected boolean skipSpaces() throws IOException
- Throws:
IOException
-
skipNewlines
protected int skipNewlines() throws IOException
- Throws:
IOException
-
locationAugs
protected final Augmentations locationAugs(org.htmlunit.cyberneko.HTMLScanner.CurrentEntity currentEntity)
-
synthesizedAugs
protected final Augmentations synthesizedAugs()
-
nextContent
protected String nextContent(int len) throws IOException
Reads the next characters WITHOUT impacting the buffer content up to current offset.
- Parameters:
len - the number of characters to read
- Returns:
- the read string (length may be smaller if EOF is encountered)
- Throws:
IOException - in case of I/O problems
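Reading ahead without consuming input can be sketched with the mark/reset support of java.io.Reader. This stands in for the scanner's internal buffer handling, which is not shown here: `PeekSketch.peek` is a hypothetical helper and requires a reader for which markSupported() is true.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical illustration of look-ahead that leaves the read position unchanged. */
final class PeekSketch {

    /** Returns up to len upcoming characters; fewer if the end of input is reached. */
    static String peek(Reader in, int len) {
        try {
            in.mark(len); // remember the current position
            char[] buf = new char[len];
            int total = 0;
            while (total < len) {
                int n = in.read(buf, total, len - total);
                if (n == -1) {
                    break; // end of input: return what was read so far
                }
                total += n;
            }
            in.reset(); // rewind so later reads see the same characters
            return new String(buf, 0, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Reader in = new StringReader("<script>");
        System.out.println(peek(in, 3));   // prints <sc
        System.out.println(peek(in, 100)); // prints <script>
    }
}
```

The second call returning the full input shows that the first call did not advance the reader.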
-
readPreservingBufferContent
protected int readPreservingBufferContent() throws IOException
- Throws:
IOException
-
-