Class HTMLScanner

  • All Implemented Interfaces:
    HTMLComponent, XMLComponent, XMLDocumentSource, XMLLocator, Locator2, Locator

    public class HTMLScanner
    extends Object
    implements XMLDocumentSource, XMLLocator, HTMLComponent
    A simple HTML scanner. This scanner makes no attempt to balance tags or fix other problems in the source document; it scans what it can and generates XNI document "events", ignoring errors of all kinds.

    This component recognizes the following features:

    • http://cyberneko.org/html/features/augmentations
    • http://cyberneko.org/html/features/report-errors
    • http://cyberneko.org/html/features/scanner/script/strip-cdata-delims
    • http://cyberneko.org/html/features/scanner/script/strip-comment-delims
    • http://cyberneko.org/html/features/scanner/style/strip-cdata-delims
    • http://cyberneko.org/html/features/scanner/style/strip-comment-delims
    • http://cyberneko.org/html/features/scanner/ignore-specified-charset
    • http://cyberneko.org/html/features/scanner/cdata-sections
    • http://cyberneko.org/html/features/scanner/cdata-early-closing
    • http://cyberneko.org/html/features/override-doctype
    • http://cyberneko.org/html/features/insert-doctype
    • http://cyberneko.org/html/features/parse-noscript-content
    • http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe
    • http://cyberneko.org/html/features/scanner/allow-selfclosing-tags
    • http://cyberneko.org/html/features/scanner/normalize-attrs
    • http://cyberneko.org/html/features/scanner/plain-attr-values

    This component recognizes the following properties:

    • http://cyberneko.org/html/properties/names/elems
    • http://cyberneko.org/html/properties/names/attrs
    • http://cyberneko.org/html/properties/default-encoding
    • http://cyberneko.org/html/properties/error-reporter
    • http://cyberneko.org/html/properties/encoding-translator
    • http://cyberneko.org/html/properties/doctype/pubid
    • http://cyberneko.org/html/properties/doctype/sysid
    Author:
    Andy Clark, Marc Guillemot, Ahmed Ashour, Ronald Brill, René Schwietzke
    See Also:
    HTMLElements
    • Field Detail

      • HTML_4_01_STRICT_PUBID

        public static final String HTML_4_01_STRICT_PUBID
        HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN").
        See Also:
        Constant Field Values
      • HTML_4_01_STRICT_SYSID

        public static final String HTML_4_01_STRICT_SYSID
        HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd").
        See Also:
        Constant Field Values
      • HTML_4_01_TRANSITIONAL_PUBID

        public static final String HTML_4_01_TRANSITIONAL_PUBID
        HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN").
        See Also:
        Constant Field Values
      • HTML_4_01_TRANSITIONAL_SYSID

        public static final String HTML_4_01_TRANSITIONAL_SYSID
        HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd").
        See Also:
        Constant Field Values
      • HTML_4_01_FRAMESET_PUBID

        public static final String HTML_4_01_FRAMESET_PUBID
        HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN").
        See Also:
        Constant Field Values
      • HTML_4_01_FRAMESET_SYSID

        public static final String HTML_4_01_FRAMESET_SYSID
        HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd").
        See Also:
        Constant Field Values
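To show how a public and system identifier pair such as the constants above fits into a document type declaration, here is a minimal sketch. `DoctypeSketch` and its `doctype` method are hypothetical names used for illustration only and are not part of this class; the two string values are copied from the HTML 4.01 strict constants documented above.

```java
/** Hypothetical helper for illustration; not part of HTMLScanner. */
final class DoctypeSketch {

    // Values copied from the HTML 4.01 strict constants documented above.
    static final String STRICT_PUBID = "-//W3C//DTD HTML 4.01//EN";
    static final String STRICT_SYSID = "http://www.w3.org/TR/html4/strict.dtd";

    /** Builds a DOCTYPE declaration from a root name and optional identifiers. */
    static String doctype(String root, String pubid, String sysid) {
        StringBuilder sb = new StringBuilder("<!DOCTYPE ").append(root);
        if (pubid != null) {
            sb.append(" PUBLIC \"").append(pubid).append('"');
            if (sysid != null) {
                sb.append(" \"").append(sysid).append('"');
            }
        } else if (sysid != null) {
            sb.append(" SYSTEM \"").append(sysid).append('"');
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        System.out.println(doctype("HTML", STRICT_PUBID, STRICT_SYSID));
    }
}
```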
      • SCRIPT_STRIP_COMMENT_DELIMS

        public static final String SCRIPT_STRIP_COMMENT_DELIMS
        Strip HTML comment delimiters ("<!--" and "-->") from SCRIPT tag contents.
        See Also:
        Constant Field Values
      • SCRIPT_STRIP_CDATA_DELIMS

        public static final String SCRIPT_STRIP_CDATA_DELIMS
        Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from SCRIPT tag contents.
        See Also:
        Constant Field Values
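As a rough illustration of what the two SCRIPT delimiter features describe, the sketch below removes one leading and one trailing delimiter from a piece of script text. It is not the scanner's implementation: `DelimStripSketch` and its methods are hypothetical names, and trimming surrounding whitespace before matching is an assumption.

```java
/** Hypothetical helper for illustration; not part of HTMLScanner. */
final class DelimStripSketch {

    /** Removes one leading open and one trailing close delimiter if both are present. */
    static String strip(String content, String open, String close) {
        String trimmed = content.trim(); // assumption: surrounding whitespace is ignored
        if (trimmed.length() >= open.length() + close.length()
                && trimmed.startsWith(open) && trimmed.endsWith(close)) {
            return trimmed.substring(open.length(), trimmed.length() - close.length());
        }
        return content; // no matching pair: leave the text untouched
    }

    static String stripCommentDelims(String content) {
        return strip(content, "<!--", "-->");
    }

    static String stripCDataDelims(String content) {
        return strip(content, "<![CDATA[", "]]>");
    }

    public static void main(String[] args) {
        System.out.println(stripCommentDelims("<!-- alert(1); -->"));
        System.out.println(stripCDataDelims("<![CDATA[ var x = 1; ]]>"));
    }
}
```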
      • STYLE_STRIP_COMMENT_DELIMS

        public static final String STYLE_STRIP_COMMENT_DELIMS
        Strip HTML comment delimiters ("<!--" and "-->") from STYLE tag contents.
        See Also:
        Constant Field Values
      • STYLE_STRIP_CDATA_DELIMS

        public static final String STYLE_STRIP_CDATA_DELIMS
        Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from STYLE tag contents.
        See Also:
        Constant Field Values
      • IGNORE_SPECIFIED_CHARSET

        public static final String IGNORE_SPECIFIED_CHARSET
        Ignore the charset specified in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag or in the <?xml … encoding='…'> processing instruction.
        See Also:
        Constant Field Values
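The effect of this feature can be pictured as a choice between the encoding already in use and the one a document declares for itself. The sketch below is an assumption-laden illustration, not the scanner's code: `CharsetSketch`, `specifiedCharset`, and `chooseEncoding` are hypothetical names, and the regular expression covers only the simple `charset=` form.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of HTMLScanner. */
final class CharsetSketch {

    // Matches charset=VALUE inside a Content-Type value, with optional quotes.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or null if none is present. */
    static String specifiedCharset(String content) {
        Matcher m = CHARSET.matcher(content);
        return m.find() ? m.group(1) : null;
    }

    /** Picks the encoding to use, honoring the "ignore specified charset" switch. */
    static String chooseEncoding(String current, String content, boolean ignoreSpecified) {
        if (ignoreSpecified) {
            return current; // the document's own declaration is not consulted
        }
        String specified = specifiedCharset(content);
        return specified != null ? specified : current;
    }

    public static void main(String[] args) {
        System.out.println(chooseEncoding("windows-1252", "text/html;charset=UTF-8", false));
        System.out.println(chooseEncoding("windows-1252", "text/html;charset=UTF-8", true));
    }
}
```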
      • CDATA_EARLY_CLOSING

        public static final String CDATA_EARLY_CLOSING
        Treat '>' as closing a CDATA section (see the HTML specification).
        See Also:
        Constant Field Values
      • OVERRIDE_DOCTYPE

        public static final String OVERRIDE_DOCTYPE
        Override doctype declaration public and system identifiers.
        See Also:
        Constant Field Values
      • PARSE_NOSCRIPT_CONTENT

        public static final String PARSE_NOSCRIPT_CONTENT
        Parse <noscript>...</noscript> content.
        See Also:
        Constant Field Values
      • ALLOW_SELFCLOSING_IFRAME

        public static final String ALLOW_SELFCLOSING_IFRAME
        Allows the self-closing <iframe/> tag.
        See Also:
        Constant Field Values
      • ALLOW_SELFCLOSING_SCRIPT

        public static final String ALLOW_SELFCLOSING_SCRIPT
        Allows the self-closing <script/> tag.
        See Also:
        Constant Field Values
      • ALLOW_SELFCLOSING_TAGS

        public static final String ALLOW_SELFCLOSING_TAGS
        Allows self-closing tags, e.g. <div/> (XHTML).
        See Also:
        Constant Field Values
      • PLAIN_ATTRIBUTE_VALUES

        public static final String PLAIN_ATTRIBUTE_VALUES
        Also store the plain attribute values.
        See Also:
        Constant Field Values
      • NAMES_ELEMS

        public static final String NAMES_ELEMS
        Modify HTML element names: { "upper", "lower", "default" }.
        See Also:
        Constant Field Values
      • NAMES_ATTRS

        public static final String NAMES_ATTRS
        Modify HTML attribute names: { "upper", "lower", "default" }.
        See Also:
        Constant Field Values
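The two name properties accept one of three string values. A minimal sketch of how such a value could be turned into a mode and applied to a name follows. `NameModeSketch` and its `Mode` enum are hypothetical; the real class uses `short` constants whose values are not shown here, and treating "default" (and any unknown value) as "no change" is an assumption.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of HTMLScanner. */
final class NameModeSketch {

    enum Mode { NO_CHANGE, UPPERCASE, LOWERCASE }

    /** Maps a property value to a mode. */
    static Mode parseMode(String value) {
        if ("upper".equals(value)) {
            return Mode.UPPERCASE;
        }
        if ("lower".equals(value)) {
            return Mode.LOWERCASE;
        }
        return Mode.NO_CHANGE; // "default" and, by assumption, anything else
    }

    /** Applies the mode to an element or attribute name. */
    static String modifyName(String name, Mode mode) {
        switch (mode) {
            case UPPERCASE:
                return name.toUpperCase(Locale.ROOT);
            case LOWERCASE:
                return name.toLowerCase(Locale.ROOT);
            default:
                return name;
        }
    }

    public static void main(String[] args) {
        System.out.println(modifyName("Div", parseMode("upper")));
        System.out.println(modifyName("Div", parseMode("lower")));
        System.out.println(modifyName("Div", parseMode("default")));
    }
}
```

`Locale.ROOT` is used so that case conversion does not depend on the platform's default locale.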
      • STATE_CONTENT

        protected static final short STATE_CONTENT
        State: content.
        See Also:
        Constant Field Values
      • STATE_MARKUP_BRACKET

        protected static final short STATE_MARKUP_BRACKET
        State: markup bracket.
        See Also:
        Constant Field Values
      • STATE_START_DOCUMENT

        protected static final short STATE_START_DOCUMENT
        State: start document.
        See Also:
        Constant Field Values
      • STATE_END_DOCUMENT

        protected static final short STATE_END_DOCUMENT
        State: end document.
        See Also:
        Constant Field Values
      • NAMES_NO_CHANGE

        protected static final short NAMES_NO_CHANGE
        Don't modify HTML names.
        See Also:
        Constant Field Values
      • NAMES_UPPERCASE

        protected static final short NAMES_UPPERCASE
        Uppercase HTML names.
        See Also:
        Constant Field Values
      • NAMES_LOWERCASE

        protected static final short NAMES_LOWERCASE
        Lowercase HTML names.
        See Also:
        Constant Field Values
      • DEBUG_CALLBACKS

        protected static final boolean DEBUG_CALLBACKS
        Set to true to debug callbacks.
        See Also:
        Constant Field Values
      • fNamesElems

        protected short fNamesElems
        Modify HTML element names.
      • fNamesAttrs

        protected short fNamesAttrs
        Modify HTML attribute names.
      • fDefaultIANAEncoding

        protected String fDefaultIANAEncoding
        Default encoding.
      • fDoctypePubid

        protected String fDoctypePubid
        Doctype declaration public identifier.
      • fDoctypeSysid

        protected String fDoctypeSysid
        Doctype declaration system identifier.
      • fBeginLineNumber

        protected int fBeginLineNumber
        Beginning line number.
      • fBeginColumnNumber

        protected int fBeginColumnNumber
        Beginning column number.
      • fBeginCharacterOffset

        protected int fBeginCharacterOffset
        Beginning character offset in the file.
      • fCurrentEntityStack

        protected final MiniStack<org.htmlunit.cyberneko.HTMLScanner.CurrentEntity> fCurrentEntityStack
        The current entity stack.
      • fScannerState

        protected short fScannerState
        The current scanner state.
      • fIANAEncoding

        protected String fIANAEncoding
        Auto-detected IANA encoding.
      • fJavaEncoding

        protected String fJavaEncoding
        Auto-detected Java encoding.
      • fElementCount

        protected int fElementCount
        Element count.
      • fElementDepth

        protected int fElementDepth
        Element depth.
      • fSpecialScanner

        protected final HTMLScanner.SpecialScanner fSpecialScanner
        Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references. For example: <SCRIPT> and <COMMENT>.
      • fStringBuffer

        protected final XMLString fStringBuffer
        String buffer.
    • Method Detail

      • pushInputSource

        public void pushInputSource​(XMLInputSource inputSource)
        Pushes an input source onto the current entity stack. This enables the scanner to transparently scan new content (e.g. the output written by an embedded script). At the end of the current entity, the scanner returns where it left off at the time this entity source was pushed.

        Note: This functionality is experimental at this time and is subject to change in future releases of NekoHTML.

        Parameters:
        inputSource - The new input source to start scanning.
        See Also:
        evaluateInputSource(XMLInputSource)
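The "return where it left off" behavior described for pushInputSource can be modeled with a stack of character sources. The sketch below is a simplified stand-in, not the scanner's entity stack: `EntityStackSketch` is a hypothetical class that works on plain `java.io.Reader` objects rather than XMLInputSource.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model of a stack of input sources; not part of HTMLScanner. */
final class EntityStackSketch {

    private final Deque<Reader> stack = new ArrayDeque<>();

    EntityStackSketch(Reader original) {
        stack.push(original);
    }

    /** Pushes a new source; reading continues from it until it is exhausted. */
    void push(Reader pushed) {
        stack.push(pushed);
    }

    /** Returns the next character, or -1 once every source is exhausted. */
    int read() throws IOException {
        while (!stack.isEmpty()) {
            int c = stack.peek().read();
            if (c != -1) {
                return c;
            }
            stack.pop().close(); // finished: resume the source underneath
        }
        return -1;
    }

    /** Reads "ab", pushing "XY" after the first character. */
    static String demo() {
        try {
            EntityStackSketch s = new EntityStackSketch(new StringReader("ab"));
            StringBuilder out = new StringBuilder();
            out.append((char) s.read());
            s.push(new StringReader("XY"));
            for (int c = s.read(); c != -1; c = s.read()) {
                out.append((char) c);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "aXYb"
    }
}
```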
      • evaluateInputSource

        public void evaluateInputSource​(XMLInputSource inputSource)
        Immediately evaluates an input source and adds the new content (e.g. the output written by an embedded script).
        Parameters:
        inputSource - The new input source to start evaluating.
        See Also:
        pushInputSource(XMLInputSource)
      • cleanup

        public void cleanup​(boolean closeall)
        Cleans up used resources. For example, if scanning is terminated early, then this method ensures all remaining open streams are closed.
        Parameters:
        closeall - Close all streams, including the original. This is used in cases when the application has opened the original document stream and should be responsible for closing it.
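One way to read the closeall flag is that pushed streams are always closed, while the original stream is closed only when the flag is true. The sketch below illustrates that reading under stated assumptions: `CleanupSketch` is a hypothetical class, and treating the first stream opened as "the original" is an assumption.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model of stream cleanup; not part of HTMLScanner. */
final class CleanupSketch {

    // Most recently opened stream is at the head; the original is at the tail.
    private final Deque<Closeable> streams = new ArrayDeque<>();

    void open(Closeable stream) {
        streams.push(stream);
    }

    /** Closes pushed streams; closes the original as well only if closeAll is true. */
    void cleanup(boolean closeAll) {
        int keep = closeAll ? 0 : 1;
        while (streams.size() > keep) {
            try {
                streams.pop().close();
            } catch (IOException e) {
                // best effort: keep closing the remaining streams
            }
        }
    }

    int openCount() {
        return streams.size();
    }

    public static void main(String[] args) {
        CleanupSketch c = new CleanupSketch();
        c.open(() -> System.out.println("closing original"));
        c.open(() -> System.out.println("closing pushed"));
        c.cleanup(false);
        System.out.println("still open: " + c.openCount());
    }
}
```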
      • getPublicId

        public String getPublicId()
        Returns the public identifier.
        Specified by:
        getPublicId in interface Locator
      • getBaseSystemId

        public String getBaseSystemId()
        Returns the base system identifier.
        Specified by:
        getBaseSystemId in interface XMLLocator
        Returns:
        the base system identifier.
      • getLiteralSystemId

        public String getLiteralSystemId()
        Returns the literal system identifier.
        Specified by:
        getLiteralSystemId in interface XMLLocator
        Returns:
        the literal system identifier.
      • getSystemId

        public String getSystemId()
        Returns the expanded system identifier.
        Specified by:
        getSystemId in interface Locator
      • getLineNumber

        public int getLineNumber()
        Returns the current line number.
        Specified by:
        getLineNumber in interface Locator
      • getColumnNumber

        public int getColumnNumber()
        Returns the current column number.
        Specified by:
        getColumnNumber in interface Locator
      • getCharacterOffset

        public int getCharacterOffset()
        Returns the character offset.
        Specified by:
        getCharacterOffset in interface XMLLocator
        Returns:
        the character offset, or -1 if no character offset is available.
      • getFeatureDefault

        public Boolean getFeatureDefault​(String featureId)
        Returns the default state for a feature.
        Specified by:
        getFeatureDefault in interface HTMLComponent
        Specified by:
        getFeatureDefault in interface XMLComponent
        Parameters:
        featureId - The feature identifier.
        Returns:
        the default state for a feature, or null if this component does not want to report a default value for this feature.
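Because the return type is `Boolean` rather than `boolean`, a caller has three cases to handle: true, false, and "no default reported" (null). The sketch below shows one way to deal with that. `FeatureDefaultsSketch` is hypothetical, and the `example.org` feature identifiers are placeholders, not features of this component.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of feature defaults; not part of HTMLScanner. */
final class FeatureDefaultsSketch {

    private static final Map<String, Boolean> DEFAULTS = new LinkedHashMap<>();

    static {
        // Placeholder identifiers, not real features of this component.
        DEFAULTS.put("http://example.org/features/demo-on", Boolean.TRUE);
        DEFAULTS.put("http://example.org/features/demo-off", Boolean.FALSE);
    }

    /** Returns the default, or null when no default is reported for the feature. */
    static Boolean getFeatureDefault(String featureId) {
        return DEFAULTS.get(featureId);
    }

    /** Resolves the three-state answer into a plain boolean using a caller-supplied fallback. */
    static boolean effectiveState(String featureId, boolean fallback) {
        Boolean reported = getFeatureDefault(featureId);
        return reported != null ? reported : fallback;
    }

    public static void main(String[] args) {
        System.out.println(effectiveState("http://example.org/features/demo-on", false));
        System.out.println(effectiveState("http://example.org/features/unknown", false));
    }
}
```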
      • getPropertyDefault

        public Object getPropertyDefault​(String propertyId)
        Returns the default state for a property.
        Specified by:
        getPropertyDefault in interface HTMLComponent
        Specified by:
        getPropertyDefault in interface XMLComponent
        Parameters:
        propertyId - The property identifier.
        Returns:
        the default state for a property, or null if this component does not want to report a default value for this property.
      • getRecognizedFeatures

        public String[] getRecognizedFeatures()
        Returns recognized features.
        Specified by:
        getRecognizedFeatures in interface XMLComponent
        Returns:
        an array of feature identifiers that are recognized by this component. This method may return null if no features are recognized by this component.
      • getRecognizedProperties

        public String[] getRecognizedProperties()
        Returns recognized properties.
        Specified by:
        getRecognizedProperties in interface XMLComponent
        Returns:
        an array of property identifiers that are recognized by this component. This method may return null if no properties are recognized by this component.
      • setFeature

        public void setFeature​(String featureId,
                               boolean state)
        Sets a feature.
        Specified by:
        setFeature in interface XMLComponent
        Parameters:
        featureId - The feature identifier.
        state - The state of the feature.
      • setInputSource

        public void setInputSource​(XMLInputSource source)
                            throws IOException
        Sets the input source.
        Parameters:
        source - The input source.
        Throws:
        IOException - Thrown on I/O error.
      • scanDocument

        public boolean scanDocument​(boolean complete)
                             throws XNIException,
                                    IOException
        Scans a document.
        Parameters:
        complete - True if the scanner should scan the document completely, pushing all events to the registered document handler. A value of false indicates that the scanner should only scan the next portion of the document and return. A scanner instance is permitted to completely scan a document if it does not support this "pull" scanning model.
        Returns:
        True if there is more to scan, false otherwise.
        Throws:
        IOException - Thrown on I/O error.
        XNIException - on error.
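The difference between complete and incremental ("pull") scanning can be seen in a toy model. The sketch below is not the scanner itself: `PullScanSketch` is a hypothetical class that treats an array of strings as the document and records each one as an "event". Its `scanDocument` mirrors only the documented contract of the `complete` flag and the return value.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy model of pull scanning; not part of HTMLScanner. */
final class PullScanSketch {

    private final String[] chunks;
    private int next;
    final List<String> events = new ArrayList<>();

    PullScanSketch(String... chunks) {
        this.chunks = chunks;
    }

    /**
     * With complete == true, emits every remaining chunk.
     * With complete == false, emits one chunk and returns.
     * Returns true if there is more to scan.
     */
    boolean scanDocument(boolean complete) {
        do {
            if (next >= chunks.length) {
                return false;
            }
            events.add(chunks[next++]);
        } while (complete);
        return next < chunks.length;
    }

    public static void main(String[] args) {
        PullScanSketch scanner = new PullScanSketch("<p>", "text", "</p>");
        int calls = 1;
        while (scanner.scanDocument(false)) {
            calls++; // the caller regains control between portions
        }
        System.out.println(calls + " calls, events: " + scanner.events);
    }
}
```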
      • systemId

        public static String systemId​(String systemId,
                                      String baseSystemId)
        Expands a system id and returns the system id as a URI, if it can be expanded. A return value of null means that the identifier is already expanded. An exception thrown indicates a failure to expand the id.
        Parameters:
        systemId - The systemId to be expanded.
        baseSystemId - The base system identifier used to expand systemId.
        Returns:
        Returns the URI string representing the expanded system identifier. A null value indicates that the given system identifier is already expanded.
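The documented contract (expanded URI on success, null when the identifier is already expanded, exception on failure) can be approximated with `java.net.URI`. This is a sketch under assumptions, not the method's actual algorithm: `SystemIdSketch.expand` is a hypothetical name, "already expanded" is taken to mean "is an absolute URI", and the choice of `IllegalArgumentException` for failures is an assumption.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical approximation of system id expansion; not part of HTMLScanner. */
final class SystemIdSketch {

    /** Resolves systemId against baseSystemId; returns null if systemId is already absolute. */
    static String expand(String systemId, String baseSystemId) {
        try {
            URI uri = new URI(systemId);
            if (uri.isAbsolute()) {
                return null; // already expanded, per the documented contract
            }
            if (baseSystemId == null) {
                throw new IllegalArgumentException("No base to expand: " + systemId);
            }
            return new URI(baseSystemId).resolve(uri).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot expand system id: " + systemId, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(expand("strict.dtd", "http://www.w3.org/TR/html4/loose.dtd"));
        System.out.println(expand("http://www.w3.org/TR/html4/strict.dtd", null));
    }
}
```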
      • fixURI

        protected static String fixURI​(String str)
        Fixes a platform dependent filename to standard URI form.
        Parameters:
        str - The string to fix.
        Returns:
        Returns the fixed URI string.
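A typical example of a platform-dependent filename is a Windows path with backslashes and a drive letter. The sketch below shows one plausible normalization; it is an assumption about what "standard URI form" involves, not the method's actual code. `FixUriSketch.fixUri` is a hypothetical name, and the backslash is hard-coded here so that the result does not depend on the platform running the example.

```java
/** Hypothetical filename normalization; not part of HTMLScanner. */
final class FixUriSketch {

    /** Converts backslashes to slashes and prefixes drive-letter paths with a slash. */
    static String fixUri(String path) {
        String s = path.replace('\\', '/');
        if (s.length() >= 2 && s.charAt(1) == ':') {
            char drive = Character.toUpperCase(s.charAt(0));
            if (drive >= 'A' && drive <= 'Z') {
                s = "/" + s; // "C:/docs" becomes "/C:/docs"
            }
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(fixUri("C:\\docs\\index.html"));
        System.out.println(fixUri("/usr/share/index.html"));
    }
}
```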
      • modifyName

        protected static String modifyName​(String name,
                                           short mode)
      • getNamesValue

        protected static short getNamesValue​(String value)
      • setScannerState

        protected void setScannerState​(short state)
      • locationAugs

        protected final Augmentations locationAugs​(org.htmlunit.cyberneko.HTMLScanner.CurrentEntity currentEntity)
      • synthesizedAugs

        protected final Augmentations synthesizedAugs()
      • nextContent

        protected String nextContent​(int len)
                              throws IOException
        Reads the next characters WITHOUT impacting the buffer content up to current offset.
        Parameters:
        len - the number of characters to read
        Returns:
        the read string (length may be smaller if EOF is encountered)
        Throws:
        IOException - in case of I/O problems
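Reading ahead without consuming input is a look-ahead (peek) operation. The sketch below shows the idea using `BufferedReader.mark` and `reset`; it is not how the scanner manages its own buffer. `PeekSketch.peek` is a hypothetical name, and wrapping `IOException` in `UncheckedIOException` is a simplification for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical look-ahead helper; not part of HTMLScanner. */
final class PeekSketch {

    /** Returns up to len upcoming characters and leaves the reader position unchanged. */
    static String peek(BufferedReader reader, int len) {
        try {
            reader.mark(len); // remember the current position
            char[] buf = new char[len];
            int total = 0;
            while (total < len) {
                int n = reader.read(buf, total, len - total);
                if (n == -1) {
                    break; // end of input: result is shorter than len
                }
                total += n;
            }
            reader.reset(); // rewind so later reads see the same characters
            return new String(buf, 0, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BufferedReader reader = new BufferedReader(new StringReader("<!DOCTYPE html>"));
        System.out.println(peek(reader, 9));
        System.out.println(peek(reader, 9)); // same characters again
    }
}
```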
      • readPreservingBufferContent

        protected int readPreservingBufferContent()
                                           throws IOException
        Throws:
        IOException