Class XHTMLParser

  • Direct Known Subclasses:
    HCParser

    @NotThreadSafe
    public class XHTMLParser
    extends Object
    Utility class for parsing strings as XHTML.
    Author:
    Philip Helger
    • Method Detail

      • createDefaultSAXReaderSettings

        @Nonnull
        @ReturnsMutableCopy
        public static com.helger.xml.serialize.read.SAXReaderSettings createDefaultSAXReaderSettings()
      • getHTMLVersion

        @Nonnull
        public EHTMLVersion getHTMLVersion()
        Returns:
        The HTML version as specified in the constructor. Never null.
      • getAdditionalSAXReaderSettings

        @Deprecated(forRemoval=true,
                    since="9.1.1")
        @Nonnull
        @ReturnsMutableCopy
        public com.helger.xml.serialize.read.SAXReaderSettings getAdditionalSAXReaderSettings()
        Deprecated, for removal: This API element is subject to removal in a future version.
        Returns:
        A copy of the additional SAX reader settings that are used for parsing. By default, secure processing is active, which disallows inline DTDs in HTML documents.
      • getSAXReaderSettings

        @Nonnull
        @ReturnsMutableCopy
        public com.helger.xml.serialize.read.SAXReaderSettings getSAXReaderSettings()
        Returns:
        A copy of the additional SAX reader settings that are used for parsing. By default, secure processing is active, which disallows inline DTDs in HTML documents.
        Since:
        9.1.1
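The effect of disallowing inline DTDs can be illustrated with the plain JDK SAX API. The following is a minimal sketch and not this library's implementation: the class `SecureSaxSketch` and the method `parsesWithoutDoctype` are hypothetical names, and the assumption is that the default secure processing behaves comparably to the JDK parser's `disallow-doctype-decl` feature.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only (not part of XHTMLParser).
final class SecureSaxSketch {
  private SecureSaxSketch() {}

  /** Returns true if sXML is well-formed and contains no DOCTYPE declaration. */
  static boolean parsesWithoutDoctype(String sXML) {
    try {
      SAXParserFactory aFactory = SAXParserFactory.newInstance();
      aFactory.setNamespaceAware(true);
      // Generic JAXP switch for secure processing
      aFactory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
      // Assumption: rejecting DOCTYPE is how "no inline DTDs" is modelled here
      aFactory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
      SAXParser aParser = aFactory.newSAXParser();
      // DefaultHandler rethrows fatal errors, so a DOCTYPE ends in an exception
      aParser.parse(new InputSource(new StringReader(sXML)), new DefaultHandler());
      return true;
    } catch (Exception ex) {
      // Configuration errors and parse errors are both treated as "rejected"
      return false;
    }
  }
}
```

With this sketch, a document that declares its own entities in an inline DTD is rejected, while the same markup without the DOCTYPE is accepted.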
      • setAdditionalSAXReaderSettings

        @Deprecated(forRemoval=true,
                    since="9.1.1")
        public void setAdditionalSAXReaderSettings(@Nullable com.helger.xml.serialize.read.ISAXReaderSettings aAdditionalSaxReaderSettings)
        Deprecated, for removal: This API element is subject to removal in a future version.
        Set additional SAX reader settings that are used when an XHTML fragment is read. All settings are reused when parsing, except for the entity resolver, which is always set to the default HTMLEntityResolver.
        Parameters:
        aAdditionalSaxReaderSettings - The settings to be used. May be null.
      • setSAXReaderSettings

        @Nonnull
        public XHTMLParser setSAXReaderSettings(@Nullable com.helger.xml.serialize.read.ISAXReaderSettings aAdditionalSaxReaderSettings)
        Set additional SAX reader settings that are used when an XHTML fragment is read. All settings are reused when parsing, except for the entity resolver, which is always set to the default HTMLEntityResolver.
        Parameters:
        aAdditionalSaxReaderSettings - The settings to be used. May be null.
        Returns:
        this for chaining
        Since:
        9.1.1
      • looksLikeXHTML

        public static boolean looksLikeXHTML(@Nullable String sText)
        Check whether the passed text looks like it contains XHTML code. This is a heuristic check only and does not perform actual parsing!
        Parameters:
        sText - The text to check. May be null.
        Returns:
        true if the text looks like XHTML, false otherwise.
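The heuristic itself is not documented here. As a rough illustration of what a parse-free check can look like, the following sketch searches for something shaped like a tag. The class `LooksLikeMarkupSketch`, the method `looksLikeMarkup`, and the regular expression are all assumptions made for this example; the actual rule used by looksLikeXHTML may differ.

```java
import java.util.regex.Pattern;

// Hypothetical heuristic, for illustration only (not the library's algorithm).
final class LooksLikeMarkupSketch {
  // Matches an opening, closing or self-closing tag such as <p>, </p> or <br />
  private static final Pattern TAG =
      Pattern.compile("</?[a-zA-Z][a-zA-Z0-9:-]*(\\s[^<>]*)?/?>");

  private LooksLikeMarkupSketch() {}

  /** Returns true if the text contains at least one tag-like token. */
  static boolean looksLikeMarkup(String sText) {
    // No parsing happens here, so malformed markup can still return true
    return sText != null && TAG.matcher(sText).find();
  }
}
```

A check of this kind is cheap and can be used to decide whether a slower parsing step is worth attempting, at the cost of occasional false positives.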
      • isValidXHTMLFragment

        public boolean isValidXHTMLFragment(@Nullable String sXHTMLFragment)
        Check if the given fragment is valid XHTML 1.1 mark-up. This method tries to parse the XHTML fragment, so it is potentially slow!
        Parameters:
        sXHTMLFragment - The XHTML fragment to parse. May be null. No check is made whether the value looks like HTML.
        Returns:
        true if the fragment is valid, false otherwise.
      • parseXHTMLFragment

        @Nullable
        public com.helger.xml.microdom.IMicroDocument parseXHTMLFragment(@Nullable String sXHTMLFragment)
        Parse the given fragment as XHTML 1.1. This is a convenience method that uses the predefined XHTML 1.1 document type.
        Parameters:
        sXHTMLFragment - The XHTML fragment to parse. May be null.
        Returns:
        null if parsing failed.
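The "return null on failure" contract can be sketched with the JDK DOM API. This is an analogy under stated assumptions, not the library's code: `FragmentParseSketch`, `parseFragment` and `isWellFormedFragment` are hypothetical names, the fragment is wrapped in a `div` root so that it forms a complete document, a null input is assumed to yield null, and only well-formedness is checked rather than validity against XHTML 1.1.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only (not part of XHTMLParser).
final class FragmentParseSketch {
  private FragmentParseSketch() {}

  /** Returns the parsed document, or null if the fragment could not be parsed. */
  static Document parseFragment(String sFragment) {
    if (sFragment == null)
      return null; // assumption: null input is treated as a failure
    try {
      DocumentBuilderFactory aFactory = DocumentBuilderFactory.newInstance();
      aFactory.setNamespaceAware(true);
      aFactory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
      DocumentBuilder aBuilder = aFactory.newDocumentBuilder();
      // DefaultHandler rethrows fatal errors instead of printing them to stderr
      aBuilder.setErrorHandler(new DefaultHandler());
      // A fragment may have several top-level nodes, so give it a single root
      String sWrapped = "<div xmlns=\"http://www.w3.org/1999/xhtml\">" + sFragment + "</div>";
      return aBuilder.parse(new InputSource(new StringReader(sWrapped)));
    } catch (Exception ex) {
      return null;
    }
  }

  /** Boolean variant in the spirit of a validity check: true if parsing succeeded. */
  static boolean isWellFormedFragment(String sFragment) {
    return parseFragment(sFragment) != null;
  }
}
```

Because the boolean variant runs a full parse, it is as slow as parsing itself; callers that need the parsed result should call the parsing method directly and test for null.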
      • parseXHTMLDocument

        @Nullable
        public com.helger.xml.microdom.IMicroDocument parseXHTMLDocument(@Nullable String sXHTML)
        This method parses a full XHTML document into an IMicroDocument using the additional SAX reader settings. The entity resolver is always the HTMLEntityResolver.
        Parameters:
        sXHTML - The complete XHTML document as a string. May be null.
        Returns:
        null if interpretation failed
      • unescapeXHTMLFragment

        @Nullable
        public com.helger.xml.microdom.IMicroContainer unescapeXHTMLFragment(@Nullable String sXHTML)
        Interpret the passed XHTML fragment as HTML and retrieve a result container with all body elements.
        Parameters:
        sXHTML - The XHTML text fragment. This fragment is parsed as an HTML body and must therefore not contain a <body> tag.
        Returns:
        null if the passed text could not be interpreted as XHTML or if no body element was found; otherwise an IMicroContainer with all body children.
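The idea of parsing a fragment as a body and returning only the body's children can be sketched with the JDK DOM API. This is a minimal illustration under assumptions, not the library's implementation: `BodyChildrenSketch` and `bodyChildren` are hypothetical names, the surrounding `html`/`head`/`body` skeleton is invented for the example, and a plain `List<Node>` stands in for the IMicroContainer.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only (not part of XHTMLParser).
final class BodyChildrenSketch {
  private BodyChildrenSketch() {}

  /** Returns all child nodes of the body, or null if parsing failed. */
  static List<Node> bodyChildren(String sFragment) {
    if (sFragment == null)
      return null; // assumption: null input is treated as a failure
    try {
      DocumentBuilder aBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      // DefaultHandler rethrows fatal errors instead of printing them to stderr
      aBuilder.setErrorHandler(new DefaultHandler());
      // The fragment is placed inside a body, so it must not bring its own
      String sDoc = "<html><head><title></title></head><body>" + sFragment + "</body></html>";
      Document aDoc = aBuilder.parse(new InputSource(new StringReader(sDoc)));
      NodeList aBodies = aDoc.getElementsByTagName("body");
      if (aBodies.getLength() == 0)
        return null;
      // Copy text and element children of the first body in document order
      List<Node> aResult = new ArrayList<>();
      NodeList aChildren = aBodies.item(0).getChildNodes();
      for (int i = 0; i < aChildren.getLength(); i++)
        aResult.add(aChildren.item(i));
      return aResult;
    } catch (Exception ex) {
      return null;
    }
  }
}
```

Returning the children rather than the body element lets the caller splice mixed text and element content into an existing tree without an extra wrapper node.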