Interface LinkExtractorParser

    public interface LinkExtractorParser

    Interface specifying the contract of a content parser that extracts links.

    Since:

    3.0

    • Method Summary

      Modifier and Type Method Description
      abstract Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] responseData, URL baseUrl, String encoding) Get the URLs of all the resources that a browser would automatically download after downloading the content, that is: images, stylesheets, JavaScript files, applets, etc.
      abstract boolean isReusable()

    • Method Detail

      • getEmbeddedResourceURLs

        abstract Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] responseData, URL baseUrl, String encoding)

        Get the URLs of all the resources that a browser would automatically download after downloading the content, that is: images, stylesheets, JavaScript files, applets, etc.

        URLs should not appear twice in the returned iterator.

        Malformed URLs can be reported to the caller by having the Iterator return the corresponding URL String. Overall problems parsing the HTML should be reported by throwing an HTMLParseException.

        Parameters:
        userAgent - the User-Agent string
        responseData - the response data, as raw bytes
        baseUrl - base URL from which the HTML code was obtained
        encoding - name of the charset of the response data
        Returns:

        an Iterator for the resource URLs
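
The contract above can be illustrated with a minimal sketch of an implementation. Everything here beyond the two documented method signatures is an assumption made for illustration: the class name RegexLinkExtractorParser is hypothetical, the interface is redeclared locally so the example compiles on its own, only the src attributes of img and script tags are scanned (a real parser would cover stylesheets and other resources too), malformed references are simply skipped, and the value returned by isReusable() is a guess because its semantics are not described on this page.

```java
import java.net.MalformedURLException;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Local stand-in mirroring the two methods documented above,
// so that this sketch compiles without any external library.
interface LinkExtractorParser {
    Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] responseData,
                                          URL baseUrl, String encoding);

    boolean isReusable();
}

// Hypothetical, deliberately simplistic implementation (not a real HTML parser).
class RegexLinkExtractorParser implements LinkExtractorParser {

    // Matches src="..." or src='...' inside <img ...> and <script ...> tags only.
    private static final Pattern SRC_ATTRIBUTE = Pattern.compile(
            "<(?:img|script)\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    @Override
    public Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] responseData,
                                                 URL baseUrl, String encoding) {
        // Decode the raw bytes with the charset named by the caller.
        String html = new String(responseData, Charset.forName(encoding));

        List<URL> urls = new ArrayList<>();
        // Track the string form of each URL so that none is returned twice.
        Set<String> seen = new HashSet<>();

        Matcher matcher = SRC_ATTRIBUTE.matcher(html);
        while (matcher.find()) {
            try {
                // Resolve relative references against the base URL.
                URL resolved = baseUrl.toURI().resolve(matcher.group(1).trim()).toURL();
                if (seen.add(resolved.toExternalForm())) {
                    urls.add(resolved);
                }
            } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
                // Assumption: this sketch silently skips references it cannot resolve.
            }
        }
        return urls.iterator();
    }

    @Override
    public boolean isReusable() {
        // Assumption: the sketch keeps no per-call state, so it reports true.
        return true;
    }
}
```

Duplicates are detected by comparing the string form of each URL rather than by putting URL objects in a set, because URL.equals may perform host name resolution. The userAgent parameter is accepted but unused in this sketch.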