Html2sax is a DFA-based parser for HTML pages. The motivation is simple: many HTML pages on the web do not conform to the standard. Web browsers have special parsers that try to repair problematic pages before displaying them. This can get quite complex because there are often many alternative ways to fix an error.

Design

Html2sax is designed to be the front end of a web spider that reads websites. It can handle (almost?) all error situations, but will not try to correct problematic HTML pages. It operates on a very low level and is quite fast: tests showed that it is twice as fast as HTML Tidy. It will replace HTML entities like "&" to "&amp;". If you need correct HTML documents, you have to repair them yourself.

Restrictions

There are several restrictions for Html2sax that you should be aware of:

Requirements

The only requirement for the parser is a Java™ 1.6 JRE. Older Java versions won't work because the code uses generics.

Usage

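Usage is simple. The following example runs the parser on the URL given as the first command-line argument. YourCallback stands for your own subclass of org.xml.sax.helpers.DefaultHandler: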
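		import java.net.URL;
		import javax.xml.parsers.SAXParser;
		import javax.xml.parsers.SAXParserFactory;
		import org.xml.sax.InputSource;

		SAXParserFactory factory = SAXParserFactory.newInstance("de.sfuhrm.htmltosax.HtmlToSaxParserFactory", null);
		SAXParser parser = factory.newSAXParser();
		YourCallback s = new YourCallback();
		parser.parse(new InputSource(new URL(args[0]).openStream()), s);
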
A working example is in the file Sample.java in the source distribution.
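
The callback class is not shown above, so here is a minimal sketch of what one might look like for a web-spider use case: a handler that collects the href attribute of every a element. The names LinkCollector and collect are hypothetical and not part of Html2sax. So that the sketch runs without Html2sax on the classpath, it is driven by the JDK's default SAX parser on a small well-formed snippet. Whether Html2sax reports element and attribute names in the same case is an assumption, not something this document confirms.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical callback: collects the href attribute of every "a" element. */
class LinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<String>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // The factory used below is not namespace-aware, so the element name arrives in qName.
        if ("a".equalsIgnoreCase(qName)) {
            String href = attributes.getValue("href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    public List<String> getLinks() {
        return links;
    }

    /** Parses the markup with the JDK's default SAX parser (an assumption made for this demo). */
    static List<String> collect(String markup) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        LinkCollector handler = new LinkCollector();
        parser.parse(new InputSource(new StringReader(markup)), handler);
        return handler.getLinks();
    }

    public static void main(String[] args) throws Exception {
        // prints [http://example.org/]
        System.out.println(collect("<html><body><a href=\"http://example.org/\">x</a></body></html>"));
    }
}
```

To use such a handler with Html2sax, an instance would be passed to parser.parse in place of YourCallback in the example above.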

License

Html2sax is licensed under the LGPL 2.1 and only under this license version. Please see www.gnu.org for more details on the license.

Author

The software was written by Stephan Fuhrmann, who can be reached at s_fuhrm (at) web.de.