Html2sax is a DFA-based parser for HTML pages.
The motivation is simple: many HTML pages out there do not
conform to the standard. Web browsers have special parsers that try to
repair problematic pages before displaying them. This can
get quite complex because there are often many alternative ways to fix
the same error.
Design
Html2sax is designed to be the frontend of a web-spider reading websites.
It can handle (almost?) all error situations, but it does not try to
correct problematic HTML pages. It operates on a very low level and is quite fast.
Tests showed that it is twice as fast as HTML Tidy.
It replaces HTML entities such as "&amp;" with the character they stand for ("&").
If you need correct HTML documents, you have to repair them yourself.
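To illustrate what "DFA-based" and "will not try to correct" can mean in practice, here is a minimal sketch of a two-state automaton that splits input into text and tag events. It is not the Html2sax implementation; the class name TinyHtmlDfa, the two states, and the event strings are all assumptions made for this example. Note that the unterminated tag at the end of the input is reported as-is rather than repaired.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative two-state DFA tokenizer; NOT the Html2sax implementation. */
public class TinyHtmlDfa {
    /** The two states of this toy automaton (hypothetical). */
    enum State { TEXT, TAG }

    /** Splits input into "TEXT:..." and "TAG:..." events without repairing anything. */
    public static List<String> tokenize(String input) {
        List<String> events = new ArrayList<String>();
        StringBuilder buf = new StringBuilder();
        State state = State.TEXT;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (state == State.TEXT && c == '<') {
                // Transition TEXT -> TAG: flush pending text first.
                if (buf.length() > 0) {
                    events.add("TEXT:" + buf);
                }
                buf.setLength(0);
                state = State.TAG;
            } else if (state == State.TAG && c == '>') {
                // Transition TAG -> TEXT: report the tag content verbatim.
                events.add("TAG:" + buf);
                buf.setLength(0);
                state = State.TEXT;
            } else {
                // Any other character stays in the current state.
                buf.append(c);
            }
        }
        // End of input: report whatever is left, even an unterminated tag.
        if (buf.length() > 0 || state == State.TAG) {
            events.add((state == State.TAG ? "TAG:" : "TEXT:") + buf);
        }
        return events;
    }

    public static void main(String[] args) {
        // The closing ">" of the last tag is missing on purpose.
        System.out.println(tokenize("<p>Hi <b>there</p"));
    }
}
```

A real HTML automaton needs many more states (attributes, quoting, comments, entities), but the principle is the same: each character triggers one state transition, and broken input produces events instead of repairs.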
Restrictions
There are several restrictions for Html2sax that you should be aware of:
- HTML is not XML. SAX serves only as the API for the parser callbacks; most existing
tools that take SAX input will fail when fed events from Html2sax.
- No DTD support. You need to do your HTML thinking yourself.
- It won't protect your parser callback from meaningless garbage if documents are badly malformed.
- It won't repair corrupt documents.
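Because of these restrictions, a callback should not assume that start and end events are balanced. The following is a minimal sketch of a defensive handler, assuming the events arrive through the standard org.xml.sax.helpers.DefaultHandler methods. The class name TolerantHandler and its bookkeeping are hypothetical, and it is also an assumption that the tag name may arrive in either qName or localName.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that tolerates unbalanced start/end events. */
public class TolerantHandler extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<String>();
    private int strayEndTags = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(name(localName, qName));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String n = name(localName, qName);
        if (open.contains(n)) {
            // Close the element and everything still open inside it.
            while (!open.pop().equals(n)) {
                // Elements popped here were never closed explicitly.
            }
        } else {
            // End tag without a matching start tag: count it and move on.
            strayEndTags++;
        }
    }

    /** Assumption: either name may be filled in, so prefer qName and fall back to localName. */
    private static String name(String localName, String qName) {
        String n = (qName != null && !qName.isEmpty()) ? qName : localName;
        return n == null ? "" : n.toLowerCase(Locale.ROOT);
    }

    /** Number of elements that are currently open. */
    public int openElements() {
        return open.size();
    }

    /** Number of end tags seen without a matching start tag. */
    public int strayEndTags() {
        return strayEndTags;
    }
}
```

Whether implicitly closing inner elements is the right policy depends on the application; a spider that only extracts links may not need any nesting information at all.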
Requirements
The only requirement for the parser is a Java™ 1.6 JRE.
Older Java versions won't work because the code uses generics.
Usage
Usage is quite simple. The following example runs the parser:
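    // Select the Html2sax factory explicitly by its class name.
    SAXParserFactory factory = SAXParserFactory.newInstance("de.sfuhrm.htmltosax.HtmlToSaxParserFactory", null);
    SAXParser parser = factory.newSAXParser();
    // YourCallback is your own handler class, e.g. a DefaultHandler subclass.
    YourCallback s = new YourCallback();
    // Parse the page at the URL given as the first command-line argument.
    parser.parse(new InputSource(new URL(args[0]).openStream()), s);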
A working example is in the file Sample.java in the source distribution.
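For readers without the source distribution at hand, the usage snippet can be expanded into a complete program roughly as follows. This is a sketch, not a copy of Sample.java: the class LinkCollector is hypothetical, and it is an assumption that tag and attribute names arrive in the qName fields. The factory class name is taken from the snippet above, and running main requires Html2sax on the classpath.

```java
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical callback that records the href of every anchor element. */
public class LinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<String>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Assumption: the tag name arrives in qName; HTML names are case-insensitive.
        if (!"a".equalsIgnoreCase(qName)) {
            return;
        }
        for (int i = 0; i < atts.getLength(); i++) {
            if ("href".equalsIgnoreCase(atts.getQName(i))) {
                links.add(atts.getValue(i));
            }
        }
    }

    /** Returns the collected link targets in document order. */
    public List<String> getLinks() {
        return links;
    }

    public static void main(String[] args) throws Exception {
        // Factory class name as given in the usage snippet.
        SAXParserFactory factory = SAXParserFactory.newInstance(
                "de.sfuhrm.htmltosax.HtmlToSaxParserFactory", null);
        SAXParser parser = factory.newSAXParser();
        LinkCollector collector = new LinkCollector();
        InputStream in = new URL(args[0]).openStream();
        try {
            parser.parse(new InputSource(in), collector);
        } finally {
            in.close();
        }
        for (String link : collector.getLinks()) {
            System.out.println(link);
        }
    }
}
```

Invoked with a page URL as its only argument, the program prints one link target per line.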
License
Html2sax is licensed under the LGPL 2.1, and only under this
version of the license. Please see www.gnu.org for more
details on the license.
Author
The software was written by Stephan Fuhrmann. I can be reached at
s_fuhrm (at) web.de.