tautologistics · kirbysayshi · Jun 2, 2012 · Jun 21, 2012 · Jun 21, 2012 · Jul 4, 2012
diff --git a/.gitattributes b/.gitattributes
@@ -0,0 +1,2 @@
+# Auto detect text files and perform LF normalization
+* text eol=lf
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,2 @@
+npm-debug.log
+node_modules
diff --git a/.travis.yml b/.travis.yml
@@ -0,0 +1,5 @@
+language: node_js
+node_js:
+  - 0.8
+  - 0.10
+  - 0.11
diff --git a/CHANGELOG b/CHANGELOG
diff --git a/README.md b/README.md
@@ -1,186 +1,81 @@
-#NodeHtmlParser
-A forgiving HTML/XML/RSS parser written in JS for both the browser and NodeJS (yes, despite the name it works just fine in any modern browser). The parser can handle streams (chunked data) and supports custom handlers for writing custom DOMs/output.
+#htmlparser2 [![NPM version](https://badge.fury.io/js/htmlparser2.png)](https://npmjs.org/package/htmlparser2) [![Build Status](https://secure.travis-ci.org/fb55/htmlparser2.png)](http://travis-ci.org/fb55/htmlparser2) [![Dependency Status](https://david-dm.org/fb55/htmlparser2.png)](https://david-dm.org/fb55/htmlparser2)
 
-##Installing
+A forgiving HTML/XML/RSS parser written in JS for NodeJS. The parser can handle streams (chunked data) and supports custom handlers for writing custom DOMs/output.
 
-	npm install htmlparser
-
-##Running Tests
-
-###Run tests under node:
-	node runtests.js
-
-###Run tests in browser:
-View runtests.html in any browser
-
-##Usage In Node
-	var htmlparser = require("htmlparser");
-	var rawHtml = "Xyz <script language= javascript>var foo = '<<bar>>';< /  script><!--<!-- Waah! -- -->";
-	var handler = new htmlparser.DefaultHandler(function (error, dom) {
-		if (error)
-			[...do something for errors...]
-		else
-			[...parsing done, do something...]
-	});
-	var parser = new htmlparser.Parser(handler);
-	parser.parseComplete(rawHtml);
-	sys.puts(sys.inspect(handler.dom, false, null));
-
-##Usage In Browser
-	var handler = new Tautologistics.NodeHtmlParser.DefaultHandler(function (error, dom) {
-		if (error)
-			[...do something for errors...]
-		else
-			[...parsing done, do something...]
-	});
-	var parser = new Tautologistics.NodeHtmlParser.Parser(handler);
-	parser.parseComplete(document.body.innerHTML);
-	alert(JSON.stringify(handler.dom, null, 2));
-
-##Example output
-	[ { raw: 'Xyz ', data: 'Xyz ', type: 'text' }
-	, { raw: 'script language= javascript'
-	  , data: 'script language= javascript'
-	  , type: 'script'
-	  , name: 'script'
-	  , attribs: { language: 'javascript' }
-	  , children: 
-	     [ { raw: 'var foo = \'<bar>\';<'
-	       , data: 'var foo = \'<bar>\';<'
-	       , type: 'text'
-	       }
-	     ]
-	  }
-	, { raw: '<!-- Waah! -- '
-	  , data: '<!-- Waah! -- '
-	  , type: 'comment'
-	  }
-	]
-
-##Streaming To Parser
-	while (...) {
-		...
-		parser.parseChunk(chunk);
+##Installing
+	npm install htmlparser2
+
+A live demo of htmlparser2 is available at http://demos.forbeslindesay.co.uk/htmlparser2/
+
+##Usage
+
+```javascript
+var htmlparser = require("htmlparser2");
+var parser = new htmlparser.Parser({
+	onopentag: function(name, attribs){
+		if(name === "script" && attribs.type === "text/javascript"){
+			console.log("JS! Hooray!");
+		}
+	},
+	ontext: function(text){
+		console.log("-->", text);
+	},
+	onclosetag: function(tagname){
+		if(tagname === "script"){
+			console.log("That's it?!");
+		}
 	}
-	parser.done();	
+});
+parser.write("Xyz <script type='text/javascript'>var foo = '<<bar>>';</ script>");
+parser.end();
+```
 
-##Parsing RSS/Atom Feeds
+Output (simplified):
 
-	new htmlparser.RssHandler(function (error, dom) {
-		...
-	});
+```javascript
+--> Xyz 
+JS! Hooray!
+--> var foo = '<<bar>>';
+That's it?!
+```
 
-##DefaultHandler Options
+Read more about the parser in the [wiki](https://github.com/fb55/htmlparser2/wiki/Parser-options).
 
-###Usage
-	var handler = new htmlparser.DefaultHandler(
-		  function (error) { ... }
-		, { verbose: false, ignoreWhitespace: true }
-		);
-
-###Option: ignoreWhitespace
-Indicates whether the DOM should exclude text nodes that consists solely of whitespace. The default value is "false".
-
-####Example: true
-The following HTML:
-	<font>
-		<br>this is the text
-	<font>
-becomes:
-	[ { raw: 'font'
-	  , data: 'font'
-	  , type: 'tag'
-	  , name: 'font'
-	  , children: 
-	     [ { raw: 'br', data: 'br', type: 'tag', name: 'br' }
-	     , { raw: 'this is the text\n'
-	       , data: 'this is the text\n'
-	       , type: 'text'
-	       }
-	     , { raw: 'font', data: 'font', type: 'tag', name: 'font' }
-	     ]
-	  }
-	]
-
-####Example: false
-The following HTML:
-	<font>
-		<br>this is the text
-	<font>
-becomes:
-	[ { raw: 'font'
-	  , data: 'font'
-	  , type: 'tag'
-	  , name: 'font'
-	  , children: 
-	     [ { raw: '\n\t', data: '\n\t', type: 'text' }
-	     , { raw: 'br', data: 'br', type: 'tag', name: 'br' }
-	     , { raw: 'this is the text\n'
-	       , data: 'this is the text\n'
-	       , type: 'text'
-	       }
-	     , { raw: 'font', data: 'font', type: 'tag', name: 'font' }
-	     ]
-	  }
-	]
-
-###Option: verbose
-Indicates whether to include extra information on each node in the DOM. This information consists of the "raw" attribute (original, unparsed text found between "<" and ">") and the "data" attribute on "tag", "script", and "comment" nodes. The default value is "true". 
-
-####Example: true
-The following HTML:
-	<a href="test.html">xxx</a>
-becomes:
-	[ { raw: 'a href="test.html"'
-	  , data: 'a href="test.html"'
-	  , type: 'tag'
-	  , name: 'a'
-	  , attribs: { href: 'test.html' }
-	  , children: [ { raw: 'xxx', data: 'xxx', type: 'text' } ]
-	  }
-	]
-
-####Example: false
-The following HTML:
-	<a href="test.html">xxx</a>
-becomes:
-	[ { type: 'tag'
-	  , name: 'a'
-	  , attribs: { href: 'test.html' }
-	  , children: [ { data: 'xxx', type: 'text' } ]
-	  }
-	]
-
-###Option: enforceEmptyTags
-Indicates whether the DOM should prevent children on tags marked as empty in the HTML spec. Typically this should be set to "true" HTML parsing and "false" for XML parsing. The default value is "true".
-
-####Example: true
-The following HTML:
-	<link>text</link>
-becomes:
-	[ { raw: 'link', data: 'link', type: 'tag', name: 'link' }
-	, { raw: 'text', data: 'text', type: 'text' }
-	]
-
-####Example: false
-The following HTML:
-	<link>text</link>
-becomes:
-	[ { raw: 'link'
-	  , data: 'link'
-	  , type: 'tag'
-	  , name: 'link'
-	  , children: [ { raw: 'text', data: 'text', type: 'text' } ]
-	  }
-	]
-
-##DomUtils
-
-###TBD (see utils_example.js for now)
-
-##Related Projects
-
-Looking for CSS selectors to search the DOM? Try Node-SoupSelect, a port of SoupSelect to NodeJS: http://github.com/harryf/node-soupselect
-
-There's also a port of hpricot to NodeJS that uses HtmlParser for HTML parsing: http://github.com/silentrob/Apricot
+##Get a DOM
+The `DomHandler` (known as `DefaultHandler` in the original `htmlparser` module) produces a DOM (document object model) that can be manipulated using the [`DomUtils`](https://github.com/fb55/DomUtils) helper.
+
+The `DomHandler`, while still bundled with this module, was moved to its [own module](https://github.com/fb55/domhandler). Have a look at it for further information.
+
+##Parsing RSS/RDF/Atom Feeds
+
+```javascript
+new htmlparser.FeedHandler(function(<error> error, <object> feed){
+    ...
+});
+```
+
+##Performance
+
+After having some artificial benchmarks for some time, __@AndreasMadsen__ published his [`htmlparser-benchmark`](https://github.com/AndreasMadsen/htmlparser-benchmark), which benchmarks HTML parsers based on real-world websites.
+
+At the time of writing, the latest versions of all supported parsers show the following performance characteristics on [Travis CI](https://travis-ci.org/AndreasMadsen/htmlparser-benchmark/builds/10805007) (please note that Travis doesn't guarantee equal conditions for all tests):
+
+```
+gumbo-parser   : 34.9208 ms/file ± 21.4238
+html-parser    : 24.8224 ms/file ± 15.8703
+html5          : 419.597 ms/file ± 264.265
+htmlparser     : 60.0722 ms/file ± 384.844
+htmlparser2-dom: 12.0749 ms/file ± 6.49474
+htmlparser2    : 7.49130 ms/file ± 5.74368
+hubbub         : 30.4980 ms/file ± 16.4682
+libxmljs       : 14.1338 ms/file ± 18.6541
+parse5         : 22.0439 ms/file ± 15.3743
+sax            : 49.6513 ms/file ± 26.6032
+```
+
+##How is this different from [node-htmlparser](https://github.com/tautologistics/node-htmlparser)?
+This is a fork of the `htmlparser` module. The main difference is that this is intended to be used only with node (it runs on other platforms using [browserify](https://github.com/substack/node-browserify)). `htmlparser2` was rewritten multiple times and, while it maintains an API that's compatible with `htmlparser` in most cases, the projects don't share any code anymore.
+
+The parser now provides a callback interface close to [sax.js](https://github.com/isaacs/sax-js) (originally targeted at [readabilitySAX](https://github.com/fb55/readabilitysax)). As a result, old handlers won't work anymore.
 
+The `DefaultHandler` and the `RssHandler` were renamed to clarify their purpose (to `DomHandler` and `FeedHandler`). The old names are still available when requiring `htmlparser2`, so your code should work as expected.
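
The new README describes a callback interface (`onopentag`, `ontext`, `onclosetag`) and a `DomHandler` that turns those events into a DOM. As a rough illustration of how such a handler could work, here is a minimal sketch of a tree-building handler. It is not the bundled `DomHandler`: the function name `createTreeHandler` and the node shape (`type`, `name`, `attribs`, `children`, `data`) are assumptions made for this example. The events are replayed by hand so that the block runs without the library installed.

```javascript
// Hypothetical tree-building handler; only the callback names
// (onopentag, ontext, onclosetag) come from the README in this diff.
function createTreeHandler() {
	var root = { type: "root", children: [] };
	var stack = [root]; // currently open elements, innermost last

	return {
		dom: root.children,
		onopentag: function (name, attribs) {
			var node = { type: "tag", name: name, attribs: attribs || {}, children: [] };
			stack[stack.length - 1].children.push(node);
			stack.push(node);
		},
		ontext: function (text) {
			stack[stack.length - 1].children.push({ type: "text", data: text });
		},
		onclosetag: function (name) {
			// Never pop the synthetic root, even on a stray closing tag.
			if (stack.length > 1) stack.pop();
		}
	};
}

// Replay the events from the README's usage example by hand.
var handler = createTreeHandler();
handler.ontext("Xyz ");
handler.onopentag("script", { type: "text/javascript" });
handler.ontext("var foo = '<<bar>>';");
handler.onclosetag("script");

console.log(JSON.stringify(handler.dom, null, 2));
```

Because the object exposes the same callback names as the usage example in the README, it could presumably be passed to `new htmlparser.Parser(handler)` in place of the inline object shown there. For real use, the `DomHandler` shipped with the module is the better choice.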