Closed
Commits
471 commits
7750ec1
removed switch in Stream.js
fb55 Jun 2, 2012
04476a0
fixed whitespace
fb55 Jun 21, 2012
18d3f37
quick fix for #19
fb55 Jun 21, 2012
69c9f0f
Fix getOuterHTML for directives
lahmatiy Jul 4, 2012
f8e6aad
Merge pull request #21 from lahmatiy/master
fb55 Jul 18, 2012
82455a9
added lowerCaseAttributeNames option
fb55 Aug 11, 2012
e0d359e
2.3.0
fb55 Aug 14, 2012
a8c13c8
Added a `onopentagend` event
fb55 Aug 14, 2012
c1dfdda
moved DomHandler & DomUtils to their own module
fb55 Aug 14, 2012
c0b7eda
Updated readme
fb55 Aug 14, 2012
a928109
2.3.1
fb55 Aug 14, 2012
b90c1e6
publish the element types from DomHandler
fb55 Aug 14, 2012
b6c4a73
use numeric element types
fb55 Aug 14, 2012
401cc09
don't expose HandlerModule
fb55 Aug 14, 2012
f5925c9
fixed travis badge
fb55 Aug 23, 2012
181c31b
stylistic changes
fb55 Nov 10, 2012
84012d6
use the new dom modules, 2.5.0
fb55 Nov 10, 2012
b3bc413
Made the attribute regular expression more correct with regards to un…
myndzi Nov 30, 2012
0f71a49
I didn't understand how RegExps worked in this way, and was desynchin…
myndzi Nov 30, 2012
f7b6d54
Revert "stylistic changes"
fb55 Dec 4, 2012
c75da20
Revert "Revert "stylistic changes""
fb55 Dec 4, 2012
6730fde
added missing comma in benchmark script
fb55 Dec 4, 2012
840291e
domelementtype must be version 1.x (not 1.0)
fb55 Jan 9, 2013
46cd546
2.5.1
fb55 Jan 9, 2013
a68f329
Merge branch 'master' of https://github.com/fb55/node-htmlparser
Feb 4, 2013
a83c708
Better handling of implied close tags. A list is given of tags whose …
Feb 5, 2013
a1777a9
spaces -> tabs, thought the merge would update my local files to the …
Feb 5, 2013
a126b18
Derp.
Feb 11, 2013
5a72c28
added missing comma in benchmark script
fb55 Dec 4, 2012
eca12d8
domelementtype must be version 1.x (not 1.0)
fb55 Jan 9, 2013
7f0389f
2.5.1
fb55 Jan 9, 2013
8df87ab
Recognize closing CDATA tags as end of "special"
jugglinmike Feb 14, 2013
ef8b078
Merge pull request #31 from jugglinmike/text-after-cdata
fb55 Feb 15, 2013
d21706b
test on node 0.6, 0.8 & 0.9
fb55 Feb 15, 2013
4dc73a5
FeedHandler should return an error when nothing's found
fb55 Feb 15, 2013
e976099
added missing semicolon in test-helper.js
fb55 Feb 15, 2013
36650b8
improved how tests are run
fb55 Feb 15, 2013
610da2c
don't run 03-rdf.js test
fb55 Feb 15, 2013
0746690
renamed tests
fb55 Feb 15, 2013
d1d9cae
added semicolons & use EE#on in 02-stream.js
fb55 Feb 15, 2013
7c77a1f
changed how the end of all tests is shown
fb55 Feb 15, 2013
0494e90
allow `>` at the beginning of a document
fb55 Feb 15, 2013
f707bd7
2.5.2
fb55 Feb 15, 2013
2fc40c5
Merge remote-tracking branch 'upstream/master'
Feb 15, 2013
05a99ef
Tests for changes.
Feb 16, 2013
fe6b8d6
Fixes discussed in https://github.com/fb55/node-htmlparser/pull/28
Feb 16, 2013
33d55cd
Merge pull request #28 from myndzi/master
fb55 Feb 16, 2013
f162767
Update README.md
fb55 Feb 16, 2013
c0bd69c
Do not parse CDATA-like text inside special tags
jugglinmike Feb 21, 2013
5b096bf
Merge pull request #32 from jugglinmike/cdata-inside-special
fb55 Feb 25, 2013
8756001
2.6.0
fb55 Mar 17, 2013
5e6fcb3
landed first version of FSM based tokenizer
fb55 Mar 21, 2013
7be1360
Add a new test for issue #36
Mar 27, 2013
833432b
Merge pull request #37 from eonlepapillon/Add-test-for-Issue-#36
fb55 Mar 28, 2013
d90e7a3
added logic for special tags
fb55 Mar 30, 2013
aa19a0b
[tokenizer] don't fail on `< >` and `< / >`
fb55 Mar 30, 2013
1bc6568
[tokenizer] fixed ordering in cleanup
fb55 Mar 30, 2013
400bf43
[tokenizer] overwrite WritableStream#end, emit everything that's left
fb55 Mar 30, 2013
550b42e
[tokenizer] take care of this._index in cleanup, emit all text
fb55 Mar 30, 2013
dabe165
[tokenizer] set _sectionStart to 0 when text was emitted
fb55 Mar 30, 2013
b9d568a
[tokenizer] call WritableStream#end after emitting the remaining data
fb55 Mar 30, 2013
1144e42
[tokenizer] call .write instead of ._write
fb55 Mar 30, 2013
c3d4025
[parser] use the tokenizer
fb55 Mar 30, 2013
627a38b
removed WritableStream.js and ElementType.js
fb55 Mar 30, 2013
358944e
[parser] made Parser#reset work again
fb55 Mar 30, 2013
5c155ca
fall back to the readable-stream module
fb55 Mar 30, 2013
5a28547
[travis] removed 0.6 & 0.9, added 0.10 and 0.11
fb55 Mar 30, 2013
c445375
minor changes
fb55 Mar 30, 2013
1ab593a
[index.js] removed redundant code
fb55 Mar 30, 2013
f78d1ed
[stream] use a named function
fb55 Mar 30, 2013
1b6a264
3.0.0
fb55 Mar 30, 2013
b48adc2
[tokenizer] always call WritableStream#end
fb55 Mar 30, 2013
17b7ebe
[parser] call Tokenizer#end, clear the stack
fb55 Mar 30, 2013
654c4d4
[index.js] added `createDomStream()` convenience method
fb55 Mar 30, 2013
628b99e
[tokenizer] added `opentagend` event
fb55 Mar 30, 2013
f70f545
[parser] use `opentagend` event
fb55 Mar 30, 2013
b7cc1aa
3.0.1
fb55 Mar 30, 2013
acc0d05
[tokenizer] emit opentagend on selfclosing tags, fixed handling of < …
fb55 Mar 30, 2013
94e794f
[index.js] added tokenizer
fb55 Mar 30, 2013
9793593
[tests] text events now contain more data
fb55 Mar 30, 2013
ab8b653
[tokenizer] don't inherit from stream.Writable, fixed several bugs
fb55 Mar 31, 2013
09b8833
[tests/events] concat text events
fb55 Mar 31, 2013
00d63cf
[tests/events] fixed order of attribute/opentag events, merged text e…
fb55 Mar 31, 2013
643a7f0
[tokenizer] use strings instead of buffers
fb55 Mar 31, 2013
b837b95
[parser] don't implement stream.Writable, use new tokenizer interface
fb55 Mar 31, 2013
db95f00
[tests/stream] fixed order of events
fb55 Mar 31, 2013
e4982e1
[tokenizer] simplified logic
fb55 Mar 31, 2013
1905dd3
[parser] fixed handling of implied closing and empty tags
fb55 Mar 31, 2013
70c6865
[tests/events] accidentally removed part of the document
fb55 Mar 31, 2013
4a7eb12
added a WritableStream interface again
fb55 Mar 31, 2013
a23d7a6
3.0.0 (finally!)
fb55 Mar 31, 2013
1db8148
[tokenizer] changed internal name to `Tokenizer`
fb55 Mar 31, 2013
b7f6df5
[tokenizer] fix for script tags causing following nodes to be interpr…
Apr 3, 2013
9898b9a
[proxyhandler] don't use getters/setters
fb55 Apr 3, 2013
84815a3
added CollectingHandler
fb55 Apr 3, 2013
01d8adf
[tests] use the new CollectingHandler
fb55 Apr 3, 2013
f2542db
[tests] removed unused `f` var
fb55 Apr 4, 2013
fcb35f0
3.0.1
fb55 Apr 4, 2013
605aa6c
Merge pull request #38 from burl/master
fb55 Apr 4, 2013
779e608
3.0.2
fb55 Apr 4, 2013
c848d69
[bench] use setImmediate instead of process.nextTick
fb55 Apr 4, 2013
1384620
[bench] try to test all available modules
fb55 Apr 5, 2013
9f465ca
[bench] removed unused functions, improved output
fb55 Apr 5, 2013
2f38140
[readme] updated benchmarks
fb55 Apr 5, 2013
bc00862
[doc] call `end`, use single quotes
fb55 Apr 5, 2013
6935c0d
[doc] updated section about node-htmlparser
fb55 Apr 5, 2013
8a91aac
renamed repository, 3.0.3
fb55 Apr 9, 2013
e7ad785
use DomUtils.getText in fetch, split getElements
fb55 Apr 10, 2013
6b995ab
[tokenizer] name states consistently
fb55 Apr 15, 2013
0b88170
[feedhandler] recursively walk the tree
fb55 Apr 15, 2013
b06cb29
[readme] small updates
fb55 Apr 15, 2013
e6f0199
[tokenizer] don't emit an "onopentagend" event for self-closing tags
fb55 Apr 15, 2013
a3a9954
[parser] fixed handling of self-closing tags
fb55 Apr 15, 2013
9d478ea
[tests] stream tests are run again
fb55 Apr 15, 2013
e612238
[tests/feeds] run rdf test again
fb55 Apr 15, 2013
3b821dc
[tests/stream] enabled xmlMode for RSS test
fb55 Apr 15, 2013
1bb92f7
[tests/stream] create a new handler for the second run
fb55 Apr 15, 2013
ae58e56
[tests/stream] added tests for the files in tests/Documents
fb55 Apr 15, 2013
83c75dc
3.0.4
fb55 Apr 15, 2013
e36f3d0
[parser] lowercase instruction names if lowerCaseTags option is set
fb55 Apr 15, 2013
61c5a80
3.0.5
fb55 Apr 15, 2013
d79b1b3
[tests/events] added test case for jsdom#368
fb55 Apr 15, 2013
1123da8
changed behavior for non-xml mode
fb55 May 18, 2013
357a825
[tests/events] updated tests to reflect latest changes
fb55 May 18, 2013
96c41b1
3.1.0
fb55 May 18, 2013
75fb1cf
Added missing void elements.
papandreou May 29, 2013
f58c1d3
Merge pull request #46 from One-com/missing_void_elements
fb55 May 30, 2013
7ca6d22
[tokenizer] text in special tags there looks like a tag ending
AndreasMadsen Jun 5, 2013
46d3b21
Merge pull request #48 from AndreasMadsen/script-in-script
fb55 Jun 5, 2013
02f12e2
[tokenizer] consume token again
fb55 Jun 5, 2013
6e1669f
[parser] still recognize other options in non-xml-mode
fb55 Jun 5, 2013
231a746
3.1.1
fb55 Jun 5, 2013
7ef5de8
[tokenizer] don't reset comment state in case of long endings
AndreasMadsen Jun 6, 2013
623cd89
Merge pull request #49 from AndreasMadsen/long-comment
fb55 Jun 6, 2013
e8dc84a
[Tokenizer] don't reset CDATA state in case of long endings
AndreasMadsen Jun 7, 2013
c88dd9a
Merge pull request #50 from AndreasMadsen/long-cdata-ending
fb55 Jun 7, 2013
a768e88
readme: added version badge
fb55 Jun 7, 2013
40a2339
[readme] added yet another badge (dependency versions)
fb55 Jun 7, 2013
8b390bd
[bench] added the hubbub & html-parser modules
fb55 Jun 9, 2013
dda8df2
3.1.2
fb55 Jun 9, 2013
7fd58aa
[Parser] open tags before close if never opened
AndreasMadsen Jun 10, 2013
694dea7
[Parser] implicit open only p and br tags
AndreasMadsen Jun 11, 2013
d64986c
Fix perf regression in the Tokenizer : avoid a concatenation
abarre Jun 13, 2013
a842129
Merge pull request #54 from abarre/master
fb55 Jun 13, 2013
0e320fc
Merge pull request #52 from AndreasMadsen/implicit-open
fb55 Jun 13, 2013
eade820
3.1.3
fb55 Jun 14, 2013
0ca2c1e
[parser] renamed emptyTags to voidElements, sorted them
fb55 Jun 14, 2013
26117ef
[parser] improved consistency & simplified
fb55 Jun 14, 2013
7932367
[tokenizer] simplified `end` logic
fb55 Jun 14, 2013
45d9067
[tokenizer] removed noop blocks in AFTER_{COMMENT,CDATA}_2
fb55 Jun 14, 2013
87c6f2b
[tokenizer] use `continue` instead of decreasing the index
fb55 Jun 14, 2013
7608c11
[bench] removed unnecessary noop functions
fb55 Jun 14, 2013
d00b391
[tokenizer] improved handling of remaining data
fb55 Jun 14, 2013
863183a
[readme] it~~'~~s
fb55 Jun 15, 2013
77bf0ae
Add parseDOM and parseFeed helper methods
ForbesLindesay Jun 20, 2013
740bbe9
Merge pull request #55 from ForbesLindesay/patch-1
fb55 Jun 20, 2013
16aef00
Add link to live demo
ForbesLindesay Jun 20, 2013
288bb93
Merge pull request #56 from ForbesLindesay/patch-1
fb55 Jun 20, 2013
b00177f
[parser] default options & cbs to empty objects
fb55 Jun 23, 2013
529f727
3.1.4
fb55 Jun 23, 2013
9f54942
[tokenizer] fix case where `<` followed by whitespace doesn't parse c…
Jul 19, 2013
d3c1fcd
Merge pull request #58 from xcoderzach/master
fb55 Jul 20, 2013
830c157
3.1.5
fb55 Jul 21, 2013
a6b6865
[parser] don't overwrite attribute values on second occurence
fb55 Jul 21, 2013
4d56157
[readme] behavior of example changed due to #58
fb55 Jul 21, 2013
ca311d4
Add .gitignore
ForbesLindesay Jul 21, 2013
909a3f1
Add .gitattributes so tests still work on windows
ForbesLindesay Jul 21, 2013
f6f93ef
Normalize line endings
ForbesLindesay Jul 21, 2013
263775f
[tokenizer] recognize the form field (U+0C), drop the carriage return…
fb55 Jul 30, 2013
f8ddbe6
[Tokenizer] move if context to methods allowing .write to be optimized
AndreasMadsen Aug 1, 2013
9ab0b0e
Merge pull request #61 from AndreasMadsen/optimize
fb55 Aug 2, 2013
0219e3a
[tokenizer] don't save the options object
fb55 Aug 2, 2013
2aae96f
[tokenizer] use ternary expressions for simple states
fb55 Aug 2, 2013
f6e21dd
[tokenizer] added variables for states of _special
fb55 Aug 2, 2013
f3fb8d7
[tokenizer] fixed whitespace
fb55 Aug 2, 2013
bf0eaa4
[tokenizer] more ternaries
fb55 Aug 2, 2013
57eb985
[tokenizer] simplified _cleanup a bit
fb55 Aug 2, 2013
917ecf0
[tokenizer] united some branches
fb55 Aug 2, 2013
7f9082c
[tokenizer] get rid of _reconsume
fb55 Aug 2, 2013
4bc1ec4
[tokenizer] even more ternaries
fb55 Aug 2, 2013
24bbf86
[tokenizer] added abstractions for common state types, fixed previous…
fb55 Aug 2, 2013
ce87df1
[tokenizer] added _getSection, completely inlined _emitIfToken, partl…
fb55 Aug 2, 2013
607c81a
[tokenizer] simplified _stateInTagName
fb55 Aug 2, 2013
5b8955a
[tokenizer] simplified _stateInAttributeValueNoQuotes, reordered _sta…
fb55 Aug 2, 2013
bd63b0b
3.1.6
fb55 Aug 2, 2013
4589ecd
[tests] added test for second occurance of same attribute
fb55 Aug 2, 2013
9eea898
[tokenizer] started adding support for HTML entities
fb55 Aug 2, 2013
fac2449
[tokenizer] corrected decoding of numeric entities
fb55 Aug 2, 2013
e485fb2
[tokenizer] numeric entities are now decoded
fb55 Aug 2, 2013
a6fb99e
[tests] added test case for numeric entities
fb55 Aug 2, 2013
bcd00ed
Update link to demo
ForbesLindesay Aug 6, 2013
c2db3df
Add startIndex and endIndex positional attributes to the parser
Aug 7, 2013
6330226
Merge pull request #63 from fasterize/parser_positions
fb55 Aug 16, 2013
b70b28d
[tokenizer] renamed the self-closing tags state, moved it to its own …
fb55 Aug 16, 2013
ad1d8f0
[tokenizer] commented out support for entities in attributes
fb55 Aug 16, 2013
ab8926e
[readme] updated benchmark results
fb55 Aug 16, 2013
e5197b3
[bench] removed internal benchmarks
fb55 Aug 16, 2013
bc193a6
[parser] fixed whitespace
fb55 Aug 16, 2013
2221630
[parser] moved common logic to _updatePosition function
fb55 Aug 17, 2013
d26e087
[tokenizer] renamed IN_ATTRIBUTE_NAME_* states, improved formatting
fb55 Aug 17, 2013
163a4ce
[tokenizer] re-added the carriage return as whitespace
fb55 Aug 17, 2013
ea26f0e
[tokenizer] fixed handling of unparsed data in end(), added support f…
fb55 Aug 18, 2013
3a92796
[entities] added maps for normal & legacy entities
fb55 Aug 18, 2013
ba3c1c7
[tokenizer] added support for decoding HTML entities in `ontext` events
fb55 Aug 18, 2013
e9a8496
[tests] added test cases for decoding legacy & named entities
fb55 Aug 18, 2013
927a9e9
[entities] added map for XML entities
fb55 Aug 18, 2013
7adb053
[tokenizer] added support for XML entities
fb55 Aug 18, 2013
b60cf04
[tests] also test trailing data support in the numeric entity test
fb55 Aug 18, 2013
e45e4ec
[tokenizer] fixed handling non-existent entities
fb55 Aug 18, 2013
12edc94
[tests] added test case for XML entities
fb55 Aug 18, 2013
271dee2
[tokenizer] added _emitEntity
fb55 Aug 18, 2013
076fcf7
3.2.0
fb55 Aug 18, 2013
f46765d
[tokenizer] moved decodeMap to entities/decode.json
fb55 Aug 18, 2013
389102d
[tokenizer] renamed _emitEntity to _emitPartial
fb55 Aug 18, 2013
6ca87ff
[index] statically export Parser, Tokenizer and DomHandler
fb55 Aug 18, 2013
1c8600b
[parser] use String#search and String#substr instead of String#split
fb55 Aug 18, 2013
e3a75dd
[parser] added onattribdata and onattribend events, dropped onattribv…
fb55 Aug 18, 2013
8494b03
[tokenizer] enable support for decoding entities in attributes, added…
fb55 Aug 18, 2013
feafd9d
[tests] added test case for entities in attributes
fb55 Aug 18, 2013
311e48e
3.2.1
fb55 Aug 18, 2013
e2fa485
[tokenizer] don't decode entities in special tags
fb55 Aug 18, 2013
36ee76e
3.2.2
fb55 Aug 18, 2013
cce466c
[tokenizer] reintroduced _special, removed IN_SCRIPT and IN_STYLE
fb55 Aug 18, 2013
effc3a9
3.2.3
fb55 Aug 18, 2013
e4fb613
only respect self-closing tags in XML mode
fb55 Aug 21, 2013
80a1ecb
[parser] properly removed self-closing tag support
fb55 Aug 22, 2013
0347cd7
[tests] read files in the tests file, improved os interoperability of…
fb55 Aug 22, 2013
be0dafa
[tests] added helper.getCallback method
fb55 Aug 22, 2013
b948e86
[tests] converted tests to mocha
fb55 Aug 25, 2013
8737bf1
[tests] renamed tests dir to `test`
fb55 Aug 25, 2013
96a00fb
[package] run mocha as the test script
fb55 Aug 25, 2013
41ad914
Delete .DS_Store
fb55 Aug 26, 2013
fc22b7d
[tokenizer] emit `onattribdata` in `_handleTrailingData`
fb55 Aug 28, 2013
336af9b
[tests] simplifications
fb55 Aug 26, 2013
fc0918c
3.2.4
fb55 Aug 29, 2013
7b1e4c9
[readme] updated performance characteristics
fb55 Aug 30, 2013
76643d3
[tokenizer] handle `<<` correctly
fb55 Aug 30, 2013
2f24491
3.2.5
fb55 Aug 30, 2013
834d6d2
[tests] added test case for MatthewMueller/cheerio#247
fb55 Aug 30, 2013
994cfda
update to [email protected], updated FeedHandler accordingly, bump
fb55 Sep 4, 2013
11eba28
[tests] write only single characters for testing chunked data
fb55 Sep 4, 2013
029c565
[package] require [email protected]
fb55 Oct 20, 2013
e6418c2
package: update readable-stream
fb55 Nov 22, 2013
0e5775c
package: use simple `license` field
fb55 Nov 22, 2013
2c568d3
replace non-breaking space with regular space
fb55 Nov 26, 2013
c9d4abe
index: pass `options` argument to constructors
fb55 Dec 10, 2013
298546c
tests: remove unused `cb` argument
fb55 Dec 10, 2013
f9bc72f
feedhandler: wrap assignments
fb55 Dec 10, 2013
5f244df
tests: changed indentation to tabs
fb55 Dec 10, 2013
7153b27
package: updated dom module versions, 3.4.0
fb55 Dec 12, 2013
2 changes: 2 additions & 0 deletions .gitattributes
@@ -0,0 +1,2 @@
# Auto detect text files and perform LF normalization
* text eol=lf
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
npm-debug.log
node_modules
5 changes: 5 additions & 0 deletions .travis.yml
@@ -0,0 +1,5 @@
language: node_js
node_js:
- 0.8
- 0.10
- 0.11
38 changes: 0 additions & 38 deletions CHANGELOG

This file was deleted.

251 changes: 73 additions & 178 deletions README.md
@@ -1,186 +1,81 @@
#NodeHtmlParser
A forgiving HTML/XML/RSS parser written in JS for both the browser and NodeJS (yes, despite the name it works just fine in any modern browser). The parser can handle streams (chunked data) and supports custom handlers for writing custom DOMs/output.
#htmlparser2 [![NPM version](https://badge.fury.io/js/htmlparser2.png)](https://npmjs.org/package/htmlparser2) [![Build Status](https://secure.travis-ci.org/fb55/htmlparser2.png)](http://travis-ci.org/fb55/htmlparser2) [![Dependency Status](https://david-dm.org/fb55/htmlparser2.png)](https://david-dm.org/fb55/htmlparser2)

##Installing
A forgiving HTML/XML/RSS parser written in JS for NodeJS. The parser can handle streams (chunked data) and supports custom handlers for writing custom DOMs/output.

npm install htmlparser

##Running Tests

###Run tests under node:
node runtests.js

###Run tests in browser:
View runtests.html in any browser

##Usage In Node
var htmlparser = require("htmlparser");
var rawHtml = "Xyz <script language= javascript>var foo = '<<bar>>';< / script><!--<!-- Waah! -- -->";
var handler = new htmlparser.DefaultHandler(function (error, dom) {
if (error)
[...do something for errors...]
else
[...parsing done, do something...]
});
var parser = new htmlparser.Parser(handler);
parser.parseComplete(rawHtml);
sys.puts(sys.inspect(handler.dom, false, null));

##Usage In Browser
var handler = new Tautologistics.NodeHtmlParser.DefaultHandler(function (error, dom) {
if (error)
[...do something for errors...]
else
[...parsing done, do something...]
});
var parser = new Tautologistics.NodeHtmlParser.Parser(handler);
parser.parseComplete(document.body.innerHTML);
alert(JSON.stringify(handler.dom, null, 2));

##Example output
[ { raw: 'Xyz ', data: 'Xyz ', type: 'text' }
, { raw: 'script language= javascript'
, data: 'script language= javascript'
, type: 'script'
, name: 'script'
, attribs: { language: 'javascript' }
, children:
[ { raw: 'var foo = \'<bar>\';<'
, data: 'var foo = \'<bar>\';<'
, type: 'text'
}
]
}
, { raw: '<!-- Waah! -- '
, data: '<!-- Waah! -- '
, type: 'comment'
}
]

##Streaming To Parser
while (...) {
...
parser.parseChunk(chunk);
}
parser.done();
##Installing
npm install htmlparser2

A live demo of htmlparser2 is available at http://demos.forbeslindesay.co.uk/htmlparser2/

##Usage

```javascript
var htmlparser = require("htmlparser2");
var parser = new htmlparser.Parser({
	onopentag: function(name, attribs){
		if(name === "script" && attribs.type === "text/javascript"){
			console.log("JS! Hooray!");
		}
	},
	ontext: function(text){
		console.log("-->", text);
	},
	onclosetag: function(tagname){
		if(tagname === "script"){
			console.log("That's it?!");
		}
	}
});
parser.write("Xyz <script type='text/javascript'>var foo = '<<bar>>';</ script>");
parser.end();
```

##Parsing RSS/Atom Feeds
Output (simplified):

new htmlparser.RssHandler(function (error, dom) {
...
});
```javascript
--> Xyz
JS! Hooray!
--> var foo = '<<bar>>';
That's it?!
```

##DefaultHandler Options
Read more about the parser in the [wiki](https://github.com/fb55/htmlparser2/wiki/Parser-options).
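
Because the parser handles chunked data, the input does not have to arrive in a single `write` call. The following is a minimal sketch of that pattern, assuming only the `write` and `end` methods shown in the usage example. The `feedInChunks` helper and the `RecordingParser` stand-in are hypothetical names introduced for illustration; the stand-in records chunks and does no parsing, so the snippet runs without any dependencies.

```javascript
// Hypothetical helper: feed a string to any object exposing write() and end(),
// a few characters at a time, the way data might arrive from a socket.
function feedInChunks(parser, text, chunkSize) {
	for (var i = 0; i < text.length; i += chunkSize) {
		parser.write(text.substr(i, chunkSize));
	}
	parser.end();
}

// Stand-in for a real parser so the sketch runs on its own.
// It only records what it receives; it does not parse anything.
function RecordingParser() {
	this.chunks = [];
	this.ended = false;
}
RecordingParser.prototype.write = function (chunk) {
	this.chunks.push(chunk);
};
RecordingParser.prototype.end = function () {
	this.ended = true;
};

var recorder = new RecordingParser();
feedInChunks(recorder, "<p>Hello</p>", 5);
console.log(recorder.chunks); // three chunks: "<p>He", "llo</", "p>"
```

With the real module, an `htmlparser.Parser` instance would presumably take the place of `recorder`, and its callbacks would fire as the chunks come in.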

###Usage
var handler = new htmlparser.DefaultHandler(
function (error) { ... }
, { verbose: false, ignoreWhitespace: true }
);

###Option: ignoreWhitespace
Indicates whether the DOM should exclude text nodes that consist solely of whitespace. The default value is "false".

####Example: true
The following HTML:
<font>
<br>this is the text
<font>
becomes:
[ { raw: 'font'
, data: 'font'
, type: 'tag'
, name: 'font'
, children:
[ { raw: 'br', data: 'br', type: 'tag', name: 'br' }
, { raw: 'this is the text\n'
, data: 'this is the text\n'
, type: 'text'
}
, { raw: 'font', data: 'font', type: 'tag', name: 'font' }
]
}
]

####Example: false
The following HTML:
<font>
<br>this is the text
<font>
becomes:
[ { raw: 'font'
, data: 'font'
, type: 'tag'
, name: 'font'
, children:
[ { raw: '\n\t', data: '\n\t', type: 'text' }
, { raw: 'br', data: 'br', type: 'tag', name: 'br' }
, { raw: 'this is the text\n'
, data: 'this is the text\n'
, type: 'text'
}
, { raw: 'font', data: 'font', type: 'tag', name: 'font' }
]
}
]

###Option: verbose
Indicates whether to include extra information on each node in the DOM. This information consists of the "raw" attribute (original, unparsed text found between "<" and ">") and the "data" attribute on "tag", "script", and "comment" nodes. The default value is "true".

####Example: true
The following HTML:
<a href="test.html">xxx</a>
becomes:
[ { raw: 'a href="test.html"'
, data: 'a href="test.html"'
, type: 'tag'
, name: 'a'
, attribs: { href: 'test.html' }
, children: [ { raw: 'xxx', data: 'xxx', type: 'text' } ]
}
]

####Example: false
The following HTML:
<a href="test.html">xxx</a>
becomes:
[ { type: 'tag'
, name: 'a'
, attribs: { href: 'test.html' }
, children: [ { data: 'xxx', type: 'text' } ]
}
]

###Option: enforceEmptyTags
Indicates whether the DOM should prevent children on tags marked as empty in the HTML spec. Typically this should be set to "true" for HTML parsing and "false" for XML parsing. The default value is "true".

####Example: true
The following HTML:
<link>text</link>
becomes:
[ { raw: 'link', data: 'link', type: 'tag', name: 'link' }
, { raw: 'text', data: 'text', type: 'text' }
]

####Example: false
The following HTML:
<link>text</link>
becomes:
[ { raw: 'link'
, data: 'link'
, type: 'tag'
, name: 'link'
, children: [ { raw: 'text', data: 'text', type: 'text' } ]
}
]

##DomUtils

###TBD (see utils_example.js for now)

##Related Projects

Looking for CSS selectors to search the DOM? Try Node-SoupSelect, a port of SoupSelect to NodeJS: http://github.com/harryf/node-soupselect

There's also a port of hpricot to NodeJS that uses HtmlParser for HTML parsing: http://github.com/silentrob/Apricot
##Get a DOM
The `DomHandler` (known as `DefaultHandler` in the original `htmlparser` module) produces a DOM (document object model) that can be manipulated using the [`DomUtils`](https://github.com/fb55/DomUtils) helper.

The `DomHandler`, while still bundled with this module, was moved to its [own module](https://github.com/fb55/domhandler). Have a look at it for further information.
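
To give a feel for how a handler can turn parser callbacks into a tree, here is a minimal sketch. It is not the `DomHandler` implementation: `TreeHandler` and its node shape are hypothetical, and the events are fired by hand instead of by a real parser so that the snippet has no dependencies. Only the callback names (`onopentag`, `ontext`, `onclosetag`) are taken from the usage example.

```javascript
// Hypothetical handler: builds a nested tree from open/text/close events.
// The node shape is an assumption made for this sketch and is not
// DomHandler's actual output format.
function TreeHandler() {
	this.root = [];   // top-level nodes
	this._stack = []; // currently open elements, innermost last
}
TreeHandler.prototype._add = function (node) {
	var parent = this._stack[this._stack.length - 1];
	(parent ? parent.children : this.root).push(node);
};
TreeHandler.prototype.onopentag = function (name, attribs) {
	var node = { type: "tag", name: name, attribs: attribs, children: [] };
	this._add(node);
	this._stack.push(node);
};
TreeHandler.prototype.ontext = function (text) {
	this._add({ type: "text", data: text });
};
TreeHandler.prototype.onclosetag = function () {
	this._stack.pop();
};

// Drive the handler by hand with the events a parser would emit for
// <a href="test.html">xxx</a>
var tree = new TreeHandler();
tree.onopentag("a", { href: "test.html" });
tree.ontext("xxx");
tree.onclosetag("a");
console.log(JSON.stringify(tree.root, null, 2));
```

The stack is the key design choice: each open tag becomes the parent of everything emitted until its matching close tag pops it off again.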

##Parsing RSS/RDF/Atom Feeds

```javascript
new htmlparser.FeedHandler(function(<error> error, <object> feed){
...
});
```

##Performance

After having some artificial benchmarks for some time, __@AndreasMadsen__ published his [`htmlparser-benchmark`](https://github.com/AndreasMadsen/htmlparser-benchmark), which benchmarks HTML parsers based on real-world websites.

At the time of writing, the latest versions of all supported parsers show the following performance characteristics on [Travis CI](https://travis-ci.org/AndreasMadsen/htmlparser-benchmark/builds/10805007) (please note that Travis doesn't guarantee equal conditions for all tests):

```
gumbo-parser : 34.9208 ms/file ± 21.4238
html-parser : 24.8224 ms/file ± 15.8703
html5 : 419.597 ms/file ± 264.265
htmlparser : 60.0722 ms/file ± 384.844
htmlparser2-dom: 12.0749 ms/file ± 6.49474
htmlparser2 : 7.49130 ms/file ± 5.74368
hubbub : 30.4980 ms/file ± 16.4682
libxmljs : 14.1338 ms/file ± 18.6541
parse5 : 22.0439 ms/file ± 15.3743
sax : 49.6513 ms/file ± 26.6032
```

##How is this different from [node-htmlparser](https://github.com/tautologistics/node-htmlparser)?
This is a fork of the `htmlparser` module. The main difference is that this is intended to be used only with node (it runs on other platforms using [browserify](https://github.com/substack/node-browserify)). `htmlparser2` was rewritten multiple times and, while it maintains an API that's compatible with `htmlparser` in most cases, the projects don't share any code anymore.

The parser now provides a callback interface close to [sax.js](https://github.com/isaacs/sax-js) (originally targeted at [readabilitySAX](https://github.com/fb55/readabilitysax)). As a result, old handlers won't work anymore.

The `DefaultHandler` and the `RssHandler` were renamed to clarify their purpose (to `DomHandler` and `FeedHandler`). The old names are still available when requiring `htmlparser2`, so your code should work as expected.
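
One plausible way to keep old names working after a rename is to export the same constructor under both names. The sketch below illustrates that idea only; it is an assumption about the approach, not a copy of the module's `index.js`. The constructors are empty stand-ins and `exportsSketch` is a hypothetical name.

```javascript
// Hypothetical stand-ins for the real handler constructors.
function DomHandler(callback) { this.callback = callback; }
function FeedHandler(callback) { this.callback = callback; }

// Sketch of an exports object that keeps the old names as aliases.
var exportsSketch = {
	DomHandler: DomHandler,
	FeedHandler: FeedHandler,
	DefaultHandler: DomHandler, // old name, same constructor
	RssHandler: FeedHandler     // old name, same constructor
};

// Code written against the old name still gets the renamed handler.
var handler = new exportsSketch.DefaultHandler(function (error, dom) {});
console.log(handler instanceof exportsSketch.DomHandler); // true
```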