Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,21 +1,58 @@ | ||
This is a port of the ideas of Parsley to Javascript. There's decent comments in jquery.parsley.js. | ||
This is a port of the core ideas of Parsley from C to JavaScript and jQuery. Parsley is a domain-specific language for extracting content from HTML. This port adds two idioms to jQuery. | ||
|
||
Here's the yc parselet ( http://parselets.com/parselets/yc ) in | ||
idiomatic JavaScript. I'd like an opinion. | ||
The first addition is the extract() method. It transforms a jQuery object acting as a node list | ||
into a StringNodeList, which is a list of strings. | ||
|
||
var parselet = ({ | ||
For example, let's perform some extractions on the following HTML: | ||
|
||
<html><a href="/">Home</a><a href="http://google.com">Google</a></html> | ||
|
||
js> jQuery("a").extract() | ||
<StringNodeList[<Home(1)>, <Google(2)>]> | ||
js> jQuery("a").extract().simple() | ||
["Home", "Google"] | ||
|
||
You can also pass regexen, attributes, or arbitrary functions to extract(). | ||
|
||
js> jQuery("a").extract("@href").simple() | ||
["/", "http://google.com"] | ||
js> jQuery("a").extract(/[A-Z]/).simple() | ||
["H", "G"] | ||
js> jQuery("a").extract(function(node){ return "hi"; }).simple() | ||
["hi", "hi"] | ||
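The dispatch on argument type described above can be sketched without jQuery or a DOM. This is only an illustration, not the plugin's code: the `extractSketch` function and the plain `{ text, attrs }` objects standing in for DOM nodes are assumptions made for the example, and it returns a plain array rather than a StringNodeList.

```javascript
// Hypothetical stand-ins for DOM nodes: just text content and attributes.
const sampleNodes = [
  { text: "Home", attrs: { href: "/" } },
  { text: "Google", attrs: { href: "http://google.com" } },
];

// Sketch of an extract()-style dispatcher (assumed name, not the real API).
// spec may be undefined, a "@attribute" string, a RegExp, or a function.
function extractSketch(nodeList, spec) {
  return nodeList.map((node) => {
    if (spec === undefined) {
      return node.text; // default: the node's text
    }
    if (typeof spec === "function") {
      return spec(node); // arbitrary callback
    }
    if (spec instanceof RegExp) {
      const match = node.text.match(spec); // first match only
      return match ? match[0] : null;
    }
    if (typeof spec === "string" && spec.startsWith("@")) {
      const value = node.attrs[spec.slice(1)]; // attribute lookup
      return value === undefined ? null : value;
    }
    throw new TypeError("unsupported extract spec: " + String(spec));
  });
}

console.log(extractSketch(sampleNodes));
console.log(extractSketch(sampleNodes, "@href"));
console.log(extractSketch(sampleNodes, /[A-Z]/));
console.log(extractSketch(sampleNodes, () => "hi"));
```

Returning null for a missing attribute or a failed match is a choice made for this sketch; the README does not say what the real extract() does in those cases.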
|
||
The second idiom is auto-grouping. Individual extractions can be grouped into a larger data structure (called a parselet), which will parse the page in an intelligent way. Here's the naive ungrouped way to use extract: | ||
|
||
js> var parselet = { | ||
links: [{ | ||
text: $("a").extract().simple(), | ||
href: $("a").extract("@href").simple() | ||
}] | ||
}; | ||
{ links: [{ text: ["Home", "Google"], href: ["/", "http://google.com"] }] } | ||
|
||
Now, let's add grouping by calling pQuery.extractAndGroup() to transform the data structure into something more convenient. extractAndGroup() will automatically call extract() and simple() as necessary, so this time we'll omit them. | ||
|
||
js> pQuery.extractAndGroup({ | ||
links: [{ | ||
text: $("a"), | ||
href: $("a").extract("@href") | ||
}] | ||
}); | ||
{ links: [{ text: "Home", href: "/"}, {text: "Google", href: "http://google.com"}]} | ||
|
||
Now the links array has two objects, each representing one link. This is a much better representation of the data. | ||
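To make the difference between the two shapes concrete, here is a minimal sketch that turns the ungrouped "one array per key" result into one object per link. It pairs values purely by position, which is an assumption; the real extractAndGroup() presumably uses the document structure to decide what belongs together. The name `groupByPosition` is made up for this example.

```javascript
// Turn { text: [...], href: [...] } into [{ text, href }, ...] by index.
// Positional pairing is a simplification assumed for this sketch.
function groupByPosition(columns) {
  const keys = Object.keys(columns);
  const length = Math.max(0, ...keys.map((key) => columns[key].length));
  const rows = [];
  for (let i = 0; i < length; i++) {
    const row = {};
    for (const key of keys) {
      // Pad shorter columns with null so every row has every key.
      row[key] = i < columns[key].length ? columns[key][i] : null;
    }
    rows.push(row);
  }
  return rows;
}

const ungrouped = {
  text: ["Home", "Google"],
  href: ["/", "http://google.com"],
};
const grouped = { links: groupByPosition(ungrouped) };
console.log(JSON.stringify(grouped));
```

Positional pairing breaks as soon as one item on the page is missing a field, since every later value shifts up by one. That is the kind of case a structure-aware grouping is meant to handle.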
|
||
The goal here is to create a crawler that takes the inner {links: ...} object as input, and from that generates a JSON or CSV representation of an entire website. | ||
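The crawler itself does not exist yet, but the output step is easy to picture. Below is a rough sketch of serializing grouped results to JSON and CSV. The `toCsv` and `csvField` helpers are hypothetical names for this example, and the quoting follows the usual CSV convention of doubling embedded quotes.

```javascript
// Quote a single CSV field when it contains a comma, quote, or newline.
function csvField(value) {
  const text = value === null || value === undefined ? "" : String(value);
  return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// Serialize an array of flat objects, using the given column order.
function toCsv(rows, columns) {
  const header = columns.map(csvField).join(",");
  const lines = rows.map((row) =>
    columns.map((column) => csvField(row[column])).join(",")
  );
  return [header, ...lines].join("\n");
}

// Grouped output in the shape shown earlier in this README.
const result = {
  links: [
    { text: "Home", href: "/" },
    { text: "Google", href: "http://google.com" },
  ],
};

console.log(JSON.stringify(result, null, 2)); // JSON representation
console.log(toCsv(result.links, ["text", "href"])); // CSV representation
```

CSV only fits flat records, so a nested parselet would need one file per array or some flattening rule. JSON can carry the grouped structure as is.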
|
||
Here's an example parselet that gets a list of stories from http://news.ycombinator.com. | ||
|
||
{ | ||
articles: [{ | ||
title: $(".title a"), | ||
title_verbose: $(".title a").extract(function(node){ | ||
// This function callback does the same thing as the | ||
// default handler. It's just here for the example, to show | ||
// how to inject arbitrary logic. | ||
return $(node).text().normalizeSpace(); | ||
}), | ||
link: $(".title a").extract("@href"), | ||
comment_count: $(".subtext a:nth-child(3)").extract(/[0-9]+/).optional(), | ||
comment_link: $(".subtext a:nth-child(3)").extract("@href"), | ||
comment_count: $(".subtext a:nth-child(3)").extract(/[0-9]+/), | ||
comment_link: $(".subtext a:nth-child(3)").extract("@href"), | ||
points: $(".subtext span").extract(/[0-9]+/) | ||
}], | ||
next: $(".title:nth-child(2) a").extract("@href") | ||
}) | ||
} |