Add NodeJS compatibility, use getRawTextContent #33

Merged: 11 commits, Nov 20, 2022
23 changes: 21 additions & 2 deletions README.md
@@ -194,7 +194,7 @@ user=> (-> (s/select (s/descendant (s/class "subModule")
(s/tag :a))
site-htree)
first :content first string/trim)
"Sebastian Vettel"
"Sebastian Vettel"
```

Our fears are confirmed, Sebastian Vettel is well on his way to a fourth consecutive championship. If you were to inspect the page by hand (as of around May 2013, at least), you would see that unlike the `child` selector we used in the example above, the `descendant` selector allows the argument selectors to skip stages in the tree; we've left out some elements in this descendant relationship. The first table row in the driver standings table is selected with the `and`, `tag` and `first-child` selectors, and then the second `td` element is chosen, which is the element that has the driver's name (the first table element has the driver's standing) inside an `A` element. All of this is dependent on the exact layout of the HTML in the site we are examining, of course, but it should give an idea of how you can combine selectors to reach into a specific node of an HTML document very easily.
@@ -257,7 +257,26 @@ to your project.clj, or an equivalent entry for your Maven-compatible build tool

## ClojureScript support

Hickory expects a DOM implementation and thus won't work out of the box on node. On browsers it works for IE9+ (you can find a workaround for IE9 [here](http://stackoverflow.com/questions/9250545/javascript-domparser-access-innerhtml-and-other-properties)).
Hickory works for all web browsers IE9+ (you can find a workaround for IE9 [here](http://stackoverflow.com/questions/9250545/javascript-domparser-access-innerhtml-and-other-properties)).

## Nodejs support

To parse markup on Nodejs, Hickory requires a Node DOM implementation.
Several are available from [npm](https://www.npmjs.com).
Install the npm package or use [lein-npm](https://github.com/RyanMcG/lein-npm).
Here are some alternatives:

- [jsdom](https://www.npmjs.com/package/jsdom) - **Caution:** this will not work if you're using figwheel

```clojure
(set! js/document (.jsdom (cljs.nodejs/require "jsdom")))
```

- [xmldom](https://www.npmjs.com/package/xmldom)

```clojure
(set! js/DOMParser (.-DOMParser (cljs.nodejs/require "xmldom")))
```

## Changes

42 changes: 16 additions & 26 deletions src/cljs/hickory/core.cljs
@@ -1,7 +1,8 @@
(ns hickory.core
(:require [hickory.utils :as utils]
[clojure.zip :as zip]
[goog.string :as gstring]))
[goog.string :as gstring]
[goog.dom :as dom]))

;;
;; Protocols
@@ -34,7 +35,7 @@

(defn node-type
[type]
(aget js/Node (str type "_NODE")))
(aget dom/NodeType type))

(def Attribute (node-type "ATTRIBUTE"))
(def Comment (node-type "COMMENT"))
@@ -43,19 +44,8 @@
(def Element (node-type "ELEMENT"))
(def Text (node-type "TEXT"))

(defn extend-type-with-seqable
[t]
(extend-type t
ISeqable
(-seq [array] (array-seq array))))

(extend-type-with-seqable js/NodeList)

(when (exists? js/NamedNodeMap)
(extend-type-with-seqable js/NamedNodeMap))

(when (exists? js/MozNamedAttrMap) ;;NamedNodeMap has been renamed on modern gecko implementations (see https://developer.mozilla.org/en-US/docs/Web/API/NamedNodeMap)
(extend-type-with-seqable js/MozNamedAttrMap))
(defn- as-seq [nodelist]
(if (seq? nodelist) nodelist (array-seq nodelist)))

(defn format-doctype
[dt]
@@ -72,7 +62,7 @@
Attribute [(utils/lower-case-keyword (aget this "name"))
(aget this "value")]
Comment (str "<!--" (aget this "data") "-->")
Document (map as-hiccup (aget this "childNodes"))
Document (map as-hiccup (as-seq (aget this "childNodes")))
DocumentType (format-doctype this)
;; There is an issue with the hiccup format, which is that it
;; can't quite cover all the pieces of HTML, so anything it
@@ -88,11 +78,11 @@
;; unescapable nodes.
Element (let [tag (utils/lower-case-keyword (aget this "tagName"))]
(into [] (concat [tag
(into {} (map as-hiccup (aget this "attributes")))]
(into {} (map as-hiccup (as-seq (aget this "attributes"))))]
(if (utils/unescapable-content tag)
(map #(aget % "wholeText") (aget this "childNodes"))
(map as-hiccup (aget this "childNodes"))))))
Text (utils/html-escape (aget this "wholeText")))))
(map dom/getRawTextContent (as-seq (aget this "childNodes")))
(map as-hiccup (as-seq (aget this "childNodes")))))))
Text (utils/html-escape (dom/getRawTextContent this)))))

(extend-protocol HickoryRepresentable
object
@@ -103,18 +93,18 @@
Document {:type :document
:content (not-empty
(into [] (map as-hickory
(aget this "childNodes"))))}
(as-seq (aget this "childNodes")))))}
DocumentType {:type :document-type
:attrs {:name (aget this "name")
:publicid (aget this "publicId")
:systemid (aget this "systemId")}}
Element {:type :element
:attrs (not-empty (into {} (map as-hickory (aget this "attributes"))))
:attrs (not-empty (into {} (map as-hickory (as-seq (aget this "attributes")))))
:tag (utils/lower-case-keyword (aget this "tagName"))
:content (not-empty
(into [] (map as-hickory
(aget this "childNodes"))))}
Text (aget this "wholeText"))))
(as-seq (aget this "childNodes")))))}
Text (dom/getRawTextContent this))))

(defn extract-doctype
[s]
@@ -140,7 +130,7 @@
doctype-el (aget doc "doctype")]
(when-not (extract-doctype s);; Remove default doctype if parsed string does not define it.
(remove-el doctype-el))
(when-let [title-el (first (aget doc "head" "childNodes"))];; Remove default title if parsed string does not define it.
(when-let [title-el (aget doc "head" "firstChild")];; Remove default title if parsed string does not define it.
(when (empty? (aget title-el "text"))
(remove-el title-el)))
(.write doc s)
@@ -157,4 +147,4 @@
in the tag hierarchy under <body>) into a list of DOM elements that can
each be passed as input to as-hiccup or as-hickory."
[s]
(aget (parse s) "body" "childNodes"))
(as-seq (aget (parse s) "body" "childNodes")))
18 changes: 10 additions & 8 deletions test/cljc/hickory/test/core.cljc
@@ -15,6 +15,7 @@
[:a {:id "so", :href "bar"} "bar"]
[:script {:src "blah.js"} "alert(\"hi\");"]]]]
(as-hiccup (parse "<!DOCTYPE html><a href=\"foo\">foo</a> <a id=\"so\" href=\"bar\">bar</a><script src=\"blah.js\">alert(\"hi\");</script>"))))

(is (= {:type :document,
:content [{:type :document-type,
:attrs {:name "html", :publicid "", :systemid ""}}
@@ -47,14 +48,15 @@
;; and cdata nodes.
(deftest basic-documents2
(is (= ["<!DOCTYPE html>"
[:html {}
[:head {}]
[:body {}
"<!--comment-->"
[:a {:href "foo"} "foo"] " "
[:a {:id "so", :href "bar"} "bar"]
[:script {:src "blah.js"} "alert(\"hi\");"]]]]
(as-hiccup (parse "<!DOCTYPE html><body><!--comment--><a href=\"foo\">foo</a> <a id=\"so\" href=\"bar\">bar</a><script src=\"blah.js\">alert(\"hi\");</script></body>"))))
[:html {}
[:head {}]
[:body {}
"<!--comment-->"
[:a {:href "foo"} "foo"] " "
[:a {:id "so", :href "bar"} "bar"]
[:script {:src "blah.js"} "alert(\"hi\");"]]]]
(as-hiccup (parse "<!DOCTYPE html><body><!--comment--><a href=\"foo\">foo</a> <a id=\"so\" href=\"bar\">bar</a><script src=\"blah.js\">alert(\"hi\");</script></body>"))))

(is (= {:type :document,
:content [{:type :document-type,
:attrs {:name "html", :publicid "", :systemid ""}}