| From | Sent On | Attachments |
|---|---|---|
| Chouser | Feb 22, 2008 11:40 am | |
| John Cowan | Feb 22, 2008 12:51 pm | |
| Subject: | xml.clj + tagsoup | |
|---|---|---|
| From: | Chouser (chou...@gmail.com) | |
| Date: | Feb 22, 2008 11:40:40 am | |
| List: | com.googlegroups.clojure | |
I seem to be more or less constantly writing HTML screen-scrapers, but I have yet to find a really nice way to do it. Maybe Clojure will be my salvation! With that as my goal, I tried to integrate TagSoup with Clojure's xml.clj, and it seems to work quite nicely.
Just replace xml.clj's parse function with:

    (defn startparse-sax [s ch]
      (.. SAXParserFactory (newInstance) (newSAXParser) (parse s ch)))

    (defn parse
      ([s] (parse s startparse-sax))
      ([s startparse]
        (binding [*stack* nil
                  *current* (struct element)
                  *state* :between
                  *sb* nil]
          (startparse s content-handler)
          ((:content *current*) 0))))

Now (xml/parse "foo.xml") works as it did before, but you can plug in other parsers if you want. For TagSoup:

    (defn startparse-tagsoup [s ch]
      (let [p (new org.ccil.cowan.tagsoup.Parser)]
        (. p (setContentHandler ch))
        (. p (parse s))))

    (xml/parse "foo.html" startparse-tagsoup)

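
The same plug-in point can be tried without TagSoup on the classpath. The following is only a sketch, assuming a current Clojure in which `clojure.xml/parse` accepts a startparse function as its second argument. The names `startparse-ns-aware` and `parse-string` are made up for illustration, and the namespace-aware factory is just an arbitrary example of a differently configured parser.

```clojure
(require '[clojure.xml :as xml])
(import '(javax.xml.parsers SAXParserFactory)
        '(java.io ByteArrayInputStream))

;; Hypothetical startparse function. It follows the same contract as
;; startparse-sax: take a source and a handler, then run the parse.
;; The only difference is that the factory is made namespace-aware.
(defn startparse-ns-aware [s ch]
  (let [factory (doto (SAXParserFactory/newInstance)
                  (.setNamespaceAware true))]
    (.parse (.newSAXParser factory) s ch)))

;; Hypothetical helper so the example does not need a file on disk.
(defn parse-string [^String text startparse]
  (xml/parse (ByteArrayInputStream. (.getBytes text "UTF-8")) startparse))

(def tree (parse-string "<a><b>hi</b></a>" startparse-ns-aware))

;; The result is the usual tree of :tag / :attrs / :content maps.
(println (:tag tree))
(println (:content (first (:content tree))))
```

Any function with this two-argument shape can be passed in, which is what makes the TagSoup version above a drop-in replacement.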
And you're off and running. Now all we need is a nice query language for the vector/map tree that gives you...
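
Until such a query language exists, the core `tree-seq` function already covers simple lookups. This is a rough sketch, not part of xml.clj: `page` is a hand-written stand-in for the kind of tree `xml/parse` returns, and `elements-by-tag` is a hypothetical helper name.

```clojure
;; Hand-written stand-in for a parsed document: nested maps with
;; :tag, :attrs and :content, where :content is a vector holding
;; child elements and strings.
(def page
  {:tag :html, :attrs nil,
   :content
   [{:tag :body, :attrs nil,
     :content
     [{:tag :a, :attrs {:href "http://clojure.org"}, :content ["Clojure"]}
      {:tag :p, :attrs nil,
       :content ["See "
                 {:tag :a, :attrs {:href "/faq"}, :content ["FAQ"]}]}]}]})

;; Hypothetical helper: walk the tree depth-first and keep every
;; element whose :tag matches. Strings are leaves, so only maps are
;; treated as branches.
(defn elements-by-tag [tree tag]
  (filter #(= tag (:tag %))
          (tree-seq map? :content tree)))

;; Collect the href of every link in document order.
(def hrefs
  (map #(get-in % [:attrs :href]) (elements-by-tag page :a)))

(println hrefs)
```

For a scraper, the same walk can be pointed at the result of `(xml/parse "foo.html" startparse-tagsoup)` instead of the literal `page`.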





