Subject: xml.clj + tagsoup
From: Chouser (chou@gmail.com)
Date: Feb 22, 2008 11:40:40 am
List: com.googlegroups.clojure

I seem to be more or less constantly writing HTML screen-scrapers, but I have yet to find a really nice way to do it. Maybe Clojure will be my salvation! With that as my goal, I tried to integrate TagSoup with Clojure's xml.clj, and it seems to work quite nicely.

Just replace xml.clj's parse function with:

(defn startparse-sax [s ch]
  (.. SAXParserFactory (newInstance) (newSAXParser) (parse s ch)))

(defn parse
  ([s] (parse s startparse-sax))
  ([s startparse]
    (binding [*stack* nil
              *current* (struct element)
              *state* :between
              *sb* nil]
      (startparse s content-handler)
      ((:content *current*) 0))))

Now (xml/parse "foo.xml") works as it did before, but you can plug in other parsers if you want. For TagSoup:

(defn startparse-tagsoup [s ch]
  (let [p (new org.ccil.cowan.tagsoup.Parser)]
    (. p (setContentHandler ch))
    (. p (parse s))))

(xml/parse "foo.html" startparse-tagsoup)
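
The hook isn't TagSoup-specific: anything that takes a source and a content handler should do. Here is a rough sketch that parses XML held in a string instead of a file, using only the JDK's SAX parser. It assumes a Clojure whose clojure.xml/parse accepts a startparse function as its second argument, as in the patch above; startparse-string is a name made up for this example.

```clojure
(require '[clojure.xml :as xml])
(import '(javax.xml.parsers SAXParserFactory)
        '(java.io StringReader)
        '(org.xml.sax InputSource)
        '(org.xml.sax.helpers DefaultHandler))

;; Hypothetical startparse function: treats s as the XML text itself
;; rather than as a file name or URI.
(defn startparse-string [^String s ^DefaultHandler ch]
  (.. (SAXParserFactory/newInstance)
      (newSAXParser)
      (parse (InputSource. (StringReader. s)) ch)))

;; The result has the same map shape as a normal xml/parse call.
(def parsed (xml/parse "<a><b>hi</b></a>" startparse-string))
```

With this, (:tag parsed) is :a and the text "hi" sits in the :content vector of the nested :b element.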

And you're off and running. Now all we need is a nice query language for the vector/map tree that gives you...
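
In the meantime, plain sequence functions get you part of the way. A minimal sketch, assuming the tree has the :tag / :attrs / :content map shape that xml/parse builds; the sample tree, its URLs, and elements-by-tag are invented for illustration and are not part of xml.clj.

```clojure
;; A literal tree in the shape xml/parse returns (sample data, not real output).
(def tree
  {:tag :html :attrs nil
   :content
   [{:tag :body :attrs nil
     :content
     [{:tag :a :attrs {:href "http://clojure.org"} :content ["Clojure"]}
      {:tag :p :attrs nil
       :content ["See "
                 {:tag :a :attrs {:href "http://example.com"}
                  :content ["this"]}]}]}]})

(defn elements-by-tag
  "Returns every element under tree (inclusive) whose :tag equals tag,
  in depth-first document order."
  [tree tag]
  (filter #(= tag (:tag %))
          ;; Strings have no :content, so they are treated as leaves.
          (tree-seq :content :content tree)))

;; Collect the href of every <a> element.
(def hrefs
  (map #(get-in % [:attrs :href]) (elements-by-tag tree :a)))
```

This is a walk over every node rather than a query language, but it covers the common "find all the links" case of a screen-scraper.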