Greg Heartsfield home

Parsing feeds with Clojure

First, let me say thanks to ThoughtWorks Dallas and Paul Hammant for organizing Geeknight Dallas, Cosmin Stejerean for the idea and assistance with Clojure, and Richard Wiltshire for being my coding partner at the meetup.

This post introduces a small feed parsing utility for Clojure. The intent was to produce something similar to Python’s inimitable Universal Feed Parser (feedparser). There are a number of RSS/Atom parsing libraries for Java, but most have interfaces that fall short of feedparser’s, and all would have been clumsier to use from Clojure than necessary.

We evaluated several Java feed parsing libraries, and ROME came out on top. It has limited dependencies (just JDOM), supports a wide variety of feed types, and presents a simple and uniform interface for accessing different feed types. Unfortunately, unlike feedparser, it does not perform HTML sanitization.

To get started using feedparser-clj, download and install the library from the GitHub repository:

git clone git://github.com/scsibug/feedparser-clj.git
cd feedparser-clj
lein install

Once the library is installed and on your classpath, import it into a REPL session:

user=> (ns user (:use feedparser-clj.core)
           (:require [clojure.contrib.string :as string]))

Retrieve and parse a feed with parse-feed and a URL.

user=> (def f (parse-feed "http://gregheartsfield.com/atom.xml"))

f is now a map that can be accessed by key to retrieve feed information.

user=> (keys f)
(:authors :categories :contributors :copyright :description
 :encoding :entries :feed-type :image :language :link
 :entry-links :published-date :title :uri)

Applying a key to the feed returns its value, or nil if that attribute is undefined.

user=> (:title f)
"Greg Heartsfield"

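Because an undefined attribute comes back as nil rather than throwing, it is easy to supply fallbacks or to list only the attributes that are populated. The following is a minimal sketch that uses a hand-written map as a stand-in for a parsed feed; `sample-feed` and its nil values are illustrative assumptions, not output from the library.

```clojure
;; Hypothetical stand-in for the map returned by parse-feed.
(def sample-feed
  {:title "Greg Heartsfield"
   :language nil
   :copyright nil
   :feed-type "atom_1.0"})

;; `or` substitutes a fallback when the value is nil.
(def language (or (:language sample-feed) "unknown"))

;; Keep only the keys whose values are non-nil, in sorted order.
(def populated-keys
  (sort (for [[k v] sample-feed :when (some? v)] k)))
```

Here `or` is used instead of the default-value form `(:language sample-feed "unknown")`, since that default only applies when the key is absent, not when it is present with a nil value.
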
Some feed attributes are maps themselves (like :image) or lists of maps (like :entries and :authors).

user=> (map :email (:authors f))
("scsibug@imap.cc")
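
Nested attributes can be reached by threading through the keys one level at a time. This is a rough sketch on a hand-written map; `nested-feed` is hypothetical, and the `:url` key on `:image` is an assumed name used only to show that a nil at any step propagates instead of throwing.

```clojure
;; Hypothetical stand-in for a parsed feed; the email and titles
;; are the ones shown in this post, the rest is made up.
(def nested-feed
  {:image nil
   :authors (list {:email "scsibug@imap.cc"})
   :entries (list {:title "Version Control Diagrams with TikZ"}
                  {:title "Introducing cabal2doap"})})

;; Thread through nested keys; `first` works on any sequence type.
(def first-author-email (-> nested-feed :authors first :email))

;; :image is nil here, so the lookup yields nil (:url is an assumed key).
(def image-url (-> nested-feed :image :url))

;; Destructure the top-level map to grab several attributes at once.
(def summary
  (let [{:keys [authors entries]} nested-feed]
    (str (count authors) " author(s), " (count entries) " entries")))
```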

Check how many entries are in the feed:

user=> (count (:entries f))
18

Determine the feed type:

user=> (:feed-type f)
"atom_1.0"

Look at the first few entry titles:

user=> (map :title (take 3 (:entries f)))
("Version Control Diagrams with TikZ"
 "Introducing cabal2doap"
 "hS3, with ByteString")

Find the most recently updated entry’s title:

user=> (first (map :title (reverse (sort-by :updated-date (:entries f)))))
"Version Control Diagrams with TikZ"

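An alternative to sorting and then reversing is to hand `sort-by` a reversed comparator. The sketch below assumes `:updated-date` holds `java.util.Date` values (or nil when a feed omits the date); `dated-entries` and its timestamps are invented for illustration.

```clojure
(import 'java.util.Date)

;; Hypothetical entries: titles are from this post, dates are invented.
(def dated-entries
  [{:title "hS3, with ByteString"               :updated-date (Date. 1000000000000)}
   {:title "Version Control Diagrams with TikZ" :updated-date (Date. 1200000000000)}
   {:title "Introducing cabal2doap"             :updated-date nil}])

;; Sort newest first by swapping the comparator's arguments.
;; Clojure's `compare` treats nil as smaller than any other value,
;; so undated entries end up last.
(def newest-first
  (sort-by :updated-date #(compare %2 %1) dated-entries))

(def newest-title (:title (first newest-first)))
```
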
Compute what percentage of entries have the word “haskell” in the body (uses clojure.contrib.string):

user=> (let [es (:entries f)]
         (* 100.0 (/ (count (filter #(string/substring? "haskell"
                                       (:value (first (:contents %))))
                                    es))
                     (count es))))
55.55555555555556
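
The same calculation can be wrapped in a reusable function. This sketch makes a few assumptions beyond the post: it uses `clojure.string/includes?` (available from Clojure 1.8) in place of clojure.contrib.string, matches case-insensitively, and treats entries with no body as non-matching. The names `mentions?`, `percent-mentioning`, and `body-entries` are hypothetical, and the sample bodies are made up.

```clojure
(require '[clojure.string :as str])

;; Hypothetical entries shaped like those above; the bodies are made up.
(def body-entries
  [{:contents [{:value "Parsing with Haskell and Parsec"}]}
   {:contents [{:value "A note on Clojure"}]}
   {:contents []}
   {:contents [{:value "More haskell notes"}]}])

(defn mentions?
  "True when the first content body of entry contains word, ignoring case.
  Entries without a body count as not mentioning it."
  [word entry]
  (let [body (-> entry :contents first :value)]
    (boolean (and body
                  (str/includes? (str/lower-case body)
                                 (str/lower-case word))))))

(defn percent-mentioning
  "Percentage of entries whose first content body mentions word.
  Returns 0.0 for an empty collection to avoid dividing by zero."
  [word entries]
  (if (empty? entries)
    0.0
    (* 100.0 (/ (count (filter #(mentions? word %) entries))
                (count entries)))))

(percent-mentioning "haskell" body-entries) ; => 50.0
```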

This library is just a thin wrapper over ROME (currently less than 100 LOC), but for quickly parsing and exploring feeds in Clojure, it should be more convenient than using the Java libraries directly. Look for some improvements from future meetups (or, if you’re in DFW/Texas, come out and join us); contributions on GitHub are appreciated.
