I have recently started learning Clojure and I must say I am totally hooked. Clojure is a sane Lisp on the JVM, so I can express myself better while still taking advantage of the huge JVM ecosystem. This is just wonderful!
After finishing the popular book Clojure for the Brave and True, I wanted to try something out myself, and I decided to try web scraping with Clojure. In this post, I am going to walk you through a very simple web scraping task.
We’re going to scrape KAT (Kick Ass Torrents) for TV series. To keep things simple, we will only scrape the first page of the TV section and print out the titles. The reason I like KAT is that they serve their responses gzipped – if your HTTP client can’t handle that, you probably want to switch.
We will use the following libraries for the task:
- http-kit
- Enlive
Most Clojure tutorials would use Java’s built-in URL class with Enlive’s `html-resource` function, but that would not work in our case because it does not handle compressed responses well. So we will use http-kit instead.
To begin with, let’s add these libraries to our `project.clj` file (assuming we’re using Leiningen).
```clojure
(defproject cljnoob "0.0.1"
  :description "A very simple project to learn Clojure!"
  :main cljnoob.core
  :dependencies [[org.clojure/clojure "1.8.0"]
                 [enlive "1.1.6"]
                 [http-kit "2.1.18"]])
```
Now we’re ready to start writing the code. Let’s first `:require` our libraries.
```clojure
(ns cljnoob.core
  (:require [net.cgrand.enlive-html :as html]
            [org.httpkit.client :as http]))
```
First we have to fetch the HTML response. We can use http-kit’s `get` function to grab the HTML. This function returns a promise, so we have to dereference it using `deref` or the shorthand syntax `@`. When the promise is resolved, we get a hash map with a `:body` key along with `:status` and a few other keys related to the request. We can pass this HTML body to Enlive’s `html-snippet` function to get an iterable, DOM-like object from which we can select elements using the `select` function.
```clojure
(defn get-dom []
  (html/html-snippet
   (:body @(http/get "http://kat.cr/tv/"
                     {:insecure? true}))))
```
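If you’re curious what the dereferenced promise actually contains, here is a quick REPL sketch (same URL and options as above) that pulls the `:status`, `:body` and `:error` keys out of http-kit’s response map – just an illustration, not part of the final program:

```clojure
;; Peek at the response map we get after dereferencing http-kit's promise.
(let [{:keys [status body error]} @(http/get "http://kat.cr/tv/" {:insecure? true})]
  (if error
    (println "Request failed:" error)
    (println "Got status" status "with" (count body) "characters of HTML")))
```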
We are using the `{:insecure? true}` part to ignore issues with SSL. So far, we have a function `get-dom` which gives us a DOM-like object we can call `select` on. We will now write another function that extracts the titles from this DOM-like object.
Each torrent title (which is a link, i.e. an anchor tag) has the CSS class `cellMainLink`, so we can select `a.cellMainLink` to get all the title links. Each title link keeps its text in the `:content` key, and the value of `:content` is a vector, so we need to call `first` on it to grab the actual text. Here’s what I wrote:
```clojure
(defn extract-titles [dom]
  (map
   (comp first :content)
   (html/select dom [:a.cellMainLink])))
```
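In case the node shape isn’t obvious, here is a rough sketch of what a single node returned by `select` looks like – the `:href` and title below are made-up placeholders, not real data:

```clojure
;; An Enlive node is a plain map; a title link comes back roughly like this
;; (the :href and text are illustrative placeholders).
(def sample-node
  {:tag :a
   :attrs {:class "cellMainLink" :href "/some-show-s01e01.html"}
   :content ["Some Show S01E01"]})

(first (:content sample-node))
;; => "Some Show S01E01"
```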
I simply could not resist using `comp` to do some magic here. `comp` combines two functions into one; in our case, the composed function grabs the `:content` and then takes its first element.
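Concretely, `(comp first :content)` builds a function that applies `:content` first and then `first`, so the two forms below do the same thing (the node literal is just for illustration):

```clojure
;; comp composes right-to-left: apply :content, then first.
((comp first :content) {:tag :a :content ["Example Title"]})
;; => "Example Title"

;; Equivalent without comp:
(first (:content {:tag :a :content ["Example Title"]}))
;; => "Example Title"
```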
Finally, we can run our functions like this:
```clojure
(defn -main []
  (let [titles (extract-titles (get-dom))]
    (println titles)))
```
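Note that `println` on the result prints the whole lazy sequence as one long list. If you’d rather see one title per line, a small variation of `-main` using `doseq` (same `get-dom` and `extract-titles` as above) would do it:

```clojure
;; Alternative -main: print each title on its own line.
(defn -main []
  (doseq [title (extract-titles (get-dom))]
    (println title)))
```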
Here’s the complete file:
```clojure
(ns cljnoob.core
  (:require [net.cgrand.enlive-html :as html]
            [org.httpkit.client :as http]))

(defn get-dom []
  (html/html-snippet
   (:body @(http/get "http://kat.cr/tv/"
                     {:insecure? true}))))

(defn extract-titles [dom]
  (map
   (comp first :content)
   (html/select dom [:a.cellMainLink])))

(defn -main []
  (let [titles (extract-titles (get-dom))]
    (println titles)))
```
The code is under 20 lines! 😀