Parsing Upwork Job Feed to Monitor Clojure Jobs

I was checking Upwork to assess the job market for Clojure when it hit me: I could parse the Upwork job feed for Clojure and monitor it programmatically. So I fired up the REPL and started coding.

Before I began, I had to choose a Clojure library to parse RSS feeds. I went for feedparser-clj (https://github.com/scsibug/feedparser-clj) and added its dependency, [org.clojars.scsibug/feedparser-clj "0.4.0"], to my project.clj.
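For reference, a minimal project.clj carrying that dependency might look like the sketch below. The project name upwork-feed, the Clojure version and the :main namespace are assumptions made for illustration; only the feedparser-clj coordinate comes from the text above.

```clojure
;; project.clj
;; The project name, Clojure version and :main namespace are placeholders;
;; only the feedparser-clj coordinate is taken from the post.
(defproject upwork-feed "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.8.0"]
                 [org.clojars.scsibug/feedparser-clj "0.4.0"]]
  :main upwork-feed.core)
```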

Now we can start writing some code. First, we fetch the content of the RSS feed and parse it. The parse-feed function from the above-mentioned library does that for us.
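A minimal sketch of that step, assuming the namespace is called upwork-feed.core. The feed address is a placeholder (put the RSS link of your own Upwork search there) and load-feed is a helper name made up for this example.

```clojure
(ns upwork-feed.core
  (:require [feedparser-clj.core :refer [parse-feed]]))

;; Placeholder address: replace it with the RSS link of your own Upwork search.
(def feed-url "https://example.com/upwork-clojure-jobs.rss")

;; parse-feed downloads the feed and returns it as a map-like structure;
;; the individual items sit under its :entries key.
(defn load-feed []
  (parse-feed feed-url))
```

At the REPL, (def feed (load-feed)) then gives us a feed variable to explore.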

Next, we need a function to extract the data we need. We will map this function over the collection of items.

Here we’re simply getting the values of the :title and :uri keys and putting them in another hashmap. We’re naming our key :url instead of their :uri.
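Such a function could be as small as this sketch. The name extract is my own choice for illustration, and the entry passed to it at the end is hand-written sample data rather than a real feed item.

```clojure
;; Keep only the title and the link of a feed entry, renaming :uri to :url.
(defn extract [item]
  {:title (:title item)
   :url   (:uri item)})

;; Trying it on a hand-written entry standing in for a real feed item:
(extract {:title       "Clojure developer needed"
          :uri         "https://example.com/jobs/1"
          :description "ignored"})
;; => {:title "Clojure developer needed", :url "https://example.com/jobs/1"}
```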

We can grab the collection of items in the :entries key of the feed variable we declared before. So here’s our main function:

We’re mapping the function we wrote over the entries and getting a collection of hashmaps. Then we’re using doseq to iterate over them and print the data out.

The final code looks like this:
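Put together, the whole program might look like the sketch below. The namespace name, the placeholder feed address and the function name extract are assumptions for illustration, not the original listing.

```clojure
(ns upwork-feed.core
  (:require [feedparser-clj.core :refer [parse-feed]]))

;; Placeholder address: replace it with the RSS link of your own Upwork search.
(def feed-url "https://example.com/upwork-clojure-jobs.rss")

(defn extract
  "Keeps the title and link of a feed entry, renaming :uri to :url."
  [item]
  {:title (:title item)
   :url   (:uri item)})

(defn -main [& _]
  (let [feed (parse-feed feed-url)]
    ;; Print one title/link pair per job, separated by a blank line.
    (doseq [{:keys [title url]} (map extract (:entries feed))]
      (println title)
      (println url)
      (println))))
```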

Here, we have extracted only two fields and printed them out. We extracted the data into a new hashmap as an example. As a matter of fact, we could just print them out from the original feed variable. Then the code would have been shorter:
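A sketch of that shorter variant, reading :title and :uri straight from each entry. The namespace name and the feed address are placeholders.

```clojure
(ns upwork-feed.core
  (:require [feedparser-clj.core :refer [parse-feed]]))

(defn -main [& _]
  ;; Placeholder address: replace it with the RSS link of your own Upwork search.
  (let [feed (parse-feed "https://example.com/upwork-clojure-jobs.rss")]
    ;; No intermediate hashmap: print the fields of each entry directly.
    (doseq [entry (:entries feed)]
      (println (:title entry))
      (println (:uri entry)))))
```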

Web Scraping with Clojure

I have recently started learning Clojure and I must say I am totally hooked. Clojure is a sane Lisp on the JVM. So I can express myself better while being able to take advantage of the huge JVM ecosystem. This is just wonderful!

After finishing the popular book Clojure for the Brave and True, I wanted to try something out myself. I decided to try web scraping with Clojure. In this post, I am going to walk you through a very simple web scraping task.

We’re going to scrape KAT (Kick Ass Torrents) for TV series. To keep things simple, we will scrape the first page of the TV section and print out the titles. The reason I like KAT is that they serve the response gzipped – if your HTTP client can’t handle their response, you probably want to switch.

We will use the following libraries for the task:

  • http-kit
  • Enlive

Most Clojure tutorials use Java’s built-in URL class with Enlive’s html-resource function, but in our case that would not work because it can’t handle compressed responses well. So we will use http-kit instead.

To begin with, we will add these libraries to our project.clj file (assuming we’re using Leiningen).

Now we’re ready to start writing code. Let’s first :require our libraries.
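The dependency list could look like the sketch below. The project name and all version numbers here are assumptions; check Clojars for the releases you actually want.

```clojure
;; project.clj
;; Project name and version numbers are assumptions for illustration.
(defproject kat-scraper "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.8.0"]
                 [http-kit "2.1.18"]
                 [enlive "1.1.6"]]
  :main kat-scraper.core)
```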
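One way to write that, assuming a namespace called kat-scraper.core and the aliases http and html (both names are my own choice):

```clojure
(ns kat-scraper.core
  (:require [org.httpkit.client :as http]        ; HTTP client
            [net.cgrand.enlive-html :as html]))  ; HTML parsing and selectors
```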

First we have to fetch the HTML response. We can use http-kit’s get function to grab the HTML. This function returns a promise, so we have to dereference it using deref or the shorthand syntax @. Once the promise is resolved, we get a hashmap with a :body key along with :status and a few other keys related to the request. We can pass the HTML in :body to Enlive’s html-snippet function to get a DOM-like sequence of nodes, from which we can pick elements using the select function.

We are passing {:insecure? true} to ignore issues with SSL. So far, we have a function get-dom which gives us a DOM-like object on which we can do select. We will now write another function which extracts the titles from this DOM-like object.

Each torrent title (which is a link, aka anchor tag) has the CSS class cellMainLink, so we can select a.cellMainLink to get all the title links. Each title link has its text in the :content key. The value of :content is a vector, so we need to use first on it to grab the actual text. Here’s what I wrote:

I simply could not resist using comp to do some magic here. comp composes functions into a single function that applies them from right to left, so (comp first :content) first grabs the :content of a node and then takes its first element.
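A sketch of such a get-dom function follows. The address in tv-url is a placeholder, since the exact URL of the TV section is not given here; the namespace name and aliases are assumptions too.

```clojure
(ns kat-scraper.core
  (:require [org.httpkit.client :as http]
            [net.cgrand.enlive-html :as html]))

;; Placeholder: substitute the address of the TV section.
(def tv-url "https://example.com/tv/")

(defn get-dom
  "Fetches tv-url and returns the Enlive nodes parsed from the response body."
  []
  ;; http/get returns a promise; @ blocks until the response map is ready.
  (let [{:keys [body]} @(http/get tv-url {:insecure? true})]
    (html/html-snippet body)))
```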
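A sketch of that function is below. The name extract-titles and the namespace are my own choices, and the HTML passed to it at the end is a hand-written snippet standing in for the live page.

```clojure
(ns kat-scraper.titles
  (:require [net.cgrand.enlive-html :as html]))

(defn extract-titles
  "Returns the text of every a.cellMainLink element found in dom."
  [dom]
  ;; (comp first :content) looks up :content, then takes its first element.
  (map (comp first :content)
       (html/select dom [:a.cellMainLink])))

;; Trying it on a hand-written snippet instead of the live page:
(extract-titles
  (html/html-snippet
    "<div><a class=\"cellMainLink\" href=\"#\">Some Show S01E01</a></div>"))
;; => ("Some Show S01E01")
```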

Finally, we can run our functions like this:

Here’s the complete file:
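Assembled from the steps above, the file might look like this sketch. The namespace name and the placeholder address in tv-url are assumptions, not the original listing.

```clojure
(ns kat-scraper.core
  (:require [org.httpkit.client :as http]
            [net.cgrand.enlive-html :as html]))

;; Placeholder: substitute the address of the TV section.
(def tv-url "https://example.com/tv/")

(defn get-dom []
  (html/html-snippet (:body @(http/get tv-url {:insecure? true}))))

(defn extract-titles [dom]
  (map (comp first :content) (html/select dom [:a.cellMainLink])))

(defn -main [& _]
  (doseq [title (extract-titles (get-dom))]
    (println title)))
```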

The code is under 20 lines! 😀

Scripting Clojure

Clojure’s startup time is not ideal for day-to-day scripting. But in case you do want to run Clojure scripts, it is possible.

We need to use the “lein exec” plugin for Leiningen. Edit ~/.lein/profiles.clj and add this code:

The next time you run any Leiningen command, it will install the plugin. You can then execute single-file Clojure programs by running lein exec followed by the path to the file.
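The profile entry could look like the sketch below. The version number is an assumption; check Clojars for the current lein-exec release.

```clojure
;; ~/.lein/profiles.clj
;; The version number is an assumption; use whatever release is current.
{:user {:plugins [[lein-exec "0.3.7"]]}}
```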

For more convenience, I added this to my .zshrc:

Now I can do this:

You can also use “lein exec” in a shebang line, like this:

Make the file executable and run it like:

Clojure: Re-running programs from the Lein REPL

I have started playing with Clojure lately and I found the Clojure startup time quite slow. Every time I make a change and re-run the program with lein run, I have to wait quite a while.

For quick prototyping, a better approach is to start the REPL with lein repl and keep it running.

We run our main function from the REPL using (-main). When we make changes, we reload our code and then run the -main function again. To follow this approach, we have to make sure that [org.clojure/tools.namespace "0.2.7"] is in our Leiningen dependencies list (add it to the project.clj file inside your project). Then launch the REPL and type this code:
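A sketch of that REPL session is below. The forms inside comment are meant to be evaluated one at a time at the prompt. Calling (-main) unqualified assumes the REPL is sitting in the namespace that defines it; from the user namespace you would qualify it with your own namespace instead.

```clojure
;; Pull refresh into the current REPL namespace.
(require '[clojure.tools.namespace.repl :refer [refresh]])

;; After editing a source file, evaluate these one at a time:
(comment
  (refresh) ; reloads the namespaces whose source files changed
  (-main))  ; runs the program again
```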

The (refresh) call would reload the changed namespaces and (-main) would run the main function again.

So that works, but every time we open the REPL we have to type the require command again. One of the awesome features of Leiningen is that it lets us load source code from additional paths based on profiles. We will add a dev profile and load some code from a different path. This code will be loaded by Leiningen while running the app but won’t be part of our JAR file.

To add a dev profile, let’s update the “project.clj”:
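With the dev profile added, project.clj might look like this sketch. The project name myapp, the Clojure version and the :main namespace are placeholders; the startup path matches the directory used in this post.

```clojure
;; project.clj
;; Project name, Clojure version and :main namespace are placeholders.
(defproject myapp "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.8.0"]
                 [org.clojure/tools.namespace "0.2.7"]]
  :main myapp.core
  ;; Code under startup/ is on the classpath during development only.
  :profiles {:dev {:source-paths ["startup"]}})
```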

Now in our project root, if we create a file “startup/user.clj”, it will be loaded on startup.

So we add this code to the “startup/user.clj” file:

Note that the refresh function takes an “:after” option. If we pass it a fully qualified function name as a symbol, refresh will call that function after reloading the changed namespaces.

Now any time we make changes to our files, we can just call (user/rerun) and it will reload the changed code and execute the “-main” function.
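A sketch of what that file could contain. Here myapp.core is a placeholder for the namespace that holds your own -main, and -main is assumed to accept being called with no arguments.

```clojure
;; startup/user.clj
(ns user
  (:require [clojure.tools.namespace.repl :refer [refresh]]))

(defn rerun
  "Reloads changed namespaces, then calls -main."
  []
  ;; myapp.core is a placeholder; the symbol is only resolved after the reload.
  (refresh :after 'myapp.core/-main))
```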