
Python: A quick introduction to the concurrent.futures module

The concurrent.futures module is part of the standard library and provides a high-level API for launching asynchronous tasks. We will walk through code samples for the most common usages of this module.

Executors

This module features the Executor class, which is abstract and cannot be used directly. However, it has two very useful concrete subclasses – ThreadPoolExecutor and ProcessPoolExecutor. As their names suggest, one uses multithreading and the other uses multiprocessing. In both cases, we get a pool of threads or processes and we can submit tasks to this pool. The pool assigns tasks to the available resources (threads or processes) and schedules them to run.

ThreadPoolExecutor

Let’s first see some code.
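A minimal sketch of what it looks like (the sleeper function and the "hello" message are just for illustration):

    import time
    from concurrent.futures import ThreadPoolExecutor

    def sleeper(message):
        # Wait 5 seconds, then return the message we received
        time.sleep(5)
        return message

    executor = ThreadPoolExecutor(max_workers=3)
    future = executor.submit(sleeper, "hello")

    print(future.done())    # False - the task is still sleeping
    time.sleep(5)
    print(future.done())    # True - the task has finished by now
    print(future.result())  # hello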

I hope the code is pretty self-explanatory. We first construct a ThreadPoolExecutor with the number of threads we want in the pool. If we don’t pass max_workers, a default based on the machine’s CPU count is used, but we chose 3 just because we can ;-). Then we submitted a task to the thread pool executor which waits 5 seconds before returning the message it gets as its first argument. When we submit() a task, we get back a Future. As we can see in the docs, the Future object has a method – done() – which tells us if the future has resolved, that is, whether a value has been set for that particular future object. When a task finishes (returns a value or is interrupted by an exception), the thread pool executor sets that value or exception on the future object.

In our example, the task doesn’t complete for 5 seconds, so the first call to done() will return False. We take a really short nap of 5 seconds and then it’s done. We can get the result of the future by calling its result() method.

A good understanding of the Future object and its methods is really beneficial for understanding and doing async programming in Python, so I highly recommend taking the time to read through the docs.

ProcessPoolExecutor

The process pool executor has a very similar API. So let’s modify our previous example and use ProcessPoolExecutor instead.
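A sketch of the modified version – note the __main__ guard, which multiprocessing needs so that freshly spawned worker processes can import this module safely:

    import time
    from concurrent.futures import ProcessPoolExecutor

    def sleeper(message):
        time.sleep(5)
        return message

    if __name__ == "__main__":
        executor = ProcessPoolExecutor(max_workers=3)
        future = executor.submit(sleeper, "hello")

        print(future.done())    # False
        time.sleep(5)
        print(future.done())    # True
        print(future.result())  # hello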

It works perfectly! But of course, we would want to use the ProcessPoolExecutor for CPU-intensive tasks; the ThreadPoolExecutor is better suited for network operations or I/O.

While the API is similar, we must remember that the ProcessPoolExecutor uses the multiprocessing module and is not affected by the Global Interpreter Lock. However, we cannot use any objects that are not picklable, so we need to choose carefully what we use and return inside the callable passed to the process pool executor.

Executor.map()

Both executors have a common method – map(). Like the built-in map function, it allows multiple calls to a provided function, passing each of the items in an iterable to that function. Except, in this case, the functions are called concurrently. For multiprocessing, the iterable is broken into chunks and each of these chunks is passed to the function in a separate process. We can control the chunk size by passing the keyword argument chunksize, which defaults to 1.

Here’s the URL-fetching ThreadPoolExecutor example from the official docs, reworked here as a sketch that uses map().
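Since map() re-raises a task’s exception as soon as the corresponding result comes up, load_url catches errors itself and returns them as values:

    import concurrent.futures
    import urllib.request

    URLS = ['http://www.foxnews.com/',
            'http://www.cnn.com/',
            'http://europe.wsj.com/',
            'http://www.bbc.co.uk/',
            'http://some-made-up-domain.com/']

    def load_url(url):
        # Fetch a single page and report its size; return the error
        # instead of raising so one bad URL doesn't break the loop
        try:
            with urllib.request.urlopen(url, timeout=60) as conn:
                return len(conn.read())
        except Exception as exc:
            return exc

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        for url, result in zip(URLS, executor.map(load_url, URLS)):
            print('%r: %r' % (url, result))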

And the ProcessPoolExecutor example:
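It checks a list of large numbers for primality – a nicely CPU-bound job:

    import concurrent.futures
    import math

    PRIMES = [
        112272535095293,
        112582705942171,
        112272535095293,
        115280095190773,
        115797848077099,
        1099726899285419]

    def is_prime(n):
        if n % 2 == 0:
            return False
        sqrt_n = int(math.floor(math.sqrt(n)))
        for i in range(3, sqrt_n + 1, 2):
            if n % i == 0:
                return False
        return True

    def main():
        with concurrent.futures.ProcessPoolExecutor() as executor:
            for number, prime in zip(PRIMES, executor.map(is_prime, PRIMES)):
                print('%d is prime: %s' % (number, prime))

    if __name__ == '__main__':
        main()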

as_completed() & wait()

The concurrent.futures module has two functions for dealing with the futures returned by the executors. One is as_completed() and the other one is wait().

The as_completed() function takes an iterable of Future objects and starts yielding the futures as soon as they resolve. The main difference between the aforementioned map method and as_completed is that map returns results in the order of the input iterable – the first result from map is the result for the first item. The first future yielded by as_completed, on the other hand, is whichever one completed first.

Let’s see an example:
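In this sketch the tasks sleep for random durations, so the results are printed in completion order, not submission order:

    import time
    import random
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def sleeper(seconds):
        time.sleep(seconds)
        return seconds

    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(sleeper, random.randint(1, 5))
                   for _ in range(5)]
        for future in as_completed(futures):
            # Whichever task finishes first gets printed first
            print("slept for", future.result(), "seconds")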

The wait() function returns a named tuple containing two sets – one with the futures that completed (got a result or raised an exception) and the other with the ones that didn’t complete.

We can see an example here:
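A sketch – with a timeout of 5 seconds, the longer-sleeping tasks end up in the not-done set:

    import time
    import random
    from concurrent.futures import ThreadPoolExecutor, wait

    def sleeper(seconds):
        time.sleep(seconds)
        return seconds

    executor = ThreadPoolExecutor(max_workers=5)
    futures = [executor.submit(sleeper, random.randint(1, 10))
               for _ in range(5)]

    # Give the tasks at most 5 seconds before taking stock
    done, not_done = wait(futures, timeout=5)
    print("done:", len(done))
    print("not done:", len(not_done))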

We can control the behavior of the wait function by defining when it should return. We can pass one of these values to its return_when parameter: FIRST_COMPLETED, FIRST_EXCEPTION and ALL_COMPLETED. By default it’s set to ALL_COMPLETED, so the wait function returns only when all futures complete. But with that parameter, we can choose to return when the first future completes or when the first exception is raised.
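For example, to get control back as soon as any one future finishes:

    import time
    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

    def sleeper(seconds):
        time.sleep(seconds)
        return seconds

    with ThreadPoolExecutor(max_workers=3) as executor:
        futures = [executor.submit(sleeper, s) for s in (3, 1, 2)]
        # Returns as soon as the 1-second task completes
        done, not_done = wait(futures, return_when=FIRST_COMPLETED)
        print(len(done), "done,", len(not_done), "pending")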


Parsing Upwork Job Feed to Monitor Clojure Jobs

I was checking Upwork to assess the job market for Clojure when it hit me – I could parse the Upwork job feed for Clojure and monitor it programmatically. So I fired up the REPL and started coding.

Before I began, I had to choose a Clojure library for parsing RSS feeds. I went for https://github.com/scsibug/feedparser-clj, so I added its dependency ([org.clojars.scsibug/feedparser-clj "0.4.0"]) to my project.clj:
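A minimal project.clj would look something like this (the project name and Clojure version are just placeholders):

    (defproject upwork-feed "0.1.0-SNAPSHOT"
      :dependencies [[org.clojure/clojure "1.8.0"]
                     [org.clojars.scsibug/feedparser-clj "0.4.0"]])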

Now we can start writing some code. First, we fetch the content of the RSS feed and parse it. The parse-feed function from the above-mentioned library does that for us.
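Something along these lines – the feed URL below is a placeholder, so grab the real one from Upwork’s job search page for “clojure”:

    (ns upwork-feed.core
      (:require [feedparser-clj.core :refer [parse-feed]]))

    ;; Placeholder URL - use the actual Upwork RSS feed URL here
    (def feed
      (parse-feed "https://www.upwork.com/ab/feed/jobs/rss?q=clojure"))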

Next, we need a function to extract the data we want. We will map this function over the collection of items.
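A sketch of it (extract-data is just the name I’m using here):

    (defn extract-data
      "Pull the title and the link out of one feed entry."
      [entry]
      {:title (:title entry)
       :url   (:uri entry)})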

Here we’re simply getting the values of the :title and :uri keys and putting them in another hashmap, naming our key :url instead of their :uri.

We can grab the collection of items from the :entries key of the feed variable we declared before. So here’s our main function:
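Reusing extract-data from the sketch above:

    (defn -main []
      (let [items (map extract-data (:entries feed))]
        (doseq [item items]
          (println (:title item))
          (println (:url item)))))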

We’re mapping the function we wrote over the entries, which gives us a collection of hashmaps. Then we’re using doseq to iterate over them and print the data out.

The final code looks like this:
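Putting the sketches together (the feed URL is still a placeholder):

    (ns upwork-feed.core
      (:require [feedparser-clj.core :refer [parse-feed]]))

    ;; Placeholder URL - use the actual Upwork RSS feed URL here
    (def feed
      (parse-feed "https://www.upwork.com/ab/feed/jobs/rss?q=clojure"))

    (defn extract-data [entry]
      {:title (:title entry)
       :url   (:uri entry)})

    (defn -main []
      (doseq [item (map extract-data (:entries feed))]
        (println (:title item))
        (println (:url item))))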

Here, we have extracted only two fields and printed them out. We extracted the data into a new hashmap as an example; as a matter of fact, we could just print them straight from the original feed variable. Then the code would have been shorter.
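A sketch of the shorter version:

    (defn -main []
      (doseq [entry (:entries feed)]
        (println (:title entry))
        (println (:uri entry))))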


Web Scraping with Clojure

I have recently started learning Clojure and I must say I am totally hooked. Clojure is a sane Lisp on the JVM, so I can express myself better while still taking advantage of the huge JVM ecosystem. This is just wonderful!

After finishing the popular book Clojure for the Brave and True, I wanted to try something out myself. I decided to try web scraping with Clojure. In this post, I am going to walk you through a very simple web scraping task.

We’re going to scrape KAT (Kick Ass Torrents) for TV series. To keep things simple, we will scrape the first page of the TV section and print out the titles. One reason I like KAT is that they serve their responses gzipped – if your HTTP client can’t handle that, you probably want to switch.

We will use the following libraries for the task:

  • http-kit
  • Enlive

Most Clojure tutorials would use Java’s built-in URL class with Enlive’s html-resource function, but in our case that would not work because it can’t handle compressed responses well. So we will use http-kit instead.

To begin with, we would add these libraries to our project.clj file (assuming we’re using Leiningen).
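For instance (the version numbers are just ones known to work – use whatever is current):

    (defproject kat-scraper "0.1.0-SNAPSHOT"
      :dependencies [[org.clojure/clojure "1.8.0"]
                     [http-kit "2.1.19"]
                     [enlive "1.1.6"]])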

Now we’re ready to start writing the code. Let’s first :require our libraries.
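Using the namespaces the two libraries expose (the project namespace kat-scraper.core is a placeholder):

    (ns kat-scraper.core
      (:require [org.httpkit.client :as http]
                [net.cgrand.enlive-html :as html]))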

First we have to fetch the HTML response. We can use http-kit’s get function to grab the HTML. This function returns a promise, so we have to dereference it using deref or the shorthand @. When the promise resolves, we get a hashmap with a :body key along with :status and a few other keys related to the request. We can pass this HTML body to Enlive’s html-snippet function to get an iterable DOM-like object from which we can select elements using the select function.
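A sketch of such a function:

    (defn get-dom
      "Fetch a page and parse its HTML body into an Enlive snippet."
      [url]
      (let [{:keys [body]} @(http/get url {:insecure? true})]
        (html/html-snippet body)))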

We are using the {:insecure? true} option to ignore issues with SSL. So far, we have a function get-dom which gives us a DOM-like object on which we can run select. We will now write another function to extract the titles from this DOM-like object.

Each torrent title (which is a link, aka an anchor tag) has the CSS class cellMainLink, so we can select a.cellMainLink to get all the title links. Each title link has its text in the :content key, and the value of :content is a vector, so we need first to grab the actual text. Here’s what I wrote:
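The sketch below uses extract-titles as an illustrative name:

    (defn extract-titles
      "Select every title link and pull out its text."
      [dom]
      (map (comp first :content)
           (html/select dom [:a.cellMainLink])))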

I simply could not resist using comp to do some magic here. comp combines two functions into one – in our case, a function that first grabs the :content and then takes its first element.

Finally, we can run our functions like this:
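Assuming the TV section lives at a URL like the placeholder below:

    (defn -main []
      ;; Placeholder URL for the KAT TV section
      (doseq [title (extract-titles (get-dom "https://kat.cr/tv/"))]
        (println title)))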

Here’s the complete file:
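Stitched together from the sketches above (URL still a placeholder):

    (ns kat-scraper.core
      (:require [org.httpkit.client :as http]
                [net.cgrand.enlive-html :as html]))

    (defn get-dom [url]
      (html/html-snippet
        (:body @(http/get url {:insecure? true}))))

    (defn extract-titles [dom]
      (map (comp first :content)
           (html/select dom [:a.cellMainLink])))

    (defn -main []
      ;; Placeholder URL for the KAT TV section
      (doseq [title (extract-titles (get-dom "https://kat.cr/tv/"))]
        (println title)))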

The code is under 20 lines! 😀