
Python: A quick introduction to the concurrent.futures module

The concurrent.futures module is part of the standard library and provides a high-level API for launching asynchronous tasks. We will walk through code samples for the most common usages of this module.

Executors

This module features the Executor class, which is abstract and cannot be used directly. However, it has two very useful concrete subclasses – ThreadPoolExecutor and ProcessPoolExecutor. As their names suggest, one uses multithreading and the other uses multiprocessing. In both cases, we get a pool of threads or processes and we can submit tasks to this pool. The pool assigns tasks to the available resources (threads or processes) and schedules them to run.

ThreadPoolExecutor

Let’s first see some code.
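A minimal sketch of what it looks like (the sleeper function and the "hello" message are just for illustration):

    import time
    from concurrent.futures import ThreadPoolExecutor

    def sleeper(message):
        # Wait 5 seconds, then return the message we received
        time.sleep(5)
        return message

    executor = ThreadPoolExecutor(max_workers=3)
    future = executor.submit(sleeper, "hello")

    print(future.done())    # False - the task is still sleeping
    time.sleep(5)
    print(future.done())    # True - the task has finished by now
    print(future.result())  # hello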

I hope the code is pretty self-explanatory. We first construct a ThreadPoolExecutor with the number of threads we want in the pool. If we don’t pass max_workers, a default based on the machine’s CPU count is used, but we chose 3 just because we can ;-). Then we submitted a task to the thread pool executor which waits 5 seconds before returning the message it gets as its first argument. When we submit() a task, we get back a Future. As we can see in the docs, the Future object has a method – done() – which tells us if the future has resolved, that is, whether a value has been set for that particular future object. When a task finishes (returns a value or is interrupted by an exception), the thread pool executor sets that value or exception on the future object.

In our example, the task doesn’t complete for 5 seconds, so the first call to done() will return False. We take a really short nap of 5 seconds and then it’s done. We can get the result of the future by calling its result() method.

A good understanding of the Future object and its methods is really beneficial for understanding and doing async programming in Python, so I highly recommend taking the time to read through the docs.

ProcessPoolExecutor

The process pool executor has a very similar API. So let’s modify our previous example and use ProcessPoolExecutor instead.
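A sketch of the modified version – note the __main__ guard, which multiprocessing needs so that freshly spawned worker processes can import this module safely:

    import time
    from concurrent.futures import ProcessPoolExecutor

    def sleeper(message):
        time.sleep(5)
        return message

    if __name__ == "__main__":
        executor = ProcessPoolExecutor(max_workers=3)
        future = executor.submit(sleeper, "hello")

        print(future.done())    # False
        time.sleep(5)
        print(future.done())    # True
        print(future.result())  # hello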

It works perfectly! But of course, we would want to use the ProcessPoolExecutor for CPU-intensive tasks; the ThreadPoolExecutor is better suited for network operations or I/O.

While the API is similar, we must remember that the ProcessPoolExecutor uses the multiprocessing module and is not affected by the Global Interpreter Lock. However, we cannot use any objects that are not picklable, so we need to choose carefully what we use and return inside the callable passed to the process pool executor.

Executor.map()

Both executors have a common method – map(). Like the built-in map function, it allows multiple calls to a provided function, passing each of the items in an iterable to that function. Except, in this case, the functions are called concurrently. For multiprocessing, the iterable is broken into chunks and each of these chunks is passed to the function in a separate process. We can control the chunk size by passing the keyword argument chunksize, which defaults to 1.

Here’s the URL-fetching ThreadPoolExecutor example from the official docs, reworked here as a sketch that uses map().
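Since map() re-raises a task’s exception as soon as the corresponding result comes up, load_url catches errors itself and returns them as values:

    import concurrent.futures
    import urllib.request

    URLS = ['http://www.foxnews.com/',
            'http://www.cnn.com/',
            'http://europe.wsj.com/',
            'http://www.bbc.co.uk/',
            'http://some-made-up-domain.com/']

    def load_url(url):
        # Fetch a single page and report its size; return the error
        # instead of raising so one bad URL doesn't break the loop
        try:
            with urllib.request.urlopen(url, timeout=60) as conn:
                return len(conn.read())
        except Exception as exc:
            return exc

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        for url, result in zip(URLS, executor.map(load_url, URLS)):
            print('%r: %r' % (url, result))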

And the ProcessPoolExecutor example:
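It checks a list of large numbers for primality – a nicely CPU-bound job:

    import concurrent.futures
    import math

    PRIMES = [
        112272535095293,
        112582705942171,
        112272535095293,
        115280095190773,
        115797848077099,
        1099726899285419]

    def is_prime(n):
        if n % 2 == 0:
            return False
        sqrt_n = int(math.floor(math.sqrt(n)))
        for i in range(3, sqrt_n + 1, 2):
            if n % i == 0:
                return False
        return True

    def main():
        with concurrent.futures.ProcessPoolExecutor() as executor:
            for number, prime in zip(PRIMES, executor.map(is_prime, PRIMES)):
                print('%d is prime: %s' % (number, prime))

    if __name__ == '__main__':
        main()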

as_completed() & wait()

The concurrent.futures module has two functions for dealing with the futures returned by the executors. One is as_completed() and the other one is wait().

The as_completed() function takes an iterable of Future objects and starts yielding the futures as soon as they resolve. The main difference between the aforementioned map method and as_completed is that map returns results in the order of the input iterable – the first result from map is the result for the first item. The first future yielded by as_completed, on the other hand, is whichever one completed first.

Let’s see an example:
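In this sketch the tasks sleep for random durations, so the results are printed in completion order, not submission order:

    import time
    import random
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def sleeper(seconds):
        time.sleep(seconds)
        return seconds

    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(sleeper, random.randint(1, 5))
                   for _ in range(5)]
        for future in as_completed(futures):
            # Whichever task finishes first gets printed first
            print("slept for", future.result(), "seconds")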

The wait() function returns a named tuple containing two sets – one with the futures that completed (got a result or raised an exception) and the other with the ones that didn’t complete.

We can see an example here:
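A sketch – with a timeout of 5 seconds, the longer-sleeping tasks end up in the not-done set:

    import time
    import random
    from concurrent.futures import ThreadPoolExecutor, wait

    def sleeper(seconds):
        time.sleep(seconds)
        return seconds

    executor = ThreadPoolExecutor(max_workers=5)
    futures = [executor.submit(sleeper, random.randint(1, 10))
               for _ in range(5)]

    # Give the tasks at most 5 seconds before taking stock
    done, not_done = wait(futures, timeout=5)
    print("done:", len(done))
    print("not done:", len(not_done))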

We can control the behavior of the wait function by defining when it should return. We can pass one of these values to its return_when parameter: FIRST_COMPLETED, FIRST_EXCEPTION and ALL_COMPLETED. By default it’s set to ALL_COMPLETED, so the wait function returns only when all futures complete. But with that parameter, we can choose to return when the first future completes or when the first exception is raised.
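For example, to get control back as soon as any one future finishes:

    import time
    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

    def sleeper(seconds):
        time.sleep(seconds)
        return seconds

    with ThreadPoolExecutor(max_workers=3) as executor:
        futures = [executor.submit(sleeper, s) for s in (3, 1, 2)]
        # Returns as soon as the 1-second task completes
        done, not_done = wait(futures, return_when=FIRST_COMPLETED)
        print(len(done), "done,", len(not_done), "pending")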


Parsing Upwork Job Feed to Monitor Clojure Jobs

I was checking Upwork to assess the job market for Clojure when it hit me – I could parse the Upwork job feed for Clojure and monitor it programmatically. So I fired up the REPL and started coding.

Before I began, I had to choose a Clojure library for parsing RSS feeds. I went for https://github.com/scsibug/feedparser-clj, so I added its dependency ([org.clojars.scsibug/feedparser-clj "0.4.0"]) to my project.clj:
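A minimal project.clj would look something like this (the project name and Clojure version are just placeholders):

    (defproject upwork-feed "0.1.0-SNAPSHOT"
      :dependencies [[org.clojure/clojure "1.8.0"]
                     [org.clojars.scsibug/feedparser-clj "0.4.0"]])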

Now we can start writing some code. First, we fetch the content of the RSS feed and parse it. The parse-feed function from the above-mentioned library does that for us.
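Something along these lines – the feed URL below is a placeholder, so grab the real one from Upwork’s job search page for “clojure”:

    (ns upwork-feed.core
      (:require [feedparser-clj.core :refer [parse-feed]]))

    ;; Placeholder URL - use the actual Upwork RSS feed URL here
    (def feed
      (parse-feed "https://www.upwork.com/ab/feed/jobs/rss?q=clojure"))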

Next, we need a function to extract the data we want. We will map this function over the collection of items.
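A sketch of it (extract-data is just the name I’m using here):

    (defn extract-data
      "Pull the title and the link out of one feed entry."
      [entry]
      {:title (:title entry)
       :url   (:uri entry)})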

Here we’re simply getting the values of the :title and :uri keys and putting them in another hashmap, naming our key :url instead of their :uri.

We can grab the collection of items from the :entries key of the feed variable we declared before. So here’s our main function:
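Reusing extract-data from the sketch above:

    (defn -main []
      (let [items (map extract-data (:entries feed))]
        (doseq [item items]
          (println (:title item))
          (println (:url item)))))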

We’re mapping the function we wrote over the entries, which gives us a collection of hashmaps. Then we’re using doseq to iterate over them and print the data out.

The final code looks like this:
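Putting the sketches together (the feed URL is still a placeholder):

    (ns upwork-feed.core
      (:require [feedparser-clj.core :refer [parse-feed]]))

    ;; Placeholder URL - use the actual Upwork RSS feed URL here
    (def feed
      (parse-feed "https://www.upwork.com/ab/feed/jobs/rss?q=clojure"))

    (defn extract-data [entry]
      {:title (:title entry)
       :url   (:uri entry)})

    (defn -main []
      (doseq [item (map extract-data (:entries feed))]
        (println (:title item))
        (println (:url item))))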

Here, we have extracted only two fields and printed them out. We extracted the data into a new hashmap as an example; as a matter of fact, we could just print them straight from the original feed variable. Then the code would have been shorter.
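A sketch of the shorter version:

    (defn -main []
      (doseq [entry (:entries feed)]
        (println (:title entry))
        (println (:uri entry))))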


Web Scraping with Clojure

I have recently started learning Clojure and I must say I am totally hooked. Clojure is a sane Lisp on the JVM, so I can express myself better while still taking advantage of the huge JVM ecosystem. This is just wonderful!

After finishing the popular book Clojure for the Brave and True, I wanted to try something out myself. I decided to try web scraping with Clojure. In this post, I am going to walk you through a very simple web scraping task.

We’re going to scrape KAT (Kick Ass Torrents) for TV series. To keep things simple, we will scrape the first page of the TV section and print out the titles. One reason I like KAT is that they serve their responses gzipped – if your HTTP client can’t handle that, you probably want to switch.

We will use the following libraries for the task:

  • http-kit
  • Enlive

Most Clojure tutorials would use Java’s built-in URL class with Enlive’s html-resource function, but in our case that would not work because it can’t handle compressed responses well. So we will use http-kit instead.

To begin with, we would add these libraries to our project.clj file (assuming we’re using Leiningen).
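For instance (the version numbers are just ones known to work – use whatever is current):

    (defproject kat-scraper "0.1.0-SNAPSHOT"
      :dependencies [[org.clojure/clojure "1.8.0"]
                     [http-kit "2.1.19"]
                     [enlive "1.1.6"]])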

Now we’re ready to start writing the code. Let’s first :require our libraries.
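Using the namespaces the two libraries expose (the project namespace kat-scraper.core is a placeholder):

    (ns kat-scraper.core
      (:require [org.httpkit.client :as http]
                [net.cgrand.enlive-html :as html]))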

First we have to fetch the HTML response. We can use http-kit’s get function to grab the HTML. This function returns a promise, so we have to dereference it using deref or the shorthand @. When the promise resolves, we get a hashmap with a :body key along with :status and a few other keys related to the request. We can pass this HTML body to Enlive’s html-snippet function to get an iterable DOM-like object from which we can select elements using the select function.
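A sketch of such a function:

    (defn get-dom
      "Fetch a page and parse its HTML body into an Enlive snippet."
      [url]
      (let [{:keys [body]} @(http/get url {:insecure? true})]
        (html/html-snippet body)))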

We are using the {:insecure? true} option to ignore issues with SSL. So far, we have a function get-dom which gives us a DOM-like object on which we can run select. We will now write another function to extract the titles from this DOM-like object.

Each torrent title (which is a link, aka an anchor tag) has the CSS class cellMainLink, so we can select a.cellMainLink to get all the title links. Each title link has its text in the :content key, and the value of :content is a vector, so we need first to grab the actual text. Here’s what I wrote:
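The sketch below uses extract-titles as an illustrative name:

    (defn extract-titles
      "Select every title link and pull out its text."
      [dom]
      (map (comp first :content)
           (html/select dom [:a.cellMainLink])))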

I simply could not resist using comp to do some magic here. comp combines two functions into one – in our case, a function that first grabs the :content and then takes its first element.

Finally, we can run our functions like this:
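Assuming the TV section lives at a URL like the placeholder below:

    (defn -main []
      ;; Placeholder URL for the KAT TV section
      (doseq [title (extract-titles (get-dom "https://kat.cr/tv/"))]
        (println title)))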

Here’s the complete file:
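Stitched together from the sketches above (URL still a placeholder):

    (ns kat-scraper.core
      (:require [org.httpkit.client :as http]
                [net.cgrand.enlive-html :as html]))

    (defn get-dom [url]
      (html/html-snippet
        (:body @(http/get url {:insecure? true}))))

    (defn extract-titles [dom]
      (map (comp first :content)
           (html/select dom [:a.cellMainLink])))

    (defn -main []
      ;; Placeholder URL for the KAT TV section
      (doseq [title (extract-titles (get-dom "https://kat.cr/tv/"))]
        (println title)))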

The code is under 20 lines! 😀