Categories
Python

HTML and XML Parsing in Python using BeautifulSoup :)

I had been looking for a good Python library for handling HTML and XML. I knew about BeautifulSoup but never paid it much attention. This time, while looking for a way to scrape web sites and harvest links with Python, I came across a nice tutorial that demonstrated the “BeautifulSoup” module. I was impressed and decided to try it myself. It’s just amazing! In fact, it’s a shame if you have worked with Python but haven’t used this module. I later learned that BeautifulSoup is very well known in the Python world.

How to use it?

First download it and extract the archive. Then install the package by running the setup script from inside the extracted directory (for a source archive that is normally python setup.py install).

Once you have installed it, you can check that everything is okay by typing this code into the interactive Python shell:
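A minimal sketch of such a check, assuming the newer Beautiful Soup 4 package (imported as bs4) on Python 3. With the older release described in this post, the import line would be from BeautifulSoup import BeautifulSoup instead.

```python
# Assumption: Beautiful Soup 4 is installed and importable as "bs4".
from bs4 import BeautifulSoup

# Parse a tiny document; if the import or the parse fails, the install is broken.
soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p)
```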

If you don’t get any error, then everything went fine and now we can start using the module.

The BeautifulSoup module has two prominent object definitions — BeautifulSoup and BeautifulStoneSoup. We use the first one for HTML parsing and the second one for XML parsing.

So, what are we waiting for? Let’s dive deeper…

I have a web server running, and the URL http://localhost/ serves the PHP Info page (the output of PHP’s phpinfo() function). We will play with that beautiful page in this session.

Let’s fetch the HTML of that page first. We will use urllib.urlopen() to fetch the page and then use the returned object to construct a BeautifulSoup object.
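A rough sketch of that step, assuming Python 3 (where urllib.urlopen() lives at urllib.request.urlopen()) and Beautiful Soup 4. The helper name fetch_soup and the stand-in HTML are made up for illustration, so the example runs even without a local server.

```python
from urllib.request import urlopen  # Python 3 location of urlopen()

from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4


def fetch_soup(url):
    """Download a page and parse it into a soup object (hypothetical helper)."""
    with urlopen(url) as response:
        return BeautifulSoup(response.read(), "html.parser")


# With the local server from this post running, you would call:
# soup = fetch_soup("http://localhost/")

# Stand-in document so the sketch runs without a server.
sample_html = "<html><head><title>phpinfo()</title></head><body></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")
print(soup.title.string)
```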

That’s it — we now have a soup object that we can use to browse the HTML document.

BeautifulSoup has a lot of features, but here I am only going to demonstrate how to find specific tags and extract the data inside them.

From looking at the HTML source of the page, I know that the <td> tags with class="e" contain the names of the PHP settings. To find all tags that have a certain attribute with a fixed value, we use the findAll() method in this way:
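A small sketch of such a query, assuming Beautiful Soup 4, where findAll() is spelled find_all(). The table below is a made-up stand-in for the real phpinfo() page, so the count is 2 rather than the number you would get from a live server.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4

# Made-up fragment shaped like a phpinfo() table.
html = """
<table>
  <tr><td class="e">allow_url_fopen</td><td class="v">On</td></tr>
  <tr><td class="e">display_errors</td><td class="v">Off</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag name plus a dictionary of attribute values, passed by keyword...
by_keyword = soup.find_all("td", attrs={"class": "e"})
# ...or by position, since attrs is the second parameter.
by_position = soup.find_all("td", {"class": "e"})

print(len(by_keyword))
```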

That is, we pass the tag name and a dictionary of its attributes and their values to findAll(), and get back a list of the matching tags. The dictionary is the method’s second parameter, named “attrs”, so you can pass it either by position or explicitly as attrs=. Any other keyword arguments (the so-called **kwargs) are treated as attribute filters too, but because class is a reserved word in Python it cannot be written as a keyword argument, which is why the dictionary form is needed here.

We stored the result in a variable named “list” (this works, although it shadows Python’s built-in list type). The built-in len() function counts its elements: yes, there are 382 tags that match our query.

You can pick out any single tag from the list by its index, like this:
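For instance, a sketch assuming Beautiful Soup 4 and a made-up two-row table in place of the real page. The result of find_all() behaves like an ordinary Python list, so normal indexing applies.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4

# Made-up fragment standing in for the phpinfo() page.
html = '<table><tr><td class="e">allow_url_fopen</td></tr><tr><td class="e">display_errors</td></tr></table>'
cells = BeautifulSoup(html, "html.parser").find_all("td", attrs={"class": "e"})

first = cells[0]   # first matching tag
last = cells[-1]   # negative indexes count from the end, as with any list
print(first)
print(last)
```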

You can get the string inside a tag using the “string” attribute in this way:
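A short sketch, again assuming Beautiful Soup 4 and a made-up cell instead of the real page:

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4

soup = BeautifulSoup('<td class="e">allow_url_fopen</td>', "html.parser")
cell = soup.find_all("td", attrs={"class": "e"})[0]

# .string holds the text directly inside the tag.
name = cell.string
print(name)
```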

Yeah, BeautifulSoup converts strings into Unicode by default. You can override this behaviour, but I am not going to cover that here.

Remember, if a tag has other tags nested inside it, the “string” attribute of the parent tag may come back as None instead of a string.
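To see the difference, here is a sketch assuming Beautiful Soup 4, with two made-up cells: one holding plain text and one mixing a nested tag with text. The get_text() call is the Beautiful Soup 4 way to collect all the text regardless of nesting.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4

soup = BeautifulSoup("<td>plain</td><td><b>bold</b> and more</td>", "html.parser")
plain, mixed = soup.find_all("td")

plain_string = plain.string    # a single text child, so we get the text
mixed_string = mixed.string    # a <b> tag plus text: no single string, so None
mixed_text = mixed.get_text()  # gathers the text of all descendants

print(plain_string, mixed_string, mixed_text)
```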

For more details about this super cool module, please read its documentation. It’s very user friendly, easy to understand and, of course, extremely informative.

I really love BeautifulSoup and Python! 🙂

Categories
PHP

PHP Multi Threading :)

Though it is not available under Apache when serving web pages, we do have a form of multi-threading in PHP 🙂

The “pcntl” extension makes this possible in the PHP command line interpreter. Strictly speaking, it forks separate processes rather than creating threads, and you can use it only from the command line. Here’s a code snippet:

The above example first calls the pcntl_fork() function, which creates another process with the same data. That is, the current execution state is copied into a new process. Both processes then continue from the same point, so from here on we have to tell the two processes apart and assign a different task to each.

We differentiate the processes by using the return value of the pcntl_fork() function. If the function succeeds in creating a new process, it returns once in each process, with a different value in each. If it fails, it returns a single value in the original process, which is then the only one running. On success, we have two processes running at the same time, both executing the same PHP script, but the return value of pcntl_fork() differs between them. So we should add some code to the script that determines which process is executing it and acts accordingly.

The return value of the function can be one of three kinds:
— A Process ID
— 0 (Zero)
— (-1) (Negative One)

If the process that’s executing the script is the child process, it gets the return value 0. And the parent process gets the process ID of the child. (-1) means the forking failed.

In the above example, we used the posix_getpid() and posix_getppid() functions to retrieve the process ID of the current process and of its parent.
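The same pattern can be sketched in Python, offered here only as a rough analogue of the PHP calls and assuming a Unix-like system. os.fork() plays the role of pcntl_fork(), and os.getpid() and os.getppid() correspond to posix_getpid() and posix_getppid(). One difference: Python raises OSError when forking fails instead of returning -1. The function name fork_once is made up for illustration.

```python
import os
import sys


def fork_once():
    """Fork once; return (child_pid, child_exit_code) in the parent."""
    sys.stdout.flush()  # keep already-buffered output from being copied into the child
    try:
        pid = os.fork()
    except OSError as exc:
        # Python's counterpart of pcntl_fork() returning -1.
        raise SystemExit("fork failed: %s" % exc)

    if pid == 0:
        # Child process: fork returned 0.
        try:
            print("child: pid=%d, parent pid=%d" % (os.getpid(), os.getppid()), flush=True)
        finally:
            os._exit(0)  # leave right away so the child never runs the parent's code

    # Parent process: fork returned the child's process ID.
    _, status = os.waitpid(pid, 0)
    return pid, os.WEXITSTATUS(status)


child_pid, exit_code = fork_once()
print("parent: pid=%d, child pid=%d, child exit code=%d" % (os.getpid(), child_pid, exit_code))
```

The parent waits for the child with os.waitpid() so that the child does not linger as a zombie process.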

Forking is a bit tough to grasp, and it took me an hour of wandering in total wilderness to understand how it really works. To be honest, the PHP manual is not as helpful for this extension as it is for other PHP functionality.

Categories
PHP Python

Creating Dictionaries from Arrays / Lists

PHP:

In PHP, we can create a new associative array or dictionary by using the array_combine() function. Just feed two arrays as the parameters to this function and you get back a dictionary. The first array elements are converted into keys and the second array elements are used as values.

Python:

For Python, we first have to zip() the two lists into a list of tuples, each holding one element from each list. Then we call dict() on this newly created list to create a dictionary.
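Step by step, that looks roughly like this. The sample lists are made up, and the list() call is there because zip() in Python 3 returns a lazy iterator rather than a list.

```python
keys = ["one", "two", "three"]
values = [1, 2, 3]

# Step 1: pair the elements up, one from each list.
pairs = list(zip(keys, values))
print(pairs)

# Step 2: turn the list of (key, value) tuples into a dictionary.
lookup = dict(pairs)
print(lookup)
```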

I spelled out the steps in the above example for beginners, but you’d often see experienced Python programmers do it like this:
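A one-line sketch of that shorter form, using the same made-up sample lists:

```python
keys = ["one", "two", "three"]
values = [1, 2, 3]

# zip() and dict() combined in a single expression.
lookup = dict(zip(keys, values))
print(lookup)
```

If the lists differ in length, zip() stops at the shorter one, so the extra elements are silently dropped.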