I have been looking for a good Python library for handling HTML and XML. I knew about BeautifulSoup but never cared much about it. But this time, when I was looking for a way to scrape websites and harvest links using Python, I came across a nice tutorial that demonstrated the wonderful use of the BeautifulSoup module. I was amazed and decided to try it myself. It's just amazing! In fact, it's a shame if you have worked with Python but haven't used this module. I later came to know that BeautifulSoup is very well known in the Python world.
How to use it?
First, download the package and extract the archive. Then install it with the following command:
sudo python setup.py install
Once you install it, you can check that everything is okay by typing this into the interactive Python shell:
>>> from BeautifulSoup import BeautifulSoup
If you don't get an error, everything went fine and we can start using the module.
The BeautifulSoup module provides two prominent classes: BeautifulSoup and BeautifulStoneSoup. We use the first for HTML parsing and the second for XML parsing.
So, what are we waiting for? Let's dive deeper…
I have a webserver running, and the URL http://localhost/ serves the PHP info page (the output of PHP's phpinfo() function). We will play with that beautiful page in this session.
Let's fetch the HTML of that page first. We will use urllib.urlopen() to fetch the page and then use the returned object to construct a BeautifulSoup object.
>>> import urllib
>>> from BeautifulSoup import BeautifulSoup
>>> page = urllib.urlopen("http://localhost/")
>>> soup = BeautifulSoup(page)
That’s it — we now have a soup object that we can use to browse the HTML document.
BeautifulSoup has a lot of features, but here I am going to demonstrate how to find a particular tag and extract the data inside it.
From looking at the HTML source of the page, I know that the <td> tags with class="e" contain data about PHP settings. To find all tags that have a certain attribute with a fixed value, we use the findAll() method in this way:
>>> list = soup.findAll('td', attrs={"class": "e"})
>>> len(list)
382
>>>
That is, we pass the tag name and a dictionary mapping attribute names to values to the findAll() method, and it returns a list of the matching tags. Many attributes can also be passed directly as keyword arguments (for example, findAll('td', id='something')), but "class" is a reserved word in Python, so it cannot be used as a keyword argument; that is why we wrap it in the attrs dictionary instead.
We store the result as "list" (a poor choice of name, incidentally, since it shadows Python's built-in list type). The built-in len() function counts the elements: yes, 382 tags match our query.
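The keyword-argument mechanics can be sketched in plain Python. The find_all function below is a hypothetical toy, not BeautifulSoup's real code; it only shows why a dictionary is needed for the "class" attribute:

```python
# A toy stand-in for findAll() to illustrate the calling convention --
# this is NOT BeautifulSoup's real implementation, just a sketch of how
# an attribute filter can arrive either as a dict or as keyword arguments.
def find_all(name, attrs=None, **kwargs):
    filters = dict(attrs or {})
    filters.update(kwargs)   # extra keyword arguments become filters too
    return (name, filters)

# Ordinary attribute names work fine as keyword arguments...
print(find_all('td', id='main'))             # ('td', {'id': 'main'})

# ...but find_all('td', class='e') is a SyntaxError, because "class" is
# a reserved word in Python.  The attrs dictionary sidesteps that:
print(find_all('td', attrs={'class': 'e'}))  # ('td', {'class': 'e'})
```

Either way, the method ends up with one merged set of attribute filters to match against.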
You can also grab the first occurrence of any tag directly, as an attribute of the soup object:
>>> soup.h2
<h2>PHP Core</h2>
>>> soup.title
<title>phpinfo()</title>
>>>
You can get the string inside a tag using the “string” attribute in this way:
>>> soup.title.string
u'phpinfo()'
>>>
Yeah, BeautifulSoup converts strings to Unicode by default. You can override this behaviour, but I am not going to cover that here.
Remember, if a tag contains other tags nested inside it rather than a single string, its "string" attribute returns None instead of the text.
For more details about this super cool module, please read its documentation. It's very user-friendly, easy to understand and, of course, extremely informative.
I really love BeautifulSoup and Python ! 🙂