Building a Proxy Scraper with 15 Lines of Python

Yes, Python is great! It’s beautiful and so on. I have described the power of Python many times, so for now, just the code :) Here’s a proxy scraper I built a few moments ago. It fetches a page from proxy-hunter.blogspot.com and prints the open proxies listed there, one per line.

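#!/usr/bin/env python
# Written for Python 2 and BeautifulSoup 3.

import re
import urllib

from BeautifulSoup import BeautifulSoup as Soup

url = 'http://proxy-hunter.blogspot.com/2010/03/18-03-10-speed-l1-hunter-proxies-310.html'
document = urllib.urlopen(url)
tree = Soup(document.read())

# Match text that starts with an IPv4 address and a port, e.g. 10.0.0.1:8080.
# The dots are escaped and every octet may have one to three digits.
regex = re.compile(r'^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d{2,5})')

# Search the page for text that matches the pattern.
proxylist = tree.findAll(attrs={"class": "Apple-style-span", "style": "color: black;"}, text=regex)

# The first match holds the whole list, one proxy per line.
data = proxylist[0]
for line in data.split('\n'):
    print line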

The script uses the BeautifulSoup package for parsing HTML. On Ubuntu, install it with this command:

sudo apt-get install python-beautifulsoup

On other platforms, grab the package from its homepage. Google is there to find the URL for you ;-)
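
If you want to play with the matching logic without hitting the network or installing anything, here is a rough sketch of the same idea. Note the assumptions: it targets Python 3 and swaps BeautifulSoup for the standard library's html.parser module, so it is not the script above. The names SpanTextCollector, is_proxy, extract_proxies and SAMPLE_HTML are made up for illustration, and the sample addresses are invented rather than taken from the real page. It also checks that each octet is at most 255 and the port at most 65535, which the original pattern does not do.

```python
import re
from html.parser import HTMLParser

# Stricter variant of the pattern: escaped dots, 1-3 digits per octet,
# and an end anchor so trailing junk is rejected.
PROXY_RE = re.compile(r'^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d{2,5})$')


def is_proxy(line):
    """Return True if line looks like a valid IPv4 address plus port."""
    match = PROXY_RE.match(line.strip())
    if match is None:
        return False
    octets = [int(group) for group in match.groups()[:4]]
    port = int(match.group(5))
    return all(octet <= 255 for octet in octets) and port <= 65535


class SpanTextCollector(HTMLParser):
    """Collect text found inside <span class="Apple-style-span"> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a matching span
        self.chunks = []  # text pieces seen inside matching spans

    def handle_starttag(self, tag, attrs):
        if tag != 'span':
            return
        # Count nested spans too, so the matching close tag is tracked.
        if self.depth or dict(attrs).get('class') == 'Apple-style-span':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'span' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_proxies(html):
    """Return the proxy lines found in the matching spans of html."""
    parser = SpanTextCollector()
    parser.feed(html)
    parser.close()
    lines = ''.join(parser.chunks).split('\n')
    return [line.strip() for line in lines if is_proxy(line)]


# Invented sample markup, loosely shaped like the page the script targets.
SAMPLE_HTML = """<div><span class="Apple-style-span" style="color: black;">10.0.0.1:8080
192.168.1.20:3128
999.1.1.1:80
not a proxy</span><span class="other">172.16.0.5:80</span></div>"""

proxies = extract_proxies(SAMPLE_HTML)
print(proxies)  # prints ['10.0.0.1:8080', '192.168.1.20:3128']
```

The entry 999.1.1.1:80 is dropped by the range check, and 172.16.0.5:80 is skipped because its span has a different class. To try it on a real page, you would pass the downloaded HTML to extract_proxies instead of SAMPLE_HTML.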
