Yes, Python is great! It's beautiful, and so on… I've written about the power of Python many times; for now, just the code 🙂 Here's a proxy scraper I put together a few moments ago. It fetches a page from proxy-hunter.blogspot.com and lists the open proxies published there.
#!/usr/bin/env python
from BeautifulSoup import BeautifulSoup as Soup
import re, urllib

url = 'http://proxy-hunter.blogspot.com/2010/03/18-03-10-speed-l1-hunter-proxies-310.html'
document = urllib.urlopen(url)
tree = Soup(document.read())

# Match lines that look like IP:port; the dots are escaped so they
# don't match arbitrary characters, and each octet may be 1-3 digits
regex = re.compile(r'^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d{2,5})')

# The proxy list sits in a span with these attributes on the blog page
proxylist = tree.findAll(attrs={"class": "Apple-style-span", "style": "color: black;"},
                         text=regex)
data = proxylist[0]
for x in data.split('\n'):
    print x
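The script above is Python 2 with the old BeautifulSoup 3.x import. On a current Python 3 system the same idea can be sketched with the standard library alone; the HTML snippet below is a made-up stand-in for the blog page, not its real markup:

```python
import re

# Hypothetical snippet standing in for the fetched blog page
html = """<span class="Apple-style-span" style="color: black;">
123.45.67.89:8080
98.76.54.32:3128
not a proxy line
</span>"""

# Same pattern idea as the scraper: an IP:port pair on its own line.
# re.MULTILINE makes ^ and $ anchor at every line, not just the string ends.
proxy_re = re.compile(r'^(\d{1,3}(?:\.\d{1,3}){3}):(\d{2,5})$', re.MULTILINE)

proxies = proxy_re.findall(html)
for host, port in proxies:
    print('%s:%s' % (host, port))
```

With real pages you would fetch the HTML with `urllib.request.urlopen` first; the regex does the actual proxy extraction either way.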
It uses the BeautifulSoup package to parse the HTML. On Ubuntu, install it with:
sudo apt-get install python-beautifulsoup
On other platforms, grab the package from its homepage; Google will find the URL for you 😉
2 replies on “Building A Proxy Scraper with 15 lines of Python”
Smart! Personally, I use a ProxyHarvester to scrape fresh proxies. Each day I get around 30,000-40,000 proxies.
If interested: http://www.rapidformfiller.com/proxyharvester
thank you for being straightforward and clear and real — or at least, so far. awesome.