Yes, Python is great! It's beautiful, and so on… I've written about the power of Python many times; for now, just the code 🙂 Here's a proxy scraper I put together a few moments ago. It fetches a page from proxy-hunter.blogspot.com and lists the open proxies published there.
#!/usr/bin/env python
from BeautifulSoup import BeautifulSoup as Soup
import re, urllib

url = 'http://proxy-hunter.blogspot.com/2010/03/18-03-10-speed-l1-hunter-proxies-310.html'
document = urllib.urlopen(url)
tree = Soup(document.read())

# Match lines that look like IP:port; the dots are escaped so they
# don't match arbitrary characters, and each octet may be 1-3 digits
regex = re.compile(r'^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d{2,5})')

# The proxy list sits in a span with these attributes on the blog page
proxylist = tree.findAll(attrs={"class": "Apple-style-span", "style": "color: black;"},
                         text=regex)
data = proxylist[0]
for x in data.split('\n'):
    print x
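The script above is Python 2 with the old BeautifulSoup 3.x import. On a current Python 3 system the same idea can be sketched with the standard library alone; the HTML snippet below is a made-up stand-in for the blog page, not its real markup:

```python
import re

# Hypothetical snippet standing in for the fetched blog page
html = """<span class="Apple-style-span" style="color: black;">
123.45.67.89:8080
98.76.54.32:3128
not a proxy line
</span>"""

# Same pattern idea as the scraper: an IP:port pair on its own line.
# re.MULTILINE makes ^ and $ anchor at every line, not just the string ends.
proxy_re = re.compile(r'^(\d{1,3}(?:\.\d{1,3}){3}):(\d{2,5})$', re.MULTILINE)

proxies = proxy_re.findall(html)
for host, port in proxies:
    print('%s:%s' % (host, port))
```

With real pages you would fetch the HTML with `urllib.request.urlopen` first; the regex does the actual proxy extraction either way.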
It uses the BeautifulSoup package to parse the HTML. On Ubuntu, install it with:
sudo apt-get install python-beautifulsoup
On other platforms, grab the package from its homepage; Google will find the URL for you 😉
2 replies on “Building A Proxy Scraper with 15 lines of Python”
Smart! Personally, I use a ProxyHarvester to scrape fresh proxies. Each day I get around 30,000-40,000 proxies.
If interested: http://www.rapidformfiller.com/proxyharvester
thank you for being straightforward and clear and real — or at least, so far. awesome.