Twitter allows us to download an archive of our tweets from the account settings page. Once we request the archive, Twitter takes some time to prepare it and sends us an email with a download link when it is ready. After unpacking the archive, we will find a CSV file that contains our tweets – `tweets.csv`. The archive also contains an HTML page (`index.html`) that displays our tweets in a nice UI. While this is nice to look at, our primary objective is to extract the links from our tweets.
If we look at the CSV file closely, we will find a field named `expanded_urls` which generally contains the URLs we use in our tweets. We will work with the values in this field. Along with each URL, we also want to fetch the page title. For this we will use Python 3 (I am using 3.5), and we need the `requests` and `beautifulsoup4` packages to download and parse the pages. Let’s install them:
```
pip install requests
pip install beautifulsoup4
```
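Before wiring everything together, here is a minimal sketch of how the two packages cooperate: `requests` downloads a page and Beautiful Soup pulls the `<title>` tag out of the HTML. The URL here is just a placeholder.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL -- substitute any page you want to inspect
url = "https://example.com"

resp = requests.get(url)
if resp.status_code == 200:
    soup = BeautifulSoup(resp.content, "html.parser")
    if soup.title:  # some pages have no <title> tag at all
        print(soup.title.string)
```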
We will follow these steps to extract links and their page titles from the tweets:
- Open the CSV file and read it row by row
- Each row contains a tweet; we take the `expanded_urls` field
- This field can contain multiple URLs separated by commas, so we need to iterate over them all
- We skip some domains – for example, we don’t want to visit links to Twitter status updates
- We fetch the HTML content using the `requests` library; if the page doesn’t return HTTP 200, we ignore the response
- We extract the title using Beautiful Soup and display it
Now let’s convert these steps into code. Here’s the final script I came up with:
```python
import csv

import requests
from bs4 import BeautifulSoup

DOMAINS_TO_SKIP = ['twitter.com']

with open('tweets.csv', 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    # each row is a tweet
    for row in reader:
        # default to '' so a missing field doesn't crash .split()
        url_string = row.get('expanded_urls', '')
        urls = url_string.split(",")
        for url in urls:
            # Skip the domains we don't want to visit
            skip = False
            for domain in DOMAINS_TO_SKIP:
                if domain in url:
                    skip = True
                    break

            # fetch the title
            if url and not skip:
                print("Crawling: {}".format(url))
                resp = requests.get(url)
                if resp.status_code == 200:
                    soup = BeautifulSoup(resp.content, "html.parser")
                    if soup.title:
                        print("Title: {}".format(soup.title.string))
```
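One caveat: `requests.get` raises an exception on dead or malformed URLs and can hang on slow servers, either of which would stop the whole import. Here is a hedged sketch of a more defensive fetch – the `fetch_title` helper and the 10-second timeout are my own additions, not part of the script above:

```python
import requests
from bs4 import BeautifulSoup


def fetch_title(url):
    """Return the page title for a URL, or None if anything goes wrong.

    (A hypothetical helper, not part of the original script.)
    """
    try:
        # the timeout keeps one slow server from stalling the whole run
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return None
    if resp.status_code != 200:
        return None
    soup = BeautifulSoup(resp.content, "html.parser")
    return soup.title.string if soup.title else None
```

A helper like this could replace the inline fetch in the loop, so one bad link merely prints nothing instead of crashing the run.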
I am actually using this for a personal project of mine – https://github.com/masnun/bookmarks – it’s basically a bare-bones Django admin app where I intend to store the links I visit or share. I come across a lot of interesting projects, articles, and videos and then later lose track of them; I hope this app will remedy that. This piece of code is part of the Twitter import functionality of that app.