Web Scraping with Pandas and Beautifulsoup

APIs are not always available. Sometimes you have to scrape data from a webpage yourself. Luckily the modules Pandas and Beautifulsoup can help!

Web scraping

Pandas has a neat concept known as a DataFrame. A DataFrame can hold data and be easily manipulated. We can combine Pandas with Beautifulsoup to quickly get data from a webpage.

If you find a table on the web like this:

We can convert it to JSON with:

 
import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.nationmaster.com/country-info/stats/Media/Internet-users")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0] 
df = pd.read_html(str(table))
print(df[0].to_json(orient='records'))

And in a browser get the beautiful json output:

Converting to lists

Rows can be converted to Python lists.
We can convert it to a dataframe using just a few lines:

 
import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.nationmaster.com/country-info/stats/Media/Internet-users")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0] 
df = pd.read_html(str(table))[0]
countries = df["COUNTRY"].tolist()
users = df["AMOUNT"].tolist()

Pretty print pandas dataframe

You can convert it to an ascii table with the module tabulate.
This code will instantly convert the table on the web to an ascii table:

 
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

res = requests.get("http://www.nationmaster.com/country-info/stats/Media/Internet-users")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0] 
df = pd.read_html(str(table))
print( tabulate(df[0], headers='keys', tablefmt='psql') )

This will show in the terminal as:

Download web scraping examples

import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

# get html using selenium webdriver, supports javascript
browser = webdriver.PhantomJS()
browser.get( "https://azure.microsoft.com/en-us/pricing/details/storage/blobs/" )
html_source = browser.page_source

# parse
soup = BeautifulSoup(html_source,'lxml')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))
print(df[0].to_json(orient='records'))