Python

Tag: beautifulsoup

Web Scraping with Pandas and Beautifulsoup

APIs are not always available. Sometimes you have to scrape data from a webpage yourself. Luckily the modules Pandas and Beautifulsoup can help!

Related Course: Complete Python Bootcamp: Go from zero to hero in Python

Web scraping

Pandas has a neat concept known as a DataFrame. A DataFrame can hold data and be easily manipulated. We can combine Pandas with Beautifulsoup to quickly get data from a webpage.

If you find a table on the web like this:

world internet users

We can convert it to JSON with:

 
import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.nationmaster.com/country-info/stats/Media/Internet-users")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))
print(df[0].to_json(orient='records'))

And in a browser get the beautiful json output:
pandas to json

 

Converting to lists

Rows can be converted to Python lists.
We can convert it to a dataframe using just a few lines:

 
import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.nationmaster.com/country-info/stats/Media/Internet-users")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))[0]
countries = df["COUNTRY"].tolist()
users = df["AMOUNT"].tolist()

Pretty print pandas dataframe

You can convert it to an ascii table with the module tabulate.
This code will instantly convert the table on the web to an ascii table:

 
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

res = requests.get("http://www.nationmaster.com/country-info/stats/Media/Internet-users")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))
print( tabulate(df[0], headers='keys', tablefmt='psql') )


This will show in the terminal as:
pretty print panda dataframe

Download web scraping examples