Web Scraping with Pandas and BeautifulSoup

APIs are not always available. Sometimes you have to scrape the data from a webpage yourself. Luckily, the Pandas and BeautifulSoup modules can help!

Web scraping

Pandas provides the DataFrame, a tabular structure that holds data and is easy to manipulate. Combined with BeautifulSoup, it lets us quickly pull data out of a webpage.

If you find a table on the web like this:

[Image: table of world internet users]

We can convert it to JSON with:
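A minimal sketch of that conversion, assuming a small inline HTML table stands in for the downloaded page (the country names and numbers are made-up sample data, and table_to_dataframe is a helper name chosen for this example). For a live page, the HTML string would come from an HTTP download instead.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in for the downloaded page; the rows are made-up sample data.
HTML = """
<table>
  <tr><th>Country</th><th>Users</th></tr>
  <tr><td>Aland</td><td>100</td></tr>
  <tr><td>Borduria</td><td>250</td></tr>
</table>
"""


def table_to_dataframe(html):
    """Parse the first <table> in html into a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    # Collect the text of every header/data cell, one list per row
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    header, body = rows[0], rows[1:]
    return pd.DataFrame(body, columns=header)


df = table_to_dataframe(HTML)
# orient="records" emits one JSON object per table row
json_text = df.to_json(orient="records")
print(json_text)
```

With orient="records" every table row becomes one JSON object keyed by the column headers, which is usually the most convenient shape for a web front end.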

Viewed in a browser, the JSON output looks like this:
[Image: pandas to JSON output]

Converting to lists

The table rows can be converted to Python lists.
The table itself can be converted to a DataFrame in just a few lines:
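As a sketch of both directions, assuming the same kind of inline sample table as a stand-in for the real page (the data is invented for illustration): BeautifulSoup yields one list per row, Pandas turns those lists into a DataFrame, and the DataFrame can be turned back into plain lists.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up sample table standing in for the scraped page
HTML = """
<table>
  <tr><th>Country</th><th>Users</th></tr>
  <tr><td>Aland</td><td>100</td></tr>
  <tr><td>Borduria</td><td>250</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# One Python list per table row, header row first
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in soup.find("table").find_all("tr")
]
header, records = rows[0], rows[1:]

df = pd.DataFrame(records, columns=header)
# Scraped cell text is always str, so cast the numeric column explicitly
df["Users"] = pd.to_numeric(df["Users"])

# Back to plain lists: one list per row, or one list per column
row_lists = df.values.tolist()
users = df["Users"].tolist()

print(df)
print(row_lists)
print(users)
```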

Pretty print pandas dataframe

You can render the DataFrame as an ASCII table with the tabulate module.
This code converts the table from the web into an ASCII table:

This will show in the terminal as:
[Image: pretty-printed pandas DataFrame in the terminal]

12 Replies to “Web Scraping with Pandas and BeautifulSoup”

  1. This is very helpful. I am also looking for a way to convert text or a paragraph into a table or graph. E.g., “market share is 20% in 2016” should produce a pie chart or a plain table. Any leads on approaching this would be helpful.

    • The price depends on the region and is filled in by JavaScript. BeautifulSoup works well with static HTML, but this data needs JavaScript support. Selenium handles JavaScript well.

      To get the data we can use the selenium module, which works with a webdriver. We use the Chrome driver or PhantomJS to get the raw HTML, then process it as usual. Install selenium and a webdriver first.

      Working code below:
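A minimal sketch of this approach, assuming Selenium with a local Chrome installation. The function names, the example URL, and the ".price" selector are placeholders invented for this sketch, since the page in question is not named here.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url):
    """Load url in headless Chrome and return the HTML after scripts have run."""
    # Imported here so the parsing helper below works without Selenium installed
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails
        driver.quit()


def extract_texts(html, css_selector):
    """Return the stripped text of every element matching css_selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select(css_selector)]


# Hypothetical usage; the URL and the ".price" selector are placeholders:
# html = fetch_rendered_html("https://example.com/product")
# print(extract_texts(html, ".price"))
```

If the page fills in the value asynchronously, an explicit wait may be needed before reading page_source.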

  2. Is it possible to read graphs, bar charts, or complex diagrams and create alt text using Python? If I’m using a PDF, can we do this? And if we have standalone images, can we do the same for those? Please share your thoughts and advice. Thanks

    • Yes, it’s possible to parse diagrams. If they’re on the web you may be able to get the underlying data as JSON; otherwise you need a computer vision algorithm to parse them. Standalone images would also have to be parsed with a vision algorithm.