Get links from webpage

Do you want to scrape links?

The module urllib.request (called urllib2 in Python 2) can be used to download webpage data. Webpages are typically delivered as HTML.

To work with the HTML data, we use BeautifulSoup, a Python module for parsing webpages (HTML).

 
Related: Web scraping with Pandas and Beautifulsoup

Get all links from a webpage

All of the links will be returned as a list, like so:

['//slashdot.org/faq/slashmeta.shtml', … , 'mailto:[email protected]', '#', '//slashdot.org/blog', '#', '#', '//slashdot.org']

We scrape a webpage with these steps:

  • download the webpage data (HTML)
  • create a BeautifulSoup object and parse the webpage data
  • use the soup's findAll method to find all links by their a tag
  • store all links in a list

To get all links from a webpage:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

req = Request("http://slashdot.org")
html_page = urlopen(req)

soup = BeautifulSoup(html_page, "lxml")

links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))

print(links)

How does it work?

These two lines download the webpage data (the HTML markup of the page):

req = Request("http://slashdot.org")
html_page = urlopen(req)
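
Some sites may reject requests that carry urllib's default User-Agent, and a stalled connection can block the script. Below is a minimal sketch of a more defensive download. The User-Agent string, the 10-second timeout, and the helper names build_request and fetch_html are assumptions made for illustration and are not part of the original example.

```python
from urllib.request import Request, urlopen

# Hypothetical User-Agent value, chosen only for illustration.
USER_AGENT = "link-scraper-example/0.1"

def build_request(url):
    """Create a Request that identifies the client with a User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})

def fetch_html(url, timeout=10):
    """Download the page and return the raw bytes of the response body."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

The bytes returned by fetch_html("http://slashdot.org") can be passed to BeautifulSoup in place of html_page.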

The next line loads it into a BeautifulSoup object:

soup = BeautifulSoup(html_page, "lxml")

The next code block gets all links using .findAll('a') (spelled find_all in current BeautifulSoup releases), where 'a' is the HTML tag for links.

links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
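
An a tag without an href attribute makes link.get('href') return None, so the list can end up with None entries. Here is a small sketch of one way to skip them. It assumes a hand-written HTML snippet and uses the built-in "html.parser" backend instead of "lxml", so that nothing beyond bs4 has to be installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet used only for this illustration.
sample_html = """
<a href="//slashdot.org/blog">Blog</a>
<a name="top">Anchor without href</a>
<a href="#">Top</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# href=True matches only <a> tags that actually have an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```

With this snippet, the second anchor is left out because it has no href.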

Finally we show the list of links:

print(links)
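
The printed list mixes protocol-relative links (//slashdot.org/blog), fragments (#), and mailto: addresses. One possible way to turn them into absolute web URLs is sketched below with the standard urllib.parse module. The helper name absolute_http_links and the sample input are assumptions for illustration only.

```python
from urllib.parse import urljoin, urlparse

def absolute_http_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep only http(s) URLs."""
    result = []
    for href in hrefs:
        # Skip missing hrefs and pure fragments such as '#'.
        if not href or href.startswith("#"):
            continue
        absolute = urljoin(base_url, href)
        # Drop non-web schemes such as mailto:.
        if urlparse(absolute).scheme in ("http", "https"):
            result.append(absolute)
    return result

# Made-up sample input that mimics the kinds of links shown earlier.
sample = ["//slashdot.org/blog", "#", "mailto:someone@example.com", None, "/faq"]
print(absolute_http_links("http://slashdot.org", sample))
# prints ['http://slashdot.org/blog', 'http://slashdot.org/faq']
```

Whether fragments and mailto: links should be dropped depends on what the links are needed for, so treat the filter as a starting point.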



2 Replies to “Get links from webpage”

  1. Thanks for the super informative content.
    Just to save new users some time: I just spent a couple of minutes trying to get BeautifulSoup working.
    The new version apparently has some bugs in it, and the correct syntax at the beginning is:
    from bs4 import BeautifulSoup.
