Get links from a webpage

The urllib2 module (part of the Python 2 standard library) can be used to download webpage data. Webpage data is usually returned as HTML; to parse it, we use a Python module named BeautifulSoup (imported from the bs4 package in current versions of the library).

Get all links from a webpage
To get all links from a webpage:

from bs4 import BeautifulSoup
import urllib2

html_page = urllib2.urlopen("http://slashdot.org")
soup = BeautifulSoup(html_page, "html.parser")
links = []
for link in soup.find_all('a'):
    links.append(link.get('href'))

print(links)

Explanation
This line downloads the webpage data (the raw HTML) from the given URL:

html_page = urllib2.urlopen("http://slashdot.org")
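
The object returned by urlopen behaves like a file. As a small aside (not part of the original example), you can inspect the response before parsing it:

# The response object exposes the HTTP status code and the final URL.
print(html_page.getcode())  # e.g. 200
print(html_page.geturl())   # final URL after any redirects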

The next line parses it into a BeautifulSoup object; the second argument selects Python's built-in HTML parser:

soup = BeautifulSoup(html_page, "html.parser")
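
Once parsed, the soup object exposes the whole document tree. For instance (a minimal illustration, assuming the page defines a <title> tag):

print(soup.title)         # the <title> tag itself
print(soup.title.string)  # just the title text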

The next block then collects all the links using .find_all('a'), where 'a' is the HTML tag for an anchor (a link); calling .get('href') on each anchor extracts the URL it points to:

links = []
for link in soup.find_all('a'):
    links.append(link.get('href'))
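
Note that .get('href') returns None for anchor tags that have no href attribute. As an optional extension (not part of the original example), the list can be filtered down to absolute http(s) URLs with a regular expression:

import re

# Drop None entries and keep only absolute http:// or https:// links.
external_links = [l for l in links if l and re.match(r'^https?://', l)]
print(external_links)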

Finally, we print the list of links:

print(links)
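
In Python 3, urllib2 was merged into urllib.request, and BeautifulSoup is installed as the bs4 package. A sketch of the equivalent script (same behavior, modern imports):

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the page and collect every href attribute from the <a> tags.
html_page = urlopen("http://slashdot.org")
soup = BeautifulSoup(html_page, "html.parser")
links = [link.get('href') for link in soup.find_all('a')]
print(links)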
