Do you want to scrape links?
The urllib2 module (urllib.request in Python 3) can be used to download webpage data. Webpage data is usually formatted as HTML.
To work with HTML data, we use BeautifulSoup, a Python module for parsing webpages (HTML).
All of the links will be returned as a list, like so:
['//slashdot.org/faq/slashmeta.shtml', …, 'mailto:firstname.lastname@example.org', '#', '//slashdot.org/blog', '#', '#', '//slashdot.org']
We scrape a webpage with these steps:
- download the webpage data (HTML)
- create a BeautifulSoup object, which parses the webpage data
- use the soup's findAll method to find all links by the a tag
- store all links in a list
To get all links from a webpage:
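A minimal sketch of these steps, assuming Python 3 (where `urllib.request` replaces `urllib2`) and the `bs4` package (where `find_all` is the current spelling of `findAll`). The function name `get_links` and the choice of the built-in `html.parser` are assumptions made for illustration.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def get_links(url):
    """Return the href of every <a> tag found at url (hypothetical helper)."""
    # Step 1: download the webpage data (raw bytes of HTML).
    with urlopen(url) as response:
        html_page = response.read()

    # Step 2: create a BeautifulSoup object, which parses the HTML.
    soup = BeautifulSoup(html_page, "html.parser")

    # Steps 3 and 4: find every <a> tag and store its href in a list.
    links = []
    for link in soup.find_all("a"):
        # .get returns None when an <a> tag has no href attribute.
        links.append(link.get("href"))
    return links
```

Calling `print(get_links("https://slashdot.org"))` should print a list in the same form as the sample above. The URL is an assumption based on the links in that sample.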
This line downloads the webpage data (the raw HTML markup):
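With urllib2 in Python 2 the download is typically a single call, `urllib2.urlopen(url).read()`. A rough Python 3 equivalent is sketched here; the helper name `download_html` and the UTF-8 fallback are assumptions, not part of the original code.

```python
from urllib.request import urlopen


def download_html(url):
    """Download url and return its body as text (hypothetical helper)."""
    with urlopen(url) as response:
        raw_bytes = response.read()
        # Use the charset declared in the response headers, else assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw_bytes.decode(charset, errors="replace")
```

Decoding is optional for the later steps, since BeautifulSoup also accepts raw bytes, but a text string is easier to inspect while debugging.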
The next line loads it into a BeautifulSoup object:
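As a small illustration of this step, the sketch below parses a hard-coded page instead of downloaded data. The sample HTML and the `html.parser` argument (Python's built-in parser) are assumptions chosen so the snippet runs without a network connection.

```python
from bs4 import BeautifulSoup

# A small hard-coded page stands in for the downloaded data (assumption).
html_page = (
    "<html><head><title>Demo</title></head>"
    "<body><a href='#'>top</a></body></html>"
)

# The HTML is parsed when the BeautifulSoup object is created.
soup = BeautifulSoup(html_page, "html.parser")

print(soup.title.string)  # prints: Demo
```

Naming the parser explicitly keeps the result the same from one machine to the next; without it, BeautifulSoup picks whichever parser happens to be installed.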
The next block then gets all links using .findAll('a'), where 'a' is the HTML tag for links.
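A sketch of that loop on a hard-coded snippet, assuming the `bs4` package, where `find_all` is the current name and `findAll` remains as an older alias. The sample HTML is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; the third tag deliberately has no href.
html_page = """
<a href="//slashdot.org/blog">Blog</a>
<a href="#">Top</a>
<a name="anchor-without-href">No link</a>
"""
soup = BeautifulSoup(html_page, "html.parser")

links = []
for link in soup.find_all("a"):
    # .get returns None instead of raising when the attribute is missing.
    links.append(link.get("href"))

print(links)  # prints: ['//slashdot.org/blog', '#', None]
```

To skip tags that have no href at all, `soup.find_all("a", href=True)` can be used instead.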
Finally we show the list of links:
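In Python 2 this would be the statement `print links`; in Python 3, `print` is a function. The sketch below uses a few values from the sample output earlier in this article as stand-in data.

```python
# Stand-in data taken from the sample output shown earlier.
links = ["//slashdot.org/faq/slashmeta.shtml", "#", "//slashdot.org/blog"]

# Printing the list object gives the bracketed form.
print(links)

# Printing one link per line is often easier to read.
for link in links:
    print(link)
```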