Web Scraping Using the Selenium and BeautifulSoup Libraries in Python

Scraping Data from various websites


Data is the fuel that runs the data science life cycle, so web scraping is one of the most important skills needed to become a good data scientist.

In delivering a data science project, a data scientist spends roughly 19% of their time collecting data from various sources.

Websites hold data on customers, products, people, stock markets, and more, and pulling that data in can give a project a real head start.

Instead of looking at a site every day, you can use Python to automate the repetitive parts. Automating web scraping is an effective way to speed up the data collection process.

Python Web Scraping Tools For Data Scientists

  • Beautiful Soup.
  • Scrapy.
  • Selenium.
  • Python Requests.
  • LXML.
  • MechanicalSoup.
  • Urllib.

Beautiful Soup, Scrapy, Selenium, and Python Requests are among the most commonly used Python libraries for web extraction.

In this article, we will be focusing on Beautiful Soup and Selenium.

Beautiful Soup

Beautiful Soup is a Python package for parsing HTML and XML documents. It is available for Python 2.7 and Python 3.

Beautiful Soup is named after so-called ‘tag soup’, which refers to “syntactically or structurally incorrect HTML written for a web page”.

Installing BeautifulSoup

The Python package manager pip or the Anaconda package manager can be used to install BeautifulSoup. Open the terminal and run this command:
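As a sketch, assuming Python 3 and pip are on your PATH, the package can be installed like this (requests, used later in this article, is included as well):

```shell
# Install Beautiful Soup 4 and requests with pip
pip install beautifulsoup4 requests

# Or, with the Anaconda package manager
conda install beautifulsoup4 requests
```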

Importing necessary libraries

In your code editor, import the Python library BeautifulSoup.

Import BeautifulSoup and requests with the following commands:

The requests module allows you to send HTTP requests using Python.
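A minimal import block might look like this (assuming Beautiful Soup 4, where the parser class lives in the bs4 package):

```python
# bs4 is the package name for Beautiful Soup 4
from bs4 import BeautifulSoup

# requests sends HTTP requests and returns Response objects
import requests
```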

Inspect a website

In this post we will scrape the “content” and “see also” sections from an arbitrary Wikipedia article.

Select the item on the website that you want to scrape and press Ctrl+Shift+I, or right-click and select Inspect. Find the tags and attributes related to the item. The same tag may be shared by many elements, but at least one of the class name, id, or title will be different for every element.

Parsing the HTML

Now we need the URL of the website, here Wikipedia. To connect to the website and fetch the HTML, we will use requests.

The variable data now holds the response from the website, an object of type requests.models.Response.
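As a sketch, using an arbitrary Wikipedia article (the “Web scraping” page here is only an example URL, not one fixed by this article):

```python
import requests

# Any Wikipedia article works; this URL is an example
url = "https://en.wikipedia.org/wiki/Web_scraping"

# GET the page; data is a requests.models.Response object
data = requests.get(url)

print(type(data))  # <class 'requests.models.Response'>
```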

The data object then needs to be passed to BeautifulSoup to parse the HTML.
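Continuing the sketch, the response text can be handed to BeautifulSoup together with Python's built-in html.parser (the URL is again just an example):

```python
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Web_scraping"  # example article
data = requests.get(url)

# Parse the raw HTML into a navigable tree
soup = BeautifulSoup(data.text, "html.parser")

print(soup.title.string)  # the page's <title> text
```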

Find the tags and attributes related

Links

Next, we find the relevant links on the Wikipedia page. Here, consider the links highlighted in the orange bracket:
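One possible sketch: all anchor tags with an href attribute can be collected with find_all, then filtered down to internal article links (the URL and the /wiki/ filter are illustrative assumptions, not this article's exact code):

```python
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Web_scraping"  # example article
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Collect every <a> tag that carries an href attribute
links = [a["href"] for a in soup.find_all("a", href=True)]

# Keep only internal article links such as /wiki/Data_scraping
article_links = [h for h in links if h.startswith("/wiki/")]

print(article_links[:5])
```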