Data is the fuel that drives the data science life cycle, so web scraping is one of the most important skills for becoming a good data scientist.
In delivering a data science project, a data scientist spends 19% of their time collecting data from various sources.
Data from websites can give a project a lift, since data on customers, products, people, and stock markets is available online.
Instead of looking at the site every day, you can use Python to help automate the repetitive parts. Automating web scraping is one way to speed up the data collection process.
Python Web Scraping Tools For Data Scientists
- Beautiful Soup.
- Python Requests.
Beautiful Soup, Scrapy, Selenium, and Python Requests are among the most commonly used Python libraries for web extraction.
In this article, we will be focusing on Beautiful Soup and Selenium.
Beautiful Soup is a Python package for parsing HTML and XML documents. It is available for Python 2.7 and Python 3.
Either the Python package manager pip or the Anaconda package manager can be used to install BeautifulSoup. Open the terminal and run one of these commands:
conda install beautifulsoup4
pip install beautifulsoup4
Importing necessary libraries
In the code editor, import BeautifulSoup and requests with the following commands:
import requests
from bs4 import BeautifulSoup as bs4
The requests module allows you to send HTTP requests using Python.
Inspect a website
In this post we will scrape the “content” and “see also” sections from an arbitrary Wikipedia article.
Select the item on the website that you want to scrape and press Ctrl+Shift+I, or right-click and select Inspect. Find the tags and attributes related to the item. Many elements share the same tag, but an attribute such as the class name, id, or title usually helps tell the target element apart from the rest.
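To see why the attributes matter, here is a minimal sketch using a made-up HTML fragment. The markup, the id `toc`, and the class names are hypothetical and are not taken from Wikipedia's actual page source. Several elements share the `div` tag, so the tag alone is ambiguous, while an id or class narrows the match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; a real page will differ.
html = """
<div id="toc" class="box">Contents</div>
<div class="box note">A note</div>
<div class="footer">Footer text</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The tag name alone matches every <div> in the fragment.
all_divs = soup.find_all("div")

# An id is meant to be unique within a page, so it pins down one element.
toc = soup.find("div", id="toc")

# A class can be shared; class_ matches any element carrying that class.
boxes = soup.find_all("div", class_="box")

print(len(all_divs), toc.get_text(), len(boxes))
```

In practice the values passed to `id` and `class_` would be whatever the browser's Inspect panel shows for the element you selected.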
Parsing the HTML
Now we need the URL of the website, here a Wikipedia article. To connect to the website and fetch the HTML, we will use requests.
The request returns an object of type requests.models.Response, which is stored in the variable data.
The content of the data object then needs to be passed to BeautifulSoup to parse the HTML.
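The two steps above could be sketched roughly as follows. The helper name `fetch_soup`, the `timeout` value, and the choice of the built-in `"html.parser"` are assumptions made for this illustration, not something the article prescribes.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed as a BeautifulSoup object."""
    # requests.get returns a requests.models.Response object.
    data = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page.
    data.raise_for_status()
    # data.text holds the decoded HTML; "html.parser" ships with Python.
    return BeautifulSoup(data.text, "html.parser")
```

Calling `fetch_soup` with the URL of the chosen Wikipedia article should return a parsed document that can then be searched by tag and attribute. Some sites expect a descriptive User-Agent header, which `requests.get` accepts through its `headers` argument.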
Find the tags and attributes related
Next, we find the relevant links on the Wikipedia page. Here, consider the links marked by the orange bracket.
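As a rough sketch of pulling links out of parsed HTML, the example below runs on a small hand-written fragment that imitates a "See also" list. The fragment and its link targets are hypothetical stand-ins, so the selectors may need adjusting for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for part of an article page.
html = """
<h2 id="See_also">See also</h2>
<ul>
  <li><a href="/wiki/Data_scraping" title="Data scraping">Data scraping</a></li>
  <li><a href="/wiki/Web_crawler" title="Web crawler">Web crawler</a></li>
</ul>
<p>Plain text with <a>an anchor that has no href</a>.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
links = [
    (a.get_text(strip=True), a["href"])
    for a in soup.find_all("a", href=True)
]

for text, href in links:
    print(text, "->", href)
```

Each entry pairs the visible link text with its target, which is usually enough to decide which links to follow or store.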