Python Modules for Web Scraping
In this chapter, let us learn about the various Python modules that we can use for web scraping.
Python Development Environments using virtualenv
Virtualenv is a tool to create isolated Python environments. With the help of virtualenv, we can create a folder that contains all necessary executables to use the packages that our Python project requires. It also allows us to add and modify Python modules without access to the global installation.
You can use the following command to install virtualenv −
(base) D:\ProgramData>pip install virtualenv
Collecting virtualenv
  Downloading https://files.pythonhosted.org/packages/b6/30/96a02b2287098b23b875bc8c2f58071c35d2efe84f747b64d523721dc2b5/virtualenv-16.0.0-py2.py3-none-any.whl (1.9MB)
    100% |████████████████████████████████| 1.9MB 86kB/s
Installing collected packages: virtualenv
Successfully installed virtualenv-16.0.0
Now, we need to create a directory that will hold the project, with the help of the following command −
(base) D:\ProgramData>mkdir webscrap
Now, enter that directory with the help of the following command −
(base) D:\ProgramData>cd webscrap
Now, we need to initialize a virtual environment folder with a name of our choice as follows −
(base) D:\ProgramData\webscrap>virtualenv websc
Using base prefix d:\programdata
New python executable in D:\ProgramData\webscrap\websc\Scripts\python.exe
Installing setuptools, pip, wheel...done.
Now, activate the virtual environment with the command given below. Once successfully activated, you will see the name of it on the left hand side in brackets.
(base) D:\ProgramData\webscrap>websc\scripts\activate
We can install any module in this environment as follows −
(websc) (base) D:\ProgramData\webscrap>pip install requests
Collecting requests
  Downloading https://files.pythonhosted.org/packages/65/47/7e02164a2a3db50ed6d8a6ab1d6d60b69c4c3fdf57a284257925dfc12bda/requests-2.19.1-py2.py3-none-any.whl (91kB)
    100% |████████████████████████████████| 92kB 148kB/s
Collecting chardet<3.1.0,>=3.0.2 (from requests)
  Downloading https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl (133kB)
    100% |████████████████████████████████| 143kB 369kB/s
Collecting certifi>=2017.4.17 (from requests)
  Downloading https://files.pythonhosted.org/packages/df/f7/04fee6ac349e915b82171f8e23cee63644d83663b34c539f7a09aed18f9e/certifi-2018.8.24-py2.py3-none-any.whl (147kB)
    100% |████████████████████████████████| 153kB 527kB/s
Collecting urllib3<1.24,>=1.21.1 (from requests)
  Downloading https://files.pythonhosted.org/packages/bd/c9/6fdd990019071a4a32a5e7cb78a1d92c53851ef4f56f62a3486e6a7d8ffb/urllib3-1.23-py2.py3-none-any.whl (133kB)
    100% |████████████████████████████████| 143kB 517kB/s
Collecting idna<2.8,>=2.5 (from requests)
  Downloading https://files.pythonhosted.org/packages/4b/2a/0276479a4b3caeb8a8c1af2f8e4355746a97fab05a372e4a2c6a6b876165/idna-2.7-py2.py3-none-any.whl (58kB)
    100% |████████████████████████████████| 61kB 339kB/s
Installing collected packages: chardet, certifi, urllib3, idna, requests
Successfully installed certifi-2018.8.24 chardet-3.0.4 idna-2.7 requests-2.19.1 urllib3-1.23
For deactivating the virtual environment, we can use the following command −
(websc) (base) D:\ProgramData\webscrap>deactivate
(base) D:\ProgramData\webscrap>
You can see that the (websc) prefix is gone from the prompt, which means the environment has been deactivated.
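The same steps can also be driven from Python itself. The following is a minimal sketch, not part of the original walkthrough: it uses the standard library venv module in place of the virtualenv tool, and the function names in_virtual_environment and create_environment are hypothetical names chosen for illustration.

```python
import sys
import venv
from pathlib import Path


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # venv and recent virtualenv releases expose sys.base_prefix;
    # older virtualenv releases set sys.real_prefix instead.
    base = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return base != sys.prefix


def create_environment(target: Path, with_pip: bool = True) -> Path:
    """Create an isolated environment in `target` and return its scripts folder."""
    venv.create(target, with_pip=with_pip)
    # Windows places the executables in "Scripts", other platforms in "bin".
    scripts = "Scripts" if sys.platform == "win32" else "bin"
    return target / scripts
```

Calling create_environment(Path("websc")) from inside the webscrap folder would produce a layout similar to the one created by the virtualenv command above, and in_virtual_environment() can be used in a script to confirm that the activated interpreter is the one actually running.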
Python Modules for Web Scraping
Web scraping is the process of constructing an agent which can extract, parse, download and organize useful information from the web automatically. In other words, instead of manually saving the data from websites, the web scraping software will automatically load and extract data from multiple websites as per our requirement.
In this section, we are going to discuss useful Python libraries for web scraping.
Requests
It is a simple Python web scraping library. It is an efficient HTTP library used for accessing web pages. With the help of Requests, we can get the raw HTML of web pages, which can then be parsed to retrieve the data. Before using Requests, let us understand its installation.
Installing Requests
We can install it either in our virtual environment or in the global installation. With the help of the pip command, we can easily install it as follows −
(base) D:\ProgramData> pip install requests
Collecting requests
  Using cached https://files.pythonhosted.org/packages/65/47/7e02164a2a3db50ed6d8a6ab1d6d60b69c4c3fdf57a284257925dfc12bda/requests-2.19.1-py2.py3-none-any.whl
Requirement already satisfied: idna<2.8,>=2.5 in d:\programdata\lib\site-packages (from requests) (2.6)
Requirement already satisfied: urllib3<1.24,>=1.21.1 in d:\programdata\lib\site-packages (from requests) (1.22)
Requirement already satisfied: certifi>=2017.4.17 in d:\programdata\lib\site-packages (from requests) (2018.1.18)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in d:\programdata\lib\site-packages (from requests) (3.0.4)
Installing collected packages: requests
Successfully installed requests-2.19.1
Example
In this example, we make an HTTP GET request for a web page. For this, we first need to import the requests library as follows −
In [1]: import requests
In the following line of code, we use requests to make an HTTP GET request for the URL https://authoraditiagarwal.com/ −
In [2]: r = requests.get('https://authoraditiagarwal.com/')
Now we can retrieve the content by using the .text property as follows −
In [5]: r.text[:200]
Observe that in the following output, we got the first 200 characters.
Out[5]: <!DOCTYPE html> <html lang="en-US" itemscope itemtype="http://schema.org/WebSite" prefix="og: http://ogp.me/ns#" > <head> <meta charset ="UTF-8" /> <meta http-equiv="X-UA-Compatible" content="IE
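In a script, as opposed to an interactive session, it usually helps to set a timeout and to check that the request succeeded before reading the body. The following is a minimal sketch under those assumptions; the function name fetch_preview and its default values are hypothetical and not part of the original example.

```python
import requests


def fetch_preview(url: str, length: int = 200, timeout: float = 10.0) -> str:
    """Return the first `length` characters of the page at `url`."""
    # Without a timeout, requests may wait indefinitely for a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx and 5xx responses instead of
    # silently returning an error page.
    response.raise_for_status()
    return response.text[:length]
```

Calling fetch_preview('https://authoraditiagarwal.com/') would return the same kind of 200-character preview as r.text[:200] above, while a missing page or server error surfaces as an exception rather than as unexpected HTML.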
Urllib3
It is another Python library that can be used for retrieving data from URLs, similar to the requests library. You can read more about it in its technical documentation.
Installing Urllib3
Using the pip command, we can install urllib3 either in our virtual environment or in the global installation.
(base) D:\ProgramData>pip install urllib3
Collecting urllib3
  Using cached https://files.pythonhosted.org/packages/bd/c9/6fdd990019071a4a32a5e7cb78a1d92c53851ef4f56f62a3486e6a7d8ffb/urllib3-1.23-py2.py3-none-any.whl
Installing collected packages: urllib3
Successfully installed urllib3-1.23
Example: Scraping using Urllib3 and BeautifulSoup
In the following example, we scrape a web page by using Urllib3 and BeautifulSoup. We use Urllib3 in place of the requests library to get the raw data (HTML) from the web page. Then we use BeautifulSoup to parse that HTML data.
import urllib3
from bs4 import BeautifulSoup

# A PoolManager handles connection pooling for all requests
http = urllib3.PoolManager()
r = http.request('GET', 'https://authoraditiagarwal.com')

# r.data holds the raw response body as bytes; parse it with the lxml parser
soup = BeautifulSoup(r.data, 'lxml')
print(soup.title)
print(soup.title.text)
This is the output you will observe when you run this code −
<title>Learn and Grow with Aditi Agarwal</title>
Learn and Grow with Aditi Agarwal
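The fetching and parsing steps can be separated so that the parsing part can be checked without network access. The following is a minimal sketch, not the original author's code: the function names fetch_html and extract_title are hypothetical, and it swaps in Python's built-in 'html.parser' for 'lxml' so that no extra parser needs to be installed.

```python
import urllib3
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10.0):
    """Download `url` with urllib3 and return the raw body as bytes."""
    http = urllib3.PoolManager()
    r = http.request("GET", url, timeout=timeout)
    # urllib3 does not raise on HTTP error codes, so check the status explicitly.
    if r.status != 200:
        raise RuntimeError("unexpected HTTP status %d for %s" % (r.status, url))
    return r.data


def extract_title(html):
    """Return the text of the <title> element, or None when there is none."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)
```

With these helpers, extract_title(fetch_html('https://authoraditiagarwal.com')) corresponds to the soup.title.text call in the example above, and a page without a title yields None instead of an AttributeError.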
Selenium
It is an open source automated testing suite for web applications across different browsers and platforms. It is not a single tool but a suite of software. Selenium has bindings for Python, Java, C#, Ruby and JavaScript. Here we are going to perform web scraping by using Selenium and its Python bindings. Selenium with Java is covered in a separate tutorial.
Selenium Python bindings provide a convenient API to access Selenium WebDrivers like Firefox, IE, Chrome, Remote etc. The current supported Python versions are 2.7, 3.5 and above.
Installing Selenium
Using the pip command, we can install selenium either in our virtual environment or in the global installation.
pip install selenium
As Selenium requires a driver to interface with the chosen browser, we need to download it. The following table lists the browsers and the driver each of them needs.
Browser | Driver
Chrome | ChromeDriver
Edge | Microsoft Edge WebDriver
Firefox | geckodriver
Safari | safaridriver (ships with macOS)
Example
This example shows web scraping using Selenium. Selenium can also be used for testing, which is called Selenium testing.
After downloading the driver that matches the installed version of the browser, we can write the Python code.
First, we need to import webdriver from selenium as follows −
from selenium import webdriver
Now, provide the path of web driver which we have downloaded as per our requirement −
path = r'C:\Users\gaurav\Desktop\Chromedriver'
browser = webdriver.Chrome(executable_path=path)
Now, provide the URL that we want to open in the web browser, which is now controlled by our Python script.
browser.get()
We can also select a particular element by providing its XPath, in the same way as with lxml.
browser.find_element_by_xpath('/html/body').click()
You can check the browser, controlled by Python script, for output.
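The calls above match the Selenium releases that were current when this chapter was written. In Selenium 4, the executable_path argument and the find_element_by_* helpers were deprecated and later removed, so the same steps look slightly different there. The following is a minimal sketch assuming Selenium 4 and a locally installed Chrome; the function name scrape_body_text and its parameters are hypothetical.

```python
def scrape_body_text(url, driver_path=None, headless=True):
    """Open `url` in Chrome and return the visible text of the <body> element."""
    # Imported here so that this module can be loaded even on machines
    # where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    options = Options()
    if headless:
        # Run Chrome without opening a visible window.
        options.add_argument("--headless=new")

    # Pass the driver location through a Service object; when no path is
    # given, Selenium tries to locate a suitable driver on its own.
    service = Service(executable_path=driver_path) if driver_path else Service()
    browser = webdriver.Chrome(service=service, options=options)
    try:
        browser.get(url)
        # By.XPATH replaces the older find_element_by_xpath helper.
        return browser.find_element(By.XPATH, "/html/body").text
    finally:
        # Always close the browser, even if the page fails to load.
        browser.quit()
```

Setting headless=False keeps the browser window visible, which is closer to the behaviour of the example above where the output is checked in the browser itself.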
Scrapy
Scrapy is a fast, open-source web crawling framework written in Python, used to extract data from web pages with the help of selectors based on XPath. Scrapy was first released on June 26, 2008 under the BSD license, with the milestone 1.0 release in June 2015. It provides all the tools we need to extract, process and structure data from websites.
Installing Scrapy
Using the pip command, we can install scrapy either in our virtual environment or in the global installation.
pip install scrapy
For a more detailed study of Scrapy, refer to the dedicated Scrapy tutorial.