Python Web Scraping - Processing CAPTCHA
In this chapter, let us understand how to scrape websites that use a CAPTCHA, a test that checks whether a user is a human or a robot.
What is CAPTCHA?
CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart, which makes clear that it is a test to determine whether the user is human or not.
A CAPTCHA is typically a distorted image that is hard for a computer program to read but that a human can usually decipher. Many websites use CAPTCHAs to prevent bots from interacting with them.
Loading CAPTCHA with Python
Suppose we want to register on a website whose registration form contains a CAPTCHA. Before loading the CAPTCHA image, we need to know what information the form requires. The following Python script retrieves the registration page at the URL used in the script and lists the fields of its form −
import lxml.html
import urllib.request as urllib2
import pprint
import http.cookiejar as cookielib

def form_parsing(html):
    # Build an element tree from the downloaded page
    tree = lxml.html.fromstring(html)
    data = {}
    # Collect the name and value of every named input inside a form
    for e in tree.cssselect('form input'):
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data

REGISTER_URL = 'http://example.webscraping.com/user/register'
# Keep cookies between requests, as a browser would
ckj = cookielib.CookieJar()
browser = urllib2.build_opener(urllib2.HTTPCookieProcessor(ckj))
html = browser.open(
    'http://example.webscraping.com/places/default/user/register?_next=/places/default/index'
).read()
form = form_parsing(html)
pprint.pprint(form)
In the above Python script, we first define a function that parses the form using the lxml Python module. The script then prints the form fields as follows −
{'_formkey': '5e306d73-5774-4146-a94e-3541f22c95ab',
 '_formname': 'register',
 '_next': '/places/default/index',
 'email': '',
 'first_name': '',
 'last_name': '',
 'password': '',
 'password_two': '',
 'recaptcha_response_field': None}
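The parsing step can also be tried offline. The following is a minimal sketch, not the script above: it uses the standard library HTMLParser instead of lxml, and the sample markup, the FormInputCollector class and the parse_form_fields function are hypothetical names introduced only for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample markup standing in for a downloaded registration page.
SAMPLE_HTML = """
<form action="#" method="post">
  <input name="first_name" type="text" value="" />
  <input name="email" type="text" value="" />
  <input name="_formname" type="hidden" value="register" />
  <input name="recaptcha_response_field" type="text" />
  <input type="submit" value="Register" />
</form>
"""


class FormInputCollector(HTMLParser):
    """Collect name/value pairs of <input> tags found inside a <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.data = {}

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            attributes = dict(attrs)
            # Skip inputs without a name, such as the submit button.
            if attributes.get("name"):
                self.data[attributes["name"]] = attributes.get("value")

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def parse_form_fields(html):
    """Return a dict mapping input names to their values (None if absent)."""
    collector = FormInputCollector()
    collector.feed(html)
    return collector.data


fields = parse_form_fields(SAMPLE_HTML)
print(fields)
```

As in the output above, an input that has no value attribute is reported as None, which is why the CAPTCHA response field stands out from the empty text fields.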
You can see from the above output that every field except recaptcha_response_field is understandable and straightforward. The question now is how to handle this field and download the CAPTCHA image. This can be done with the help of the Pillow Python library, as follows.
Pillow Python Package
Pillow is a fork of the Python Imaging Library (PIL) that provides useful functions for manipulating images. It can be installed with the following command −
pip install pillow
In the next example we will use it for loading the CAPTCHA −
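import base64
from io import BytesIO
import lxml.html
from PIL import Image

def load_captcha(html):
    tree = lxml.html.fromstring(html)
    # The CAPTCHA image is embedded in the page as a base64 data URI
    img_data = tree.cssselect('span#recaptcha img')[0].get('src')
    # Drop the "data:image/...;base64," header and keep the payload
    img_data = img_data.partition(',')[-1]
    # In Python 3, decode base64 with the base64 module
    binary_img_data = base64.b64decode(img_data)
    # Wrap the bytes in a file-like object that Pillow can open
    file_like = BytesIO(binary_img_data)
    img = Image.open(file_like)
    return img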
The above Python script uses the Pillow package and defines a function for loading the CAPTCHA image. It is meant to be used alongside the form_parsing() function defined in the previous script, which provides the information about the registration form. The function returns the CAPTCHA as a Pillow Image object, a convenient format from which the text can then be extracted as a string.
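The decoding step inside load_captcha() can be checked without a web page or Pillow. The sketch below assumes, as the script does, that the image src attribute is a base64 data URI; the decode_data_uri helper and the sample src value are hypothetical and serve only as an illustration.

```python
import base64


def decode_data_uri(src):
    """Return the raw bytes embedded in a base64 data URI."""
    # Split "data:image/png;base64,<payload>" at the first comma.
    header, _, payload = src.partition(",")
    if not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    return base64.b64decode(payload)


# Hypothetical src value: the eight-byte PNG signature, base64 encoded.
png_signature = b"\x89PNG\r\n\x1a\n"
src = "data:image/png;base64," + base64.b64encode(png_signature).decode("ascii")

raw = decode_data_uri(src)
print(raw == png_signature)
```

With a real CAPTCHA, the decoded bytes would be the complete image file, which is what BytesIO and Image.open() receive in the script above.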
OCR: Extracting Text from Image using Python
After loading the CAPTCHA in a useful format, we can extract its text with the help of Optical Character Recognition (OCR), the process of extracting text from images. For this purpose, we are going to use the open source Tesseract OCR engine through the pytesseract Python wrapper. The wrapper can be installed with the following command; the Tesseract engine itself must be installed separately −
pip install pytesseract
Example
Here we will extend the above Python script, which loaded the CAPTCHA by using Pillow Python Package, as follows −
import pytesseract

# html is the registration page downloaded in the first script
img = load_captcha(html)
img.save('captcha_original.png')
# Convert the image to grayscale
gray = img.convert('L')
gray.save('captcha_gray.png')
# Threshold: pixels with value 0 stay black, all others become white
bw = gray.point(lambda x: 0 if x < 1 else 255, '1')
bw.save('captcha_thresholded.png')
The above Python script converts the CAPTCHA to a black and white image, which is clear and easy to pass to Tesseract as follows −
pytesseract.image_to_string(bw)
After running the above script, we will get the text of the registration form's CAPTCHA as the output.
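The thresholding rule used in the OCR script can be examined in isolation. The sketch below is an illustration only: it applies the same rule to a small hand-made grid of grayscale values instead of a real image, and the threshold helper and the sample values are assumptions rather than part of the original script.

```python
def threshold(value, cutoff=1):
    """Map a grayscale value (0-255) to black (0) or white (255)."""
    # Same rule as the lambda passed to gray.point() in the script.
    return 0 if value < cutoff else 255


# Hypothetical grayscale pixels: 0 is pure black, 255 is pure white.
gray_pixels = [
    [0, 0, 37, 255],
    [0, 128, 0, 254],
]

bw_pixels = [[threshold(v) for v in row] for row in gray_pixels]
print(bw_pixels)
```

With a cutoff of 1, only completely black pixels remain black, so this rule suits CAPTCHAs whose characters are drawn in pure black on a lighter, noisy background. For other images a higher cutoff may work better.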