Scrapy - Selectors
Description
When you are scraping web pages, you need to extract a certain part of the HTML source. This is done with a mechanism called selectors, which use either XPath or CSS expressions. Selectors are built upon the lxml library, which processes XML and HTML in Python.
The following HTML snippet is used throughout this chapter to demonstrate the different selector concepts −
<html>
   <head>
      <title>My Website</title>
   </head>

   <body>
      <span>Hello world!!!</span>
      <span class = "links">
         <a href = "one.html">Link 1<img src = "image1.jpg" /></a>
         <a href = "two.html">Link 2<img src = "image2.jpg" /></a>
         <a href = "three.html">Link 3<img src = "image3.jpg" /></a>
      </span>
   </body>
</html>
Constructing Selectors
You can construct Selector class instances by passing either text or a TextResponse object. Based on the provided input type, the selector automatically chooses the best parsing rules. Start with the following imports −
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
Using the above imports, you can construct a selector from the text as −
Selector(text = body).xpath('//span/text()').extract()
It will display the result as −
[u'Hello world!!!']
You can construct a selector from the response as −
response = HtmlResponse(url = "http://mysite.com", body = body)
Selector(response = response).xpath('//span/text()').extract()
It will display the result as −
[u'Hello world!!!']
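For quick experimentation, the two construction styles can be combined into one self-contained script. The following is a minimal sketch; the URL http://mysite.com is a placeholder and body is shortened to just the span element from the HTML snippet above −

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

# Shortened HTML body used only for this sketch
body = '<html><body><span>Hello world!!!</span></body></html>'

# Construct a selector directly from text
print(Selector(text = body).xpath('//span/text()').extract())

# Construct a selector from a response object; encoding is passed because body is a string
response = HtmlResponse(url = "http://mysite.com", body = body, encoding = 'utf-8')
print(Selector(response = response).xpath('//span/text()').extract())

Both calls print the same list shown above.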
Using Selectors
Using the above HTML snippet, you can construct an XPath expression for selecting the text defined inside the title tag, as shown below −
>>> response.selector.xpath('//title/text()')
Now, you can extract the textual data using the .extract() method shown as follows −
>>> response.xpath('//title/text()').extract()
It will produce the result as −
[u'My Website']
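Selectors accept CSS expressions as well as XPath. The following equivalent query is an addition to the original example; it assumes the same response object and uses ::text, Scrapy's CSS extension for selecting text nodes −

>>> response.css('title::text').extract()
[u'My Website']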
You can display the text of all the link elements as shown below −
>>> response.xpath('//span[@class = "links"]/a/text()').extract()
It will display the elements as −
[u'Link 1', u'Link 2', u'Link 3']
If you want to extract the first element, then use the method .extract_first(), shown as follows −
>>> response.xpath('//span[@class = "links"]/a/text()').extract_first()
It will display the element as −
u'Link 1'
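Recent Scrapy releases also provide .get() and .getall() as shorter equivalents of .extract_first() and .extract(). The following sketch assumes the same response object as above, with the output shown as it appears on Python 3 −

>>> response.xpath('//span[@class = "links"]/a/text()').get()
'Link 1'
>>> response.xpath('//span[@class = "links"]/a/text()').getall()
['Link 1', 'Link 2', 'Link 3']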
Nesting Selectors
Using the above HTML, you can nest selectors to display each page link and its image source using the .xpath() method, as shown below −
# Select every anchor whose href attribute contains "html"
links = response.xpath('//a[contains(@href, "html")]')

for index, link in enumerate(links, start = 1):
   # Nested xpath() calls are evaluated relative to each selected <a> element
   args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
   print('The link %d pointing to url %s and image %s' % args)
It will display the result as −
The link 1 pointing to url [u'one.html'] and image [u'image1.jpg']
The link 2 pointing to url [u'two.html'] and image [u'image2.jpg']
The link 3 pointing to url [u'three.html'] and image [u'image3.jpg']
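The same nesting can also be expressed with CSS selectors. The following sketch is an addition to the original example; it assumes the same response object and uses ::attr(src), Scrapy's CSS extension for selecting attribute values −

# Select the anchors through the class of their parent span
for link in response.css('span.links a'):
   print(link.xpath('@href').get(), link.css('img::attr(src)').get())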
Selectors Using Regular Expressions
Scrapy allows you to extract data using regular expressions through the .re() method. From the above HTML code, we will extract the link names as shown below −
>>> response.xpath('//a[contains(@href, "html")]/text()').re(r'(Link\s*\d)')
The above line displays the link names as −
[u'Link 1', u'Link 2', u'Link 3']
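If only the first match is needed, selectors also provide the .re_first() method, which returns a single string instead of a list. The following sketch assumes the same response object −

>>> response.xpath('//a[contains(@href, "html")]/text()').re_first(r'(Link\s*\d)')
u'Link 1'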
Using Relative XPaths
When an XPath starts with /, it is absolute to the whole document; even on a nested selector, such an expression is not relative to the selector it is called from.
If you want to extract the <p> elements inside the span elements, then first get all the span elements −
>>> myspan = response.xpath('//span')
Next, you can extract all the p elements inside, by prefixing the XPath with a dot as .//p as shown below −
>>> for p in myspan.xpath('.//p').extract():
...    print(p)
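To see why the leading dot matters, compare the two forms below. This is an illustrative sketch added to the original text; it assumes the myspan selector defined above −

# .//p selects only the <p> elements inside the selected span elements
>>> myspan.xpath('.//p').extract()

# //p selects every <p> element in the whole document, ignoring the span context
>>> myspan.xpath('//p').extract()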
Using EXSLT Extensions
EXSLT is a community initiative that provides extensions to XSLT (Extensible Stylesheet Language Transformations), the language used to transform XML documents into other formats such as XHTML. You can use the EXSLT extensions with their registered namespaces in XPath expressions, as shown in the following table −
Sr.No | Prefix & Usage | Namespace |
---|---|---|
1 | re − regular expressions | http://exslt.org/regular-expressions |
2 | set − set manipulation | http://exslt.org/sets |
You can refer to the previous section for a simple example of extracting data with regular expressions.
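As an illustration of the re namespace (an addition to the original text, using the HTML snippet from the top of this chapter), re:test() can filter elements whose attribute matches a regular expression. Scrapy registers this namespace automatically −

# Select the src of every image whose file name matches image<digit>.jpg
>>> response.xpath('//img[re:test(@src, "image\d\.jpg$")]/@src').extract()
[u'image1.jpg', u'image2.jpg', u'image3.jpg']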
There are some XPath tips which are useful when using XPath with Scrapy selectors; for more information, refer to the selectors section of the Scrapy documentation.