Scrapy - Extracting Items
Description
For extracting data from web pages, Scrapy uses a technique called selectors, based on XPath and CSS expressions. Following are some examples of XPath expressions (a runnable sketch follows the list) −

- /html/head/title − This will select the <title> element inside the <head> element of an HTML document.
- /html/head/title/text() − This will select the text within the same <title> element.
- //td − This will select all the <td> elements.
- //span[@class = "spce"] − This will select all <span> elements that have an attribute class = "spce".
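These expressions can be tried standalone with Scrapy's Selector class. The following is a minimal sketch against a made-up HTML fragment (the markup is invented purely for illustration) −

```python
from scrapy.selector import Selector

# A made-up HTML fragment, just to exercise the expressions above
html = """
<html>
   <head><title>Sample Page</title></head>
   <body>
      <table><tr><td>Cell 1</td><td>Cell 2</td></tr></table>
      <span class="spce">Tagged span</span>
   </body>
</html>
"""

sel = Selector(text=html)
print(sel.xpath('/html/head/title').extract())         # the whole <title> element
print(sel.xpath('/html/head/title/text()').extract())  # just its text
print(sel.xpath('//td').extract())                     # every <td> element
print(sel.xpath('//span[@class="spce"]').extract())    # spans with class="spce"
```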
Selectors have four basic methods, as shown in the following table −

Sr.No | Method & Description
---|---
1 | extract() It returns a unicode string of the selected data.
2 | re() It returns a list of unicode strings, extracted by applying the regular expression given as an argument.
3 | xpath() It returns a list of selectors, which represents the nodes selected by the XPath expression given as an argument.
4 | css() It returns a list of selectors, which represents the nodes selected by the CSS expression given as an argument.
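As a rough sketch of how the four methods relate (the one-line HTML here is again invented for illustration) −

```python
from scrapy.selector import Selector

sel = Selector(text='<div><a href="/books">Books: 12</a></div>')

# xpath() and css() both return a list of new selectors
links_by_xpath = sel.xpath('//a')
links_by_css = sel.css('a')

# extract() serializes the matched nodes to unicode strings
print(links_by_xpath.extract())   # [u'<a href="/books">Books: 12</a>']

# re() applies a regular expression to the matched data instead
print(sel.xpath('//a/text()').re(r'(\w+): (\d+)'))   # [u'Books', u'12']
```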
Using Selectors in the Shell
To demonstrate the selectors with the built-in Scrapy shell, you need to have IPython installed in your system. The important thing here is that the URLs should be enclosed in quotes while running Scrapy; otherwise URLs containing & characters won't work. You can start a shell by using the following command in the project's top-level directory −

```
scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
```
The shell will look like the following −

```
[ ... Scrapy log here ... ]
2014-01-23 17:11:42-0400 [scrapy] DEBUG: Crawled (200)
<GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x3636b50>
[s]   item       {}
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   settings   <scrapy.settings.Settings object at 0x3fadc50>
[s]   spider     <Spider 'default' at 0x3cebf50>
[s] Useful shortcuts:
[s]   shelp()            Shell help (print this help)
[s]   fetch(req_or_url)  Fetch request (or URL) and update local objects
[s]   view(response)     View response in a browser

In [1]:
```
When the shell loads, you can access the body or headers by using response.body and response.headers respectively. Similarly, you can run queries on the response using response.selector.xpath() or response.selector.css().
For instance −
```
In [1]: response.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>My Book - Scrapy'>]

In [2]: response.xpath('//title').extract()
Out[2]: [u'<title>My Book - Scrapy: Index: Chapters</title>']

In [3]: response.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'My Book - Scrapy: Index:'>]

In [4]: response.xpath('//title/text()').extract()
Out[4]: [u'My Book - Scrapy: Index: Chapters']

In [5]: response.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Scrapy', u'Index', u'Chapters']
```
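The same lookups can also be written with CSS expressions; a short sketch, assuming the same shell session −

```python
# CSS equivalents of the XPath queries above (run inside the same shell)
response.css('title').extract()            # the whole <title> element
response.css('title::text').extract()      # its text, like //title/text()
response.css('title::text').re(r'(\w+):')  # regex extraction works the same way
```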
Extracting the Data
To extract data from a normal HTML site, we have to inspect the source code of the site to get XPaths. After inspecting, you can see that the data is inside a <ul> tag. Select the elements within the <li> tag.
The following lines of code show the extraction of different types of data −

For selecting data within the <li> tag −

```
response.xpath('//ul/li')
```

For selecting descriptions −

```
response.xpath('//ul/li/text()').extract()
```

For selecting site titles −

```
response.xpath('//ul/li/a/text()').extract()
```

For selecting site links −

```
response.xpath('//ul/li/a/@href').extract()
```
The following code demonstrates the use of the above extractors −

```python
import scrapy

class MyprojectSpider(scrapy.Spider):
   name = "project"
   allowed_domains = ["dmoz.org"]
   start_urls = [
      "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
      "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
   ]

   def parse(self, response):
      # Each <li> under a <ul> holds one listing: a link plus a text description
      for sel in response.xpath('//ul/li'):
         title = sel.xpath('a/text()').extract()
         link = sel.xpath('a/@href').extract()
         desc = sel.xpath('text()').extract()
         print title, link, desc   # Python 2 print, as in the rest of this tutorial
```
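To try the spider, save it in the project's spiders directory and run it by the name given in its name attribute −

```
scrapy crawl project
```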