Scrapy - Following Links
Description
In this chapter, we'll study how to extract the links of the pages of our interest, follow them, and extract data from those pages. For this, we need to make the following changes in our spider code, shown as follows −

```python
import scrapy
from tutorial.items import DmozItem

class MyprojectSpider(scrapy.Spider):
   name = "project"
   allowed_domains = ["dmoz.org"]
   start_urls = [
      "http://www.dmoz.org/Computers/Programming/Languages/Python/",
   ]

   def parse(self, response):
      for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
         url = response.urljoin(href.extract())
         yield scrapy.Request(url, callback = self.parse_dir_contents)

   def parse_dir_contents(self, response):
      for sel in response.xpath('//ul/li'):
         item = DmozItem()
         item['title'] = sel.xpath('a/text()').extract()
         item['link'] = sel.xpath('a/@href').extract()
         item['desc'] = sel.xpath('text()').extract()
         yield item
```
The above code contains the following methods −
parse() − It will extract the links of our interest.
response.urljoin − The parse() method will use this method to build a new URL and provide a new request, which will be sent later to the callback.
parse_dir_contents() − This is a callback which will actually scrape the data of interest.
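The URL-joining step can be illustrated with the standard library's urllib.parse.urljoin, which resolves relative hrefs against a base URL the same way response.urljoin resolves them against the URL of the downloaded page (the base URL below is taken from the spider above):

```python
from urllib.parse import urljoin

# response.urljoin(href) resolves a (possibly relative) href against
# the URL of the page that was just downloaded.
base = "http://www.dmoz.org/Computers/Programming/Languages/Python/"

# A relative href is resolved against the base page's directory.
print(urljoin(base, "Books/"))
# → http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

# An absolute href is returned unchanged.
print(urljoin(base, "http://example.com/page"))
# → http://example.com/page
```

This is why the spider can yield the same scrapy.Request regardless of whether the page used relative or absolute links.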
Here, Scrapy uses a callback mechanism to follow links. Using this mechanism, a bigger crawler can be designed to follow links of interest and scrape the desired data from different pages. The usual pattern is a callback method that extracts the items, looks for a link to the next page, and then yields a request with the same callback.
The following example produces a loop, which will follow the links to the next page.
```python
def parse_articles_follow_next_page(self, response):
   for article in response.xpath("//article"):
      item = ArticleItem()
      # ... extract article data here
      yield item

   next_page = response.css("ul.navigation > li.next-page > a::attr('href')")
   if next_page:
      url = response.urljoin(next_page[0].extract())
      yield scrapy.Request(url, self.parse_articles_follow_next_page)
```
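To make the control flow of that loop concrete without running Scrapy, here is a minimal stand-alone sketch of the same pattern: a generator "callback" that yields the items of a page and then follows its next-page link. The PAGES dict is hypothetical stand-in data; in a real spider, Scrapy's engine downloads the page and schedules the yielded Request instead of the direct generator delegation used here.

```python
# Hypothetical stand-in for downloaded pages: each page has items
# and an optional link to the next page (None ends the pagination).
PAGES = {
    "/articles?page=1": {"items": ["a1", "a2"], "next": "/articles?page=2"},
    "/articles?page=2": {"items": ["a3"], "next": "/articles?page=3"},
    "/articles?page=3": {"items": ["a4", "a5"], "next": None},
}

def parse_articles(url):
    """Mimics the spider callback: yield items, then follow 'next'."""
    page = PAGES[url]
    for item in page["items"]:
        yield item                    # like `yield item` in the spider
    if page["next"]:                  # like yielding scrapy.Request(...)
        yield from parse_articles(page["next"])

print(list(parse_articles("/articles?page=1")))
# → ['a1', 'a2', 'a3', 'a4', 'a5']
```

Note that, just as in the Scrapy version, no page is fetched until the consumer actually iterates, since both the spider callback and this sketch are generators.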