
Scrapy - Following Links



Description

In this chapter, we'll study how to extract the links to the pages of our interest, follow them, and extract data from those pages. For this, we need to make the following changes in our previous code, shown as follows −

import scrapy
from tutorial.items import DmozItem

class MyprojectSpider(scrapy.Spider):
   name = "project"
   allowed_domains = ["dmoz.org"]
   
   start_urls = [
      "http://www.dmoz.org/Computers/Programming/Languages/Python/",
   ]
   def parse(self, response):
      for href in response.css("ul.directory.dir-col > p > a::attr( href )"):
         url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback = self.parse_dir_contents)

   def parse_dir_contents(self, response):
      for sel in response.xpath( //ul/p ):
         item = DmozItem()
         item[ title ] = sel.xpath( a/text() ).extract()
         item[ pnk ] = sel.xpath( a/@href ).extract()
         item[ desc ] = sel.xpath( text() ).extract()
         yield item
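
For reference, the DmozItem imported above is declared in tutorial/items.py. A minimal sketch of that item class, assuming only the three fields used by parse_dir_contents(), would be −

import scrapy

class DmozItem(scrapy.Item):
   # One field per key filled in by parse_dir_contents()
   title = scrapy.Field()
   link = scrapy.Field()
   desc = scrapy.Field()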

The above code contains the following methods −

    parse() − It will extract the links of our interest.

    response.urljoin − The parse() method will use this method to build a new URL and provide a new request, which will be sent later to the callback (a standalone sketch of this call follows the list).

    parse_dir_contents() − This is a callback which will actually scrape the data of interest.
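
To see response.urljoin in isolation, the following standalone sketch builds a fake response with scrapy.http.HtmlResponse (the URL and markup here are made up for illustration) and resolves a relative href against the page URL −

from scrapy.http import HtmlResponse

# A hand-built response, standing in for one fetched by the crawler
response = HtmlResponse(
   url = "http://www.dmoz.org/Computers/Programming/Languages/Python/",
   body = b'<ul><li><a href="Books/">Books</a></li></ul>',
   encoding = "utf-8"
)

href = response.css("li > a::attr(href)").extract_first()   # 'Books/'
print(response.urljoin(href))
# http://www.dmoz.org/Computers/Programming/Languages/Python/Books/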

Here, Scrapy uses a callback mechanism to follow links. Using this mechanism, a bigger crawler can be designed to follow links of interest and scrape the desired data from different pages. The usual pattern is a callback method that extracts the items, looks for a link to the next page, and then yields a request with the same callback.

The following example produces a loop that follows the links to the next page.

def parse_articles_follow_next_page(self, response):
   for article in response.xpath("//article"):
      item = ArticleItem()

      # ... extract article data here

      yield item

   next_page = response.css("ul.navigation > li.next-page > a::attr(href)")
   if next_page:
      url = response.urljoin(next_page[0].extract())
      yield scrapy.Request(url, self.parse_articles_follow_next_page)
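
As a side note, newer Scrapy releases (1.4 and later) also provide response.follow, which resolves relative URLs itself, so the urljoin step can be dropped. A sketch of the same loop using that shortcut −

def parse_articles_follow_next_page(self, response):
   for article in response.xpath("//article"):
      item = ArticleItem()

      # ... extract article data here

      yield item

   # response.follow accepts a relative URL directly, so no urljoin is needed
   next_page = response.css("ul.navigation > li.next-page > a::attr(href)").extract_first()
   if next_page is not None:
      yield response.follow(next_page, callback = self.parse_articles_follow_next_page)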