Scrapy - First Spider
  • Date: 2024-09-17


Description

A Spider is a class that defines the initial URLs to extract data from, how to follow pagination links, and how to extract and parse the fields defined in items.py. Scrapy provides different types of spiders, each of which serves a specific purpose.

Create a file called "first_spider.py" under the first_scrapy/spiders directory, where we can tell Scrapy how to find the exact data we're looking for. For this, you must define some attributes −

    name − It defines the unique name for the spider.

    allowed_domains − It contains the domains that the spider is allowed to crawl.

    start_urls − A list of URLs from where the spider starts crawling.

    parse() − It is a method that extracts and parses the scraped data.
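
As a rough illustration of what allowed_domains restricts, the sketch below checks whether a URL's host is one of the listed domains or a subdomain of one. The helper name is_allowed is hypothetical and not part of Scrapy; this only approximates the idea and is not Scrapy's own filtering code.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host is a listed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed_domains = ["dmoz.org"]

# "www.dmoz.org" is a subdomain of "dmoz.org", so it is accepted
on_site = is_allowed("http://www.dmoz.org/Computers/", allowed_domains)

# A different domain is rejected, even if its name merely ends in the same letters
off_site = is_allowed("http://example.com/", allowed_domains)
look_alike = is_allowed("http://notdmoz.org/", allowed_domains)
```

Note that the entries are bare domain names such as "dmoz.org", not full URLs.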

The following code demonstrates what a spider looks like −

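import scrapy

class firstSpider(scrapy.Spider):
   # Unique name used to refer to this spider
   name = "first"
   # Domains the spider is allowed to crawl
   allowed_domains = ["dmoz.org"]

   # URLs where crawling starts
   start_urls = [
      "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
      "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
   ]

   def parse(self, response):
      # Name the output file after the last path segment of the URL
      filename = response.url.split("/")[-2] + ".html"
      # Save the raw response body to that file
      with open(filename, "wb") as f:
         f.write(response.body)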
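
To see which file names parse() would produce, the filename expression can be tried on its own without running Scrapy. This is a minimal sketch in plain Python; the function name filename_for is hypothetical and exists only for illustration.

```python
def filename_for(url):
    """Hypothetical helper mirroring the expression in parse():
    second-to-last "/"-separated piece of the URL plus ".html"."""
    return url.split("/")[-2] + ".html"


start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

# Because each URL ends with "/", the last piece after splitting is an
# empty string, so index -2 picks the final directory name.
filenames = [filename_for(url) for url in start_urls]
print(filenames)  # ['Books.html', 'Resources.html']
```

The expression relies on the trailing slash: for a URL without one, index -2 would select the parent segment instead of the last one.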