Scrapy - First Spider

Description

A spider is a class that defines the initial URLs to extract data from, how to follow pagination links, and how to extract and parse the fields defined in items.py. Scrapy provides different types of spiders, each of which serves a specific purpose.

Create a file called "first_spider.py" under the first_scrapy/spiders directory, where we can tell Scrapy how to find the exact data we're looking for. For this, you must define some attributes −

name − It defines the unique name for the spider.

allowed_domains − It lists the domains that the spider is allowed to crawl.

start_urls − A list of URLs from which the spider starts crawling.

parse() − It is a method that extracts and parses the scraped data.
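To make the role of allowed_domains more concrete, here is a minimal sketch of the kind of check it implies: a URL is accepted when its host equals an allowed domain or is a subdomain of one. The helper name is_allowed is hypothetical and this is only an approximation of the idea, not Scrapy's own implementation.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one.

    Hypothetical helper for illustration; Scrapy performs its own filtering.
    """
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


print(is_allowed("http://www.dmoz.org/Computers/", ["dmoz.org"]))
print(is_allowed("http://example.com/", ["dmoz.org"]))
```

Under this assumption, the first call accepts www.dmoz.org because it is a subdomain of dmoz.org, while the second rejects an unrelated host.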

The following code demonstrates what a spider looks like −

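import scrapy

class firstSpider(scrapy.Spider):
   # Unique name used to launch the spider, e.g. "scrapy crawl first"
   name = "first"
   # Only URLs on these domains are crawled
   allowed_domains = ["dmoz.org"]

   start_urls = [
      "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
      "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
   ]

   def parse(self, response):
      # Name the file after the last path segment, e.g. "Books.html"
      filename = response.url.split("/")[-2] + ".html"
      # response.body is bytes, so the file is opened in binary mode
      with open(filename, "wb") as f:
         f.write(response.body)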
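The file name built in parse() can be checked without running a crawl. The sketch below applies the same string expression to the two start URLs; the helper name filename_for is hypothetical and exists only for this illustration.

```python
def filename_for(url):
    """Mirror the expression used in parse(): second-to-last '/' segment plus '.html'."""
    return url.split("/")[-2] + ".html"


start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

for url in start_urls:
    print(url, "->", filename_for(url))
```

This yields Books.html and Resources.html. Note that the expression relies on the trailing slash: for a URL without one, index -2 would select the parent segment instead.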
