Scrapy Basic Concepts
- Scrapy - Exceptions
- Scrapy - Settings
- Scrapy - Link Extractors
- Scrapy - Requests & Responses
- Scrapy - Feed exports
- Scrapy - Item Pipeline
- Scrapy - Shell
- Scrapy - Item Loaders
- Scrapy - Items
- Scrapy - Selectors
- Scrapy - Spiders
- Scrapy - Command Line Tools
- Scrapy - Environment
- Scrapy - Overview
Scrapy Live Project
- Scrapy - Scraped Data
- Scrapy - Following Links
- Scrapy - Using an Item
- Scrapy - Extracting Items
- Scrapy - Crawling
- Scrapy - First Spider
- Scrapy - Define an Item
- Scrapy - Create a Project
Scrapy Built In Services
- Scrapy - Web Services
- Scrapy - Telnet Console
- Scrapy - Sending an E-mail
- Scrapy - Stats Collection
- Scrapy - Logging
Scrapy Useful Resources
Selected Reading
- Who is Who
- Computer Glossary
- HR Interview Questions
- Effective Resume Writing
- Questions and Answers
- UPSC IAS Exams Notes
Scrapy - First Spider
Description
Spider is a class that defines initial URL to extract the data from, how to follow pagination pnks and how to extract and parse the fields defined in the items.py. Scrapy provides different types of spiders each of which gives a specific purpose.
Create a file called "first_spider.py" under the first_scrapy/spiders directory, where we can tell Scrapy how to find the exact data we re looking for. For this, you must define some attributes −
name − It defines the unique name for the spider.
allowed_domains − It contains the base URLs for the spider to crawl.
start-urls − A pst of URLs from where the spider starts crawpng.
parse() − It is a method that extracts and parses the scraped data.
The following code demonstrates how a spider code looks pke −
import scrapy class firstSpider(scrapy.Spider): name = "first" allowed_domains = ["dmoz.org"] start_urls = [ "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/", "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/" ] def parse(self, response): filename = response.url.sppt("/")[-2] + .html with open(filename, wb ) as f: f.write(response.body)Advertisements