Scrapy - Spiders
Description
A spider is a class that defines how to follow the links through a website and extract the information from its pages.
The default spiders of Scrapy are as follows −
scrapy.Spider
It is the spider from which every other spider must inherit. It has the following class −
class scrapy.spiders.Spider
The following table shows the fields of the scrapy.Spider class −
Sr.No | Field & Description |
---|---|
1 | name − It is the name of your spider. |
2 | allowed_domains − It is a list of domains on which the spider crawls. |
3 | start_urls − It is a list of URLs from which the spider begins to crawl, serving as the roots for later crawls. |
4 | custom_settings − These are the settings that override the project-wide configuration when running this spider. |
5 | crawler − It is an attribute that links to the Crawler object to which the spider instance is bound. |
6 | settings − These are the settings for running a spider. |
7 | logger − It is a Python logger used to send log messages. |
8 | from_crawler(crawler,*args,**kwargs) − It is a class method that creates your spider. Its parameters are: crawler − a crawler to which the spider instance will be bound; args (list) − these arguments are passed to the method __init__(); kwargs (dict) − these keyword arguments are passed to the method __init__(). |
9 | start_requests() − When no particular URLs are specified and the spider is opened for scraping, Scrapy calls the start_requests() method. |
10 | make_requests_from_url(url) − It is a method used to convert URLs to requests. |
11 | parse(response) − This method processes the response and returns scraped data along with more URLs to follow. |
12 | log(message[,level,component]) − It is a method that sends a log message through the spider's logger. |
13 | closed(reason) − This method is called when the spider closes. |
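For reference, here is a minimal sketch of a spider that subclasses scrapy.Spider and uses the fields above; the spider name and URLs are hypothetical placeholders −

import scrapy

class MinimalSpider(scrapy.Spider):
   # Hypothetical names and URLs, for illustration only
   name = "minimal"
   allowed_domains = ["www.demoexample.com"]
   start_urls = ["http://www.demoexample.com"]

   def parse(self, response):
      # Yield the page title as a simple piece of scraped data
      yield {"title": response.xpath("//title/text()").extract_first()}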
Spider Arguments
Spider arguments are used to specify start URLs and are passed using the crawl command with the -a option, as shown below −
scrapy crawl first_scrapy -a group=accessories
The following code demonstrates how a spider receives arguments −
import scrapy

class FirstSpider(scrapy.Spider):
   name = "first"

   def __init__(self, group=None, *args, **kwargs):
      super(FirstSpider, self).__init__(*args, **kwargs)
      self.start_urls = ["http://www.example.com/group/%s" % group]
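Scrapy also sets any argument passed with -a as an attribute on the spider instance, so the same idea can be sketched without overriding __init__; the site and the fallback value below are only illustrative −

import scrapy

class FirstSpider(scrapy.Spider):
   name = "first"

   def start_requests(self):
      # Arguments passed with -a become spider attributes by default;
      # "accessories" is just an illustrative fallback value.
      group = getattr(self, "group", "accessories")
      yield scrapy.Request("http://www.example.com/group/%s" % group)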
Generic Spiders
You can subclass your spiders from the generic spiders. Their aim is to follow all links on the website, based on certain rules, and extract data from all the pages.
For the examples used in the following spiders, let's assume we have a project with an item containing the following fields −
import scrapy
from scrapy.item import Item, Field

class First_scrapyItem(scrapy.Item):
   product_title = Field()
   product_link = Field()
   product_description = Field()
CrawlSpider
CrawlSpider defines a set of rules to follow the links and scrape more than one page. It has the following class −
class scrapy.spiders.CrawlSpider
Following are the attributes of the CrawlSpider class −
rules
It is a list of Rule objects that defines how the crawler follows the links.
The following table shows the parameters of a Rule in the CrawlSpider class −
Sr.No | Rule & Description |
---|---|
1 | LinkExtractor − It specifies how the spider follows the links and extracts the data. |
2 | callback − It is called after each page is scraped. |
3 | follow − It specifies whether to continue following links or not. |
parse_start_url(response)
It returns either an item or a request object, allowing the initial responses to be parsed.
Note − Make sure you name your callback function something other than parse while writing the rules, because the parse function is used by CrawlSpider to implement its logic.
Let’s take a look at the following example, where the spider starts crawling demoexample.com's home page, collecting all pages and links, and parsing them with the parse_item method −
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
# Assuming the item is defined in demoproject.items, as in the CSVFeedSpider example below
from demoproject.items import DemoItem

class DemoSpider(CrawlSpider):
   name = "demo"
   allowed_domains = ["www.demoexample.com"]
   start_urls = ["http://www.demoexample.com"]

   rules = (
      Rule(LinkExtractor(allow=(), restrict_xpaths=("//span[@class='next']",)),
         callback="parse_item", follow=True),
   )

   def parse_item(self, response):
      item = DemoItem()
      item["product_title"] = response.xpath("a/text()").extract()
      item["product_link"] = response.xpath("a/@href").extract()
      item["product_description"] = response.xpath("span[@class='desc']/text()").extract()
      return item
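In this sketch, follow=True tells the CrawlSpider to keep extracting links from the pages reached through the rule, while callback="parse_item" handles scraping each of those pages.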
XMLFeedSpider
It is the base class for spiders that scrape from XML feeds and iterate over their nodes. It has the following class −
class scrapy.spiders.XMLFeedSpider
The following table shows the class attributes used to set an iterator and a tag name −
Sr.No | Attribute & Description |
---|---|
1 | iterator − It defines the iterator to be used. It can be either iternodes, html or xml. The default is iternodes. |
2 | itertag − It is a string with the name of the node to iterate over. |
3 | namespaces − It is a list of (prefix, uri) tuples that automatically registers namespaces using the register_namespace() method. |
4 | adapt_response(response) − It receives the response and modifies the response body as soon as it arrives from the spider middleware, before the spider starts parsing it. |
5 | parse_node(response,selector) − It receives the response and a selector when called for each node matching the provided tag name. Note − Your spider won't work if you don't override this method. |
6 | process_results(response,results) − It receives the response and the list of results returned by the spider, and performs any final processing of the results. |
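The chapter does not include an XMLFeedSpider example, so the following is a minimal sketch; the feed URL and the item/title node names are hypothetical −

from scrapy.spiders import XMLFeedSpider

class DemoXMLSpider(XMLFeedSpider):
   # Hypothetical feed URL and node names, for illustration only
   name = "demoxml"
   allowed_domains = ["www.demoexample.com"]
   start_urls = ["http://www.demoexample.com/feed.xml"]
   iterator = "iternodes"   # the default iterator
   itertag = "item"         # iterate over each <item> node

   def parse_node(self, response, node):
      # node is a selector scoped to the current <item>
      yield {"title": node.xpath("title/text()").extract_first()}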
CSVFeedSpider
It receives a CSV file as a response, iterates through each of its rows, and calls the parse_row() method. It has the following class −
class scrapy.spiders.CSVFeedSpider
The following table shows the options that can be set regarding the CSV file −
Sr.No | Option & Description |
---|---|
1 | delimiter − It is a string with the separator character for each field, a comma ( , ) by default. |
2 | quotechar − It is a string with the quote character for each field, a quotation mark ( " ) by default. |
3 | headers − It is a list of the column names in the CSV file, from which the fields are extracted. |
4 | parse_row(response,row) − It receives a response and each row as a dict with a key for each header. |
CSVFeedSpider Example
from scrapy.spiders import CSVFeedSpider
from demoproject.items import DemoItem

class DemoSpider(CSVFeedSpider):
   name = "demo"
   allowed_domains = ["www.demoexample.com"]
   start_urls = ["http://www.demoexample.com/feed.csv"]
   delimiter = ";"
   quotechar = "'"
   headers = ["product_title", "product_link", "product_description"]

   def parse_row(self, response, row):
      self.logger.info("This is row: %r", row)
      item = DemoItem()
      item["product_title"] = row["product_title"]
      item["product_link"] = row["product_link"]
      item["product_description"] = row["product_description"]
      return item
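A hypothetical feed.csv matching this configuration could look like the following, with the headers attribute supplying the column names and ; as the delimiter −

Laptop bag;http://www.demoexample.com/item/1;A padded 15-inch laptop bag
USB cable;http://www.demoexample.com/item/2;A 1m USB-C charging cable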
SitemapSpider
SitemapSpider, with the help of Sitemaps, crawls a website by locating the URLs from robots.txt. It has the following class −
class scrapy.spiders.SitemapSpider
The following table shows the fields of SitemapSpider −
Sr.No | Field & Description |
---|---|
1 | sitemap_urls − A list of URLs pointing to the sitemaps which you want to crawl. |
2 | sitemap_rules − It is a list of tuples (regex, callback), where regex is a regular expression and callback is used to process the URLs matching that regular expression. |
3 | sitemap_follow − It is a list of regexes of sitemaps to follow. |
4 | sitemap_alternate_links − Specifies whether alternate links for a single URL should be followed. |
SitemapSpider Example
The following SitemapSpider processes all the URLs −
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   sitemap_urls = ["http://www.demoexample.com/sitemap.xml"]

   def parse(self, response):
      # You can scrape items here
      pass
The following SitemapSpider processes some URLs with a callback −
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   sitemap_urls = ["http://www.demoexample.com/sitemap.xml"]
   sitemap_rules = [
      ("/item/", "parse_item"),
      ("/group/", "parse_group"),
   ]

   def parse_item(self, response):
      # You can scrape an item here
      pass

   def parse_group(self, response):
      # You can scrape a group here
      pass
The following code follows only the sitemaps in robots.txt whose URL contains /sitemap_company −
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   sitemap_urls = ["http://www.demoexample.com/robots.txt"]
   sitemap_rules = [
      ("/company/", "parse_company"),
   ]
   sitemap_follow = ["/sitemap_company"]

   def parse_company(self, response):
      # You can scrape a company page here
      pass
You can even combine SitemapSpider with other URLs, as shown in the following code −
import scrapy
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   sitemap_urls = ["http://www.demoexample.com/robots.txt"]
   sitemap_rules = [
      ("/company/", "parse_company"),
   ]
   other_urls = ["http://www.demoexample.com/contact-us"]

   def start_requests(self):
      requests = list(super(DemoSpider, self).start_requests())
      requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
      return requests

   def parse_company(self, response):
      # You can scrape company pages here
      pass

   def parse_other(self, response):
      # You can scrape other pages here
      pass