Scrapy - Item Pipeline
Description
Item Pipeline is the mechanism through which scraped items are processed. After an item has been scraped by a spider, it is sent to the Item Pipeline, where it is processed through several components that are executed sequentially.
Whenever an item is received, the pipeline decides on one of the following actions −
Keep processing the item.
Drop it from the pipeline.
Stop processing the item.
Item pipelines are generally used for the following purposes −
Storing scraped items in a database.
Dropping an item if it is a repeat of an already processed item.
Checking whether the item contains the targeted fields.
Cleansing HTML data.
Syntax
You can write an Item Pipeline using the following method −
process_item(self, item, spider)
The above method contains the following parameters −
item (item object or dictionary) − It specifies the scraped item.
spider (spider object) − The spider which scraped the item.
You can use additional methods given in the following table −

| Sr.No | Method & Description | Parameters |
|---|---|---|
| 1 | open_spider(self, spider) It is called when the spider is opened. | spider (spider object) − It refers to the spider which was opened. |
| 2 | close_spider(self, spider) It is called when the spider is closed. | spider (spider object) − It refers to the spider which was closed. |
| 3 | from_crawler(cls, crawler) With the help of the crawler, the pipeline can access core Scrapy components such as signals and settings. | crawler (Crawler object) − It refers to the crawler that uses this pipeline. |
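Putting these methods together, a minimal pipeline skeleton (the class name and the EXAMPLE_ENABLED setting are illustrative assumptions, not part of Scrapy) might look like this −

class ExamplePipeline(object):
   @classmethod
   def from_crawler(cls, crawler):
      # read a value from the Scrapy settings through the crawler
      return cls(crawler.settings.getbool('EXAMPLE_ENABLED', True))

   def __init__(self, enabled):
      self.enabled = enabled

   def open_spider(self, spider):
      # called once when the spider is opened
      spider.logger.info("Pipeline started for %s" % spider.name)

   def close_spider(self, spider):
      # called once when the spider is closed
      spider.logger.info("Pipeline finished for %s" % spider.name)

   def process_item(self, item, spider):
      # called for every scraped item; return it so later pipelines receive it
      return item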
Example
Following are examples of item pipelines used for different purposes.
Dropping Items with No Tag
In the following code, the pipeline adjusts the price attribute for those items that do not include VAT (excludes_vat attribute) and drops those items which do not have a price −
from scrapy.exceptions import DropItem

class PricePipeline(object):
   vat = 2.25

   def process_item(self, item, spider):
      if item['price']:
         if item['excludes_vat']:
            item['price'] = item['price'] * self.vat
         return item
      else:
         raise DropItem("Missing price in %s" % item)
Writing Items to a JSON File
The following code will store all the scraped items from all spiders into a single items.jl file, which contains one item per line in a serialized JSON format. The JsonWriterPipeline class is used in the code to show how to write an item pipeline −
import json

class JsonWriterPipeline(object):
   def __init__(self):
      self.file = open('items.jl', 'w')

   def process_item(self, item, spider):
      line = json.dumps(dict(item)) + "\n"
      self.file.write(line)
      return item
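As a variant sketch (not part of the original example), the same pipeline can manage the file with the open_spider() and close_spider() methods described above, so the file is opened and closed together with the spider −

import json

class JsonWriterPipeline(object):
   def open_spider(self, spider):
      # open the output file when the spider starts
      self.file = open('items.jl', 'w')

   def close_spider(self, spider):
      # close the output file when the spider finishes
      self.file.close()

   def process_item(self, item, spider):
      line = json.dumps(dict(item)) + "\n"
      self.file.write(line)
      return item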
Writing Items to MongoDB
You can specify the MongoDB address and database name in the Scrapy settings, and the MongoDB collection can be named after the item class. The following code describes how to use the from_crawler() method to collect the resources properly −
import pymongo

class MongoPipeline(object):
   collection_name = 'Scrapy_list'

   def __init__(self, mongo_uri, mongo_db):
      self.mongo_uri = mongo_uri
      self.mongo_db = mongo_db

   @classmethod
   def from_crawler(cls, crawler):
      return cls(
         mongo_uri = crawler.settings.get('MONGO_URI'),
         mongo_db = crawler.settings.get('MONGO_DB', 'lists')
      )

   def open_spider(self, spider):
      self.client = pymongo.MongoClient(self.mongo_uri)
      self.db = self.client[self.mongo_db]

   def close_spider(self, spider):
      self.client.close()

   def process_item(self, item, spider):
      self.db[self.collection_name].insert_one(dict(item))
      return item
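For this pipeline to work, the MONGO_URI setting it reads must be defined in the project's settings.py (MONGO_DB falls back to the default passed in from_crawler()); the values below are only placeholder assumptions −

# settings.py (placeholder values)
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DB = 'scrapy_items'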
Duplicates Filter
A filter checks for repeated items and drops those that have already been processed. In the following code, we have used a unique id for our items, but the spider returns many items with the same id −
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
   def __init__(self):
      self.ids_seen = set()

   def process_item(self, item, spider):
      if item['id'] in self.ids_seen:
         raise DropItem("Repeated items found: %s" % item)
      else:
         self.ids_seen.add(item['id'])
         return item
Activating an Item Pipeline
You can activate an Item Pipeline component by adding its class to the ITEM_PIPELINES setting, as shown in the following code. The integer values assigned to the classes determine the order in which they run (classes with lower values run before those with higher values), and the values are normally kept in the 0-1000 range.
ITEM_PIPELINES = {
   'myproject.pipelines.PricePipeline': 100,
   'myproject.pipelines.JsonWriterPipeline': 600,
}
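If you only need a pipeline for one spider, it can also be enabled through the standard custom_settings spider attribute; the spider name, URL and yielded item below are hypothetical, used only to illustrate the idea −

import scrapy

class DemoSpider(scrapy.Spider):
   # hypothetical spider for illustration
   name = "demo"
   start_urls = ["http://example.com"]

   # enable the pipeline only for this spider
   custom_settings = {
      "ITEM_PIPELINES": {
         "myproject.pipelines.PricePipeline": 100,
      },
   }

   def parse(self, response):
      # hypothetical item; a real spider would extract these fields from the response
      yield {"price": 10.0, "excludes_vat": True}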