Scrapy - Item Loaders
Description
Item loaders provide a convenient way to fill items with the data scraped from websites.
Declaring Item Loaders
Item Loaders are declared in the same way as Items.
For example −
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join

class DemoLoader(ItemLoader):
   default_output_processor = TakeFirst()

   title_in = MapCompose(unicode.title)
   title_out = Join()

   size_in = MapCompose(unicode.strip)

   # you can continue scraping here
In the above code, you can see that input processors are declared using the _in suffix and output processors are declared using the _out suffix.
The ItemLoader.default_input_processor and ItemLoader.default_output_processor attributes are used to declare default input/output processors.
Using Item Loaders to Populate Items
To use an Item Loader, first instantiate it either with a dict-like object (such as an Item) or without one, in which case the item is instantiated automatically using the Item class specified in the ItemLoader.default_item_class attribute.
You can use selectors to collect values into the Item Loader.
You can add more values in the same item field, where Item Loader will use an appropriate handler to add these values.
The following code demonstrates how items are populated using Item Loaders −
from scrapy.loader import ItemLoader
from demoproject.items import Product

def parse(self, response):
   l = ItemLoader(item = Product(), response = response)
   l.add_xpath("title", "//span[@class = 'product_title']")
   l.add_xpath("title", "//span[@class = 'product_name']")
   l.add_xpath("desc", "//span[@class = 'desc']")
   l.add_css("size", "span#size")
   l.add_value("last_updated", "yesterday")
   return l.load_item()
As shown above, there are two different XPaths from which the title field is extracted using add_xpath() method −
1. //span[@class = "product_title"]
2. //span[@class = "product_name"]
Thereafter, a similar call is used for the desc field. The size data is extracted using the add_css() method, and last_updated is filled with the value "yesterday" using the add_value() method.
Once all the data is collected, call the ItemLoader.load_item() method, which returns the item filled with the data extracted by the add_xpath(), add_css() and add_value() methods.
Input and Output Processors
Each field of an Item Loader contains one input processor and one output processor.
When data is extracted, the input processor processes it, and its result is stored in the ItemLoader.
Next, after collecting the data, call the ItemLoader.load_item() method to get the populated Item object.
Finally, the result of the output processor is assigned to the item.
The following code demonstrates how to call input and output processors for a specific field −
l = ItemLoader(Product(), some_selector)
l.add_xpath("title", xpath1)     # [1]
l.add_xpath("title", xpath2)     # [2]
l.add_css("title", css)          # [3]
l.add_value("title", "demo")     # [4]
return l.load_item()             # [5]
Line 1 − The title data is extracted from xpath1 and passed through the input processor, and its result is collected and stored in the ItemLoader.
Line 2 − Similarly, the title is extracted from xpath2 and passed through the same input processor, and its result is added to the data collected for [1].
Line 3 − The title is extracted from the css selector and passed through the same input processor, and the result is added to the data collected for [1] and [2].
Line 4 − Next, the value "demo" is assigned and passed through the input processors.
Line 5 − Finally, the data is collected internally from all the fields and passed to the output processor, and the final value is assigned to the Item.
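The flow described in lines 1 to 5 can be made concrete with a small pure-Python sketch. Note that MiniLoader is only a simplified stand-in written for illustration, not Scrapy's real ItemLoader API:

```python
from collections import defaultdict

class MiniLoader:
    """Simplified stand-in for an Item Loader (illustration only)."""
    def __init__(self, input_processor, output_processor):
        self._in = input_processor      # runs on every value as it is added
        self._out = output_processor    # runs once over all collected values
        self._values = defaultdict(list)

    def add_value(self, field, value):
        # The input processor runs immediately; its result is stored internally.
        self._values[field].append(self._in(value))

    def load_item(self):
        # The output processor runs on the collected list of values,
        # and its result becomes the final field value.
        return {field: self._out(values) for field, values in self._values.items()}

loader = MiniLoader(input_processor=str.strip, output_processor=" ".join)
loader.add_value("title", "  Hello ")
loader.add_value("title", " world  ")
item = loader.load_item()
print(item)  # {'title': 'Hello world'}
```

This mirrors the key property of the real loader: the input processor runs per value at collection time, while the output processor runs once, at load_item() time, over everything collected for the field.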
Declaring Input and Output Processors
The input and output processors are declared in the ItemLoader definition. Apart from this, they can also be specified in the Item Field metadata.
For example −
import scrapy
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags

def filter_size(value):
   if value.isdigit():
      return value

class Item(scrapy.Item):
   name = scrapy.Field(
      input_processor = MapCompose(remove_tags),
      output_processor = Join(),
   )
   size = scrapy.Field(
      input_processor = MapCompose(remove_tags, filter_size),
      output_processor = TakeFirst(),
   )

>>> from scrapy.loader import ItemLoader
>>> il = ItemLoader(item = Item())
>>> il.add_value('name', [u'Hello', u'<strong>world</strong>'])
>>> il.add_value('size', [u'<span>100</span>'])
>>> il.load_item()
It displays an output as −
{'name': u'Hello world', 'size': u'100'}
Item Loader Context
The Item Loader Context is a dict of arbitrary key values shared among input and output processors.
For example, assume you have a function parse_length −
def parse_length(text, loader_context):
   unit = loader_context.get('unit', 'cm')
   # You can write parsing code of length here
   return parsed_length
By accepting a loader_context argument, the function tells the Item Loader that it can receive the Item Loader context. There are several ways to change the value of the Item Loader context −
Modify current active Item Loader context −
loader = ItemLoader(product)
loader.context["unit"] = "mm"
On Item Loader instantiation −
loader = ItemLoader(product, unit = "mm")
On Item Loader declaration, for those input/output processors that are instantiated with the Item Loader context −
class ProductLoader(ItemLoader): length_out = MapCompose(parse_length, unit = "mm")
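To see why the loader_context argument matters, here is a rough pure-Python sketch of the mechanism (the apply_with_context helper is hypothetical; it only mimics how a processor such as MapCompose passes the active context to functions that declare a loader_context parameter):

```python
import inspect

def parse_length(text, loader_context):
    # Hypothetical parser: reads the target unit from the loader context.
    unit = loader_context.get("unit", "cm")
    value = float(text.split()[0])
    return f"{value * 10:.0f} mm" if unit == "mm" else f"{value:.0f} cm"

def apply_with_context(func, value, context):
    # Only functions that declare a loader_context parameter receive
    # the currently active Item Loader context.
    if "loader_context" in inspect.signature(func).parameters:
        return func(value, loader_context=context)
    return func(value)

print(apply_with_context(parse_length, "45 cm", {"unit": "mm"}))  # 450 mm
print(apply_with_context(parse_length, "45 cm", {}))              # 45 cm
```

The same parse_length function thus behaves differently depending on the context it is invoked with, which is exactly what the three configuration methods above control.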
ItemLoader Objects
An ItemLoader object is a new item loader used to populate the given item. It is defined by the following class −
class scrapy.loader.ItemLoader([item, selector, response, ]**kwargs)
The following table shows the parameters of ItemLoader objects −
Sr.No | Parameter & Description |
---|---|
1 | item It is the item to populate by calling add_xpath(), add_css() or add_value(). |
2 | selector It is used to extract data from websites. |
3 | response It is used to construct selector using default_selector_class. |
Following table shows the methods of ItemLoader objects −
Sr.No | Method & Description | Example |
---|---|---|
1 | get_value(value, *processors, **kwargs) It processes the given value using the specified processors and keyword arguments. |
>>> from scrapy.loader.processors import TakeFirst
>>> loader.get_value(u'title: demoweb', TakeFirst(), unicode.upper, re = 'title: (.+)')
'DEMOWEB' |
2 | add_value(field_name, value, *processors, **kwargs) It processes the value and adds it to the given field. The value is first passed through get_value() with the given processors and keyword arguments, and then through the field's input processor. |
loader.add_value('title', u'DVD')
loader.add_value('colors', [u'black', u'white'])
loader.add_value('length', u'80')
loader.add_value('price', u'2500') |
3 | replace_value(field_name, value, *processors, **kwargs) It replaces the collected data with a new value. |
loader.replace_value('title', u'DVD')
loader.replace_value('colors', [u'black', u'white'])
loader.replace_value('length', u'80')
loader.replace_value('price', u'2500') |
4 | get_xpath(xpath, *processors, **kwargs) It receives an XPath and extracts unicode strings, applying the given processors and keyword arguments. |
# HTML code: <span class = "item-name">DVD</span>
loader.get_xpath("//span[@class = 'item-name']")
# HTML code: <span id = "length">the length is 45cm</span>
loader.get_xpath("//span[@id = 'length']", TakeFirst(), re = "the length is (.*)") |
5 | add_xpath(field_name, xpath, *processors, **kwargs) It receives a field and an XPath from which unicode strings are extracted. |
# HTML code: <span class = "item-name">DVD</span>
loader.add_xpath('name', '//span[@class = "item-name"]')
# HTML code: <span id = "length">the length is 45cm</span>
loader.add_xpath('length', '//span[@id = "length"]', re = 'the length is (.*)') |
6 | replace_xpath(field_name, xpath, *processors, **kwargs) It replaces the collected data using XPath from sites. |
# HTML code: <span class = "item-name">DVD</span>
loader.replace_xpath('name', '//span[@class = "item-name"]')
# HTML code: <span id = "length">the length is 45cm</span>
loader.replace_xpath('length', '//span[@id = "length"]', re = 'the length is (.*)') |
7 | get_css(css, *processors, **kwargs) It receives CSS selector used to extract the unicode strings. |
loader.get_css("span.item-name")
loader.get_css("span#length", TakeFirst(), re = "the length is (.*)") |
8 | add_css(field_name, css, *processors, **kwargs) It is similar to add_value() method with one difference that it adds CSS selector to the field. |
loader.add_css('name', 'span.item-name')
loader.add_css('length', 'span#length', re = 'the length is (.*)') |
9 | replace_css(field_name, css, *processors, **kwargs) It replaces the extracted data using CSS selector. |
loader.replace_css('name', 'span.item-name')
loader.replace_css('length', 'span#length', re = 'the length is (.*)') |
10 | load_item() When the data is collected, this method fills the item with the collected data and returns it. |
def parse(self, response):
   l = ItemLoader(item = Product(), response = response)
   l.add_xpath('title', '//span[@class = "product_title"]')
   return l.load_item() |
11 | nested_xpath(xpath) It is used to create nested loaders with an XPath selector. |
loader = ItemLoader(item = Item())
header_loader = loader.nested_xpath('//header')
header_loader.add_xpath('social', 'a[@class = "social"]/@href')
header_loader.add_xpath('email', 'a[@class = "email"]/@href') |
12 | nested_css(css) It is used to create nested loaders with a CSS selector. |
loader = ItemLoader(item = Item())
header_loader = loader.nested_css('header')
header_loader.add_css('social', 'a.social::attr(href)')
header_loader.add_css('email', 'a.email::attr(href)') |
The following table shows the attributes of ItemLoader objects −
Sr.No | Attribute & Description |
---|---|
1 | item It is an object on which the Item Loader performs parsing. |
2 | context It is the currently active context of the Item Loader. |
3 | default_item_class It is the Item class used to instantiate items, when an item is not given in the constructor. |
4 | default_input_processor It is the default input processor, used for the fields that don't specify one. |
5 | default_output_processor It is the default output processor, used for the fields that don't specify one. |
6 | default_selector_class It is a class used to construct the selector, if it is not given in the constructor. |
7 | selector It is an object that can be used to extract the data from sites. |
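The fallback described for default_input_processor and default_output_processor can be sketched in a few lines of plain Python. The attribute names mirror Scrapy's, but the lookup logic shown is a simplification written for illustration:

```python
class DemoLoader:
    """Sketch of processor resolution: a field-specific <field>_in
    attribute wins over default_input_processor (simplified logic)."""
    default_input_processor = staticmethod(str.strip)
    default_output_processor = staticmethod(" ".join)

    size_in = staticmethod(str.upper)   # field-specific override for "size"

    def get_input_processor(self, field_name):
        # Use the field-specific processor if declared, else the default.
        return getattr(self, field_name + "_in", self.default_input_processor)

loader = DemoLoader()
print(loader.get_input_processor("title")("  hi "))  # falls back to default: hi
print(loader.get_input_processor("size")("10 kg"))   # uses the override: 10 KG
```

Here "title" has no title_in attribute, so stripping (the default) is applied, while "size" is routed through its own processor.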
Nested Loaders
Nested loaders are used to parse values from a subsection of a document. Without nested loaders, you need to specify the full XPath or CSS selector for each value that you want to extract.
For instance, assume that the data is being extracted from a header page −
<header>
   <a class = "social" href = "http://facebook.com/whatever">facebook</a>
   <a class = "social" href = "http://twitter.com/whatever">twitter</a>
   <a class = "email" href = "mailto:someone@example.com">send mail</a>
</header>
Next, you can create a nested loader with header selector by adding related values to the header −
loader = ItemLoader(item = Item())
header_loader = loader.nested_xpath('//header')
header_loader.add_xpath('social', 'a[@class = "social"]/@href')
header_loader.add_xpath('email', 'a[@class = "email"]/@href')
loader.load_item()
Reusing and extending Item Loaders
Item Loaders are designed to relieve maintenance, which becomes a fundamental problem when your project acquires more spiders.
For instance, assume that a site encloses its product names in three dashes (e.g. ---DVD---). If you don't want those dashes in the final product names, you can remove them by reusing and extending the default Product Item Loader, as shown in the following code −
from scrapy.loader.processors import MapCompose
from demoproject.ItemLoaders import DemoLoader

def strip_dashes(x):
   return x.strip('-')

class SiteSpecificLoader(DemoLoader):
   title_in = MapCompose(strip_dashes, DemoLoader.title_in)
Available Built-in Processors
Following are some of the commonly used built-in processors −
class scrapy.loader.processors.Identity
It returns the original value without altering it. For example −
>>> from scrapy.loader.processors import Identity
>>> proc = Identity()
>>> proc(['a', 'b', 'c'])
['a', 'b', 'c']
class scrapy.loader.processors.TakeFirst
It returns the first non-null/non-empty value from the list of received values. For example −
>>> from scrapy.loader.processors import TakeFirst
>>> proc = TakeFirst()
>>> proc(['', 'a', 'b', 'c'])
'a'
class scrapy.loader.processors.Join(separator = u' ')
It returns the values joined with the given separator. The default separator is u' ', which makes it equivalent to the function u' '.join. For example −
>>> from scrapy.loader.processors import Join
>>> proc = Join()
>>> proc(['a', 'b', 'c'])
u'a b c'
>>> proc = Join('<br>')
>>> proc(['a', 'b', 'c'])
u'a<br>b<br>c'
class scrapy.loader.processors.Compose(*functions, **default_loader_context)
It is a processor in which the input value is passed to the first function, the result of that function is passed to the second function, and so on, until the last function returns the final value as output.
For example −
>>> from scrapy.loader.processors import Compose
>>> proc = Compose(lambda v: v[0], str.upper)
>>> proc(['python', 'scrapy'])
'PYTHON'
class scrapy.loader.processors.MapCompose(*functions, **default_loader_context)
It is a processor in which the input value is iterated over and the first function is applied to each element. Next, the results of these function calls are concatenated to build a new iterable, which is then applied to the second function, and so on, until the last function.
For example −
>>> def filter_scrapy(x):
...    return None if x == 'scrapy' else x
>>> from scrapy.loader.processors import MapCompose
>>> proc = MapCompose(filter_scrapy, unicode.upper)
>>> proc([u'hi', u'everyone', u'im', u'scrapy'])
[u'HI', u'EVERYONE', u'IM']
class scrapy.loader.processors.SelectJmes(json_path)
This class queries the value using the provided JMESPath expression and returns the output.
For example −
>>> from scrapy.loader.processors import SelectJmes, Compose, MapCompose
>>> proc = SelectJmes("hello")
>>> proc({'hello': 'scrapy'})
'scrapy'
>>> proc({'hello': {'scrapy': 'world'}})
{'scrapy': 'world'}
The following code queries the value by first parsing JSON with json.loads −
>>> import json
>>> proc_single_json_str = Compose(json.loads, SelectJmes("hello"))
>>> proc_single_json_str('{"hello": "scrapy"}')
u'scrapy'
>>> proc_json_list = Compose(json.loads, MapCompose(SelectJmes('hello')))
>>> proc_json_list('[{"hello":"scrapy"}, {"world":"env"}]')
[u'scrapy']