Scrapy - Link Extractors
  • Date: 2024-11-03


Description

As the name indicates, link extractors are objects that extract links from web pages, which Scrapy represents as scrapy.http.Response objects. Scrapy ships with a built-in extractor, scrapy.linkextractors.LinkExtractor, and you can write your own link extractor by implementing a simple interface.

Every link extractor has a public method called extract_links, which takes a Response object and returns a list of scrapy.link.Link objects. A link extractor is meant to be instantiated once, with its extract_links method then called several times on different responses. The CrawlSpider class uses link extractors through a set of rules whose main purpose is to extract links.

Built-in Link Extractors Reference

The link extractors bundled with Scrapy are provided in the scrapy.linkextractors module. The default link extractor is LinkExtractor, which is equivalent in functionality to LxmlLinkExtractor:

from scrapy.linkextractors import LinkExtractor

LxmlLinkExtractor

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow = (), deny = (),
   allow_domains = (), deny_domains = (), deny_extensions = None, restrict_xpaths = (),
   restrict_css = (), tags = ('a', 'area'), attrs = ('href',),
   canonicalize = True, unique = True, process_value = None)

LxmlLinkExtractor is the recommended link extractor because it offers handy filtering options and is built on lxml's robust HTMLParser.

Sr.No. Parameter & Description

1. allow (a regular expression, or list of regular expressions)

   A single expression or group of expressions that a URL must match in order to be extracted. If it is not given or is empty, all links match.

2. deny (a regular expression, or list of regular expressions)

   A single expression or group of expressions that a URL must match in order to be excluded. If it is not given or is empty, no links are excluded.

3. allow_domains (str or list)

   A single string or list of strings naming the domains from which links are to be extracted.

4. deny_domains (str or list)

   A single string or list of strings naming the domains from which links are not to be extracted.

5. deny_extensions (list)

   A list of file extensions to ignore when extracting links. If it is not set, it defaults to IGNORED_EXTENSIONS, a predefined list in the scrapy.linkextractors package.

6. restrict_xpaths (str or list)

   An XPath, or list of XPaths, defining the regions of the response from which links are to be extracted. If given, only the content selected by those XPaths is scanned for links.

7. restrict_css (str or list)

   Behaves like restrict_xpaths, but the regions of the response are selected with CSS selectors.

8. tags (str or list)

   A single tag or a list of tags to consider when extracting links. By default, it is ('a', 'area').

9. attrs (list)

   A single attribute or list of attributes to consider when extracting links, looked up only on the tags named in the tags parameter. By default, it is ('href',).

10. canonicalize (boolean)

   Whether each extracted URL is brought to a standard form using scrapy.utils.url.canonicalize_url. The signature above gives the default as True; newer Scrapy releases document the default as False, so check the version you are using.

11. unique (boolean)

   Whether duplicate filtering is applied to the extracted links. By default, it is True.

12. process_value (callable)

   A function that receives each value extracted from the scanned tags and attributes. It can modify the value and return a new one, or return None to reject the link altogether. If not given, it defaults to lambda x: x.

Example

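Suppose links have to be extracted from the following markup:

<a href = "javascript:goToPage('../other/page.html'); return false">Link text</a>

The following function can be used in process_value:

import re

def process_value(val):
    # Pull the real target out of the javascript:goToPage('...') wrapper.
    m = re.search(r"javascript:goToPage\('(.*?)'", val)
    if m:
        return m.group(1)
    # No match: the function returns None, which rejects the link.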
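Note that the function above returns nothing for any value that does not match the pattern, so ordinary links are rejected along with malformed ones. If that is not wanted, one possible variant passes non-matching values through unchanged. The name process_value_keep_plain and the sample values are hypothetical, introduced only for this sketch.

```python
import re

# Matches the javascript:goToPage('...') wrapper and captures the quoted path.
GO_TO_PAGE = re.compile(r"javascript:goToPage\('(.*?)'")


def process_value_keep_plain(value):
    """Hypothetical variant: unwrap goToPage() calls, keep other values as-is."""
    m = GO_TO_PAGE.search(value)
    if m:
        return m.group(1)
    # Ordinary URLs are returned unchanged instead of being rejected.
    return value


samples = [
    "javascript:goToPage('../other/page.html'); return false",
    "/about.html",
]
results = [process_value_keep_plain(s) for s in samples]
print(results)
```

Which behaviour is right depends on the site: rejecting non-matching values suits pages where every real link goes through the JavaScript wrapper, while the pass-through variant suits pages that mix both styles.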