Scrapy - Settings
Description
The behavior of Scrapy components can be modified using Scrapy settings. The settings can also select the Scrapy project that is currently active, in case you have multiple Scrapy projects.
Designating the Settings
You must tell Scrapy which settings you are using when you scrape a website. For this, the environment variable SCRAPY_SETTINGS_MODULE should be used, and its value should be in Python path syntax.
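For instance, you can set this variable before loading the project settings in a standalone script. A minimal sketch, assuming a hypothetical settings module named myproject.settings:

```
import os

# Tell Scrapy which settings module to use (Python path syntax).
# "myproject.settings" is a hypothetical module name.
os.environ["SCRAPY_SETTINGS_MODULE"] = "myproject.settings"

from scrapy.utils.project import get_project_settings

# get_project_settings() reads SCRAPY_SETTINGS_MODULE and returns a Settings object.
settings = get_project_settings()
print(settings.get("BOT_NAME"))
```

When you run commands through the scrapy tool inside a project, this variable is normally set for you based on the project's scrapy.cfg file.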
Populating the Settings
The following table shows some of the mechanisms by which you can populate the settings, in decreasing order of precedence −
Sr.No | Mechanism & Description |
---|---|
1 | Command line options Here, the arguments passed on the command line take the highest precedence, overriding all other options. The -s (or --set) option is used to override one or more settings, for example: scrapy crawl myspider -s LOG_FILE=scrapy.log |
2 | Settings per-spider Spiders can have their own settings that override the project settings, by using the custom_settings attribute, for example custom_settings = {'SOME_SETTING': 'some value'} (see the sketch after this table). |
3 | Project settings module Here, you can populate your custom settings, such as adding or modifying the settings in the settings.py file of your project. |
4 | Default settings per-command Each Scrapy tool command defines its own settings in the default_settings attribute, to override the global default settings. |
5 | Default global settings These settings are found in the scrapy.settings.default_settings module. |
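The sketch below illustrates the two highest-precedence mechanisms together. The spider name and the chosen setting values are only illustrative:

```
import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["http://example.com"]

    # Per-spider settings: these override the values in the project's settings.py.
    custom_settings = {
        "DOWNLOAD_DELAY": 2,
        "LOG_LEVEL": "INFO",
    }

    def parse(self, response):
        pass
```

Running scrapy crawl demo -s LOG_LEVEL=DEBUG would still override the spider's LOG_LEVEL, because command-line options take the highest precedence.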
Access Settings
Settings are available through self.settings, which is set in the base Spider class after the spider is initialized.
The following example demonstrates this.
```
import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["http://example.com"]

    def parse(self, response):
        print("Existing settings: %s" % self.settings.attributes.keys())
```
To use settings before the spider is initialized (for example, in your spider's __init__() method), you must override the from_crawler() class method. You can access the settings through the scrapy.crawler.Crawler.settings attribute of the crawler object passed to from_crawler().
The following example demonstrates this.
```
class MyExtension(object):
    def __init__(self, log_is_enabled=False):
        if log_is_enabled:
            print("Enabled log")

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(settings.getbool("LOG_ENABLED"))
```
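Besides getbool(), the Settings object provides other typed accessors such as get(), getint(), getfloat(), getlist() and getdict(). A small standalone sketch; the MYEXT_ITEM_LIMIT name is a made-up, non-built-in setting used only for illustration, and the other values are illustrative overrides:

```
from scrapy.settings import Settings

# Values passed here are illustrative overrides on top of Scrapy's defaults.
settings = Settings({
    "DOWNLOAD_DELAY": "2.5",
    "SPIDER_MODULES": "pkg.spiders,pkg.more_spiders",
})

print(settings.getfloat("DOWNLOAD_DELAY"))      # 2.5 as a float
print(settings.getlist("SPIDER_MODULES"))       # ['pkg.spiders', 'pkg.more_spiders']
print(settings.getint("MYEXT_ITEM_LIMIT", 10))  # hypothetical setting, falls back to 10
```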
Rationale for Setting Names
Setting names are usually prefixed with the name of the component they configure. For example, for the robots.txt extension, the setting names can be ROBOTSTXT_ENABLED, ROBOTSTXT_OBEY, ROBOTSTXT_CACHEDIR, etc.
Built-in Settings Reference
The following table shows the built-in settings of Scrapy; a sample settings.py that pulls a few of them together is sketched after the table −
Sr.No | Setting & Description |
---|---|
1 | AWS_ACCESS_KEY_ID It is used to access Amazon Web Services. Default value: None |
2 | AWS_SECRET_ACCESS_KEY It is used to access Amazon Web Services. Default value: None |
3 | BOT_NAME It is the name of bot that can be used for constructing User-Agent. Default value: scrapybot |
4 | CONCURRENT_ITEMS Maximum number of concurrent items to process in parallel in the Item Processor (Item Pipeline). Default value: 100 |
5 | CONCURRENT_REQUESTS Maximum number of concurrent requests that the Scrapy downloader performs. Default value: 16 |
6 | CONCURRENT_REQUESTS_PER_DOMAIN Maximum number of concurrent requests performed simultaneously to any single domain. Default value: 8 |
7 | CONCURRENT_REQUESTS_PER_IP Maximum number of concurrent requests performed simultaneously to any single IP. Default value: 0 |
8 | DEFAULT_ITEM_CLASS It is a class used to represent items. Default value: scrapy.item.Item |
9 | DEFAULT_REQUEST_HEADERS The default headers used for Scrapy HTTP requests. Default value − { "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en", } |
10 | DEPTH_LIMIT The maximum depth for a spider to crawl any site. Default value: 0 |
11 | DEPTH_PRIORITY It is an integer used to alter the priority of request according to the depth. Default value: 0 |
12 | DEPTH_STATS It states whether to collect depth stats or not. Default value: True |
13 | DEPTH_STATS_VERBOSE When this setting is enabled, the number of requests is collected in the stats for each depth separately. Default value: False |
14 | DNSCACHE_ENABLED It is used to enable the in-memory DNS cache. Default value: True |
15 | DNSCACHE_SIZE It defines the size of the in-memory DNS cache. Default value: 10000 |
16 | DNS_TIMEOUT It is the timeout (in seconds) for processing DNS queries. Default value: 60 |
17 | DOWNLOADER It is the downloader used for the crawling process. Default value: scrapy.core.downloader.Downloader |
18 | DOWNLOADER_MIDDLEWARES It is a dictionary holding downloader middleware and their orders. Default value: {} |
19 | DOWNLOADER_MIDDLEWARES_BASE It is a dictionary holding downloader middlewares that are enabled by default. Default value − { 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100, } |
20 | DOWNLOADER_STATS This setting is used to enable the downloader stats. Default value: True |
21 | DOWNLOAD_DELAY The time (in seconds) the downloader waits before downloading consecutive pages from the same website. Default value: 0 |
22 | DOWNLOAD_HANDLERS It is a dictionary with download handlers. Default value: {} |
23 | DOWNLOAD_HANDLERS_BASE It is a dictionary with download handlers that are enabled by default. Default value − { 'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler', } |
24 | DOWNLOAD_TIMEOUT It is the total time for downloader to wait before it times out. Default value: 180 |
25 | DOWNLOAD_MAXSIZE It is the maximum size of response for the downloader to download. Default value: 1073741824 (1024MB) |
26 | DOWNLOAD_WARNSIZE It defines the size of response for downloader to warn. Default value: 33554432 (32MB) |
27 | DUPEFILTER_CLASS It is a class used for detecting and filtering duplicate requests. Default value: scrapy.dupefilters.RFPDupeFilter |
28 | DUPEFILTER_DEBUG When set to true, the duplicate filter logs all duplicate requests instead of only the first one. Default value: False |
29 | EDITOR It is used to edit spiders using the edit command. Default value: Depends on the environment |
30 | EXTENSIONS It is a dictionary having extensions that are enabled in the project. Default value: {} |
31 | EXTENSIONS_BASE It is a dictionary having built-in extensions. Default value: { 'scrapy.extensions.corestats.CoreStats': 0, } |
32 | FEED_TEMPDIR It is a directory used to set the custom folder where crawler temporary files can be stored. |
33 | ITEM_PIPELINES It is a dictionary holding the item pipelines to use and their orders. Default value: {} |
34 | LOG_ENABLED It defines if the logging is to be enabled. Default value: True |
35 | LOG_ENCODING It defines the type of encoding to be used for logging. Default value: utf-8 |
36 | LOG_FILE It is the name of the file to be used for the output of logging. Default value: None |
37 | LOG_FORMAT It is a string using which the log messages can be formatted. Default value: %(asctime)s [%(name)s] %(levelname)s: %(message)s |
38 | LOG_DATEFORMAT It is a string using which date/time can be formatted. Default value: %Y-%m-%d %H:%M:%S |
39 | LOG_LEVEL It defines minimum log level. Default value: DEBUG |
40 | LOG_STDOUT If this setting is set to true, all the standard output of your process will appear in the log. Default value: False |
41 | MEMDEBUG_ENABLED It defines if the memory debugging is to be enabled. Default Value: False |
42 | MEMDEBUG_NOTIFY It defines the memory report that is sent to a particular address when memory debugging is enabled. Default value: [] |
43 | MEMUSAGE_ENABLED It defines whether memory usage monitoring is enabled, so that Scrapy can act when a process exceeds the memory limit. Default value: False |
44 | MEMUSAGE_LIMIT_MB It defines the maximum amount of memory (in megabytes) to be allowed. Default value: 0 |
45 | MEMUSAGE_CHECK_INTERVAL_SECONDS It sets the length of the interval (in seconds) at which the current memory usage is checked. Default value: 60.0 |
46 | MEMUSAGE_NOTIFY_MAIL It is used to notify a list of e-mail addresses when the memory reaches the limit. Default value: False |
47 | MEMUSAGE_REPORT It defines if the memory usage report is to be sent on closing each spider. Default value: False |
48 | MEMUSAGE_WARNING_MB It defines a total memory to be allowed before a warning is sent. Default value: 0 |
49 | NEWSPIDER_MODULE It is the module where a new spider is created using the genspider command. Default value: '' (empty string) |
50 | RANDOMIZE_DOWNLOAD_DELAY If enabled, Scrapy waits a random amount of time (between 0.5 and 1.5 times DOWNLOAD_DELAY) while downloading requests from the same site. Default value: True |
51 | REACTOR_THREADPOOL_MAXSIZE It defines a maximum size for the reactor threadpool. Default value: 10 |
52 | REDIRECT_MAX_TIMES It defines how many times a request can be redirected. Default value: 20 |
53 | REDIRECT_PRIORITY_ADJUST This setting when set, adjusts the redirect priority of a request. Default value: +2 |
54 | RETRY_PRIORITY_ADJUST This setting when set, adjusts the retry priority of a request. Default value: -1 |
55 | ROBOTSTXT_OBEY Scrapy obeys robots.txt policies when set to true. Default value: False |
56 | SCHEDULER It defines the scheduler to be used for crawl purpose. Default value: scrapy.core.scheduler.Scheduler |
57 | SPIDER_CONTRACTS It is a dictionary in the project having spider contracts to test the spiders. Default value: {} |
58 | SPIDER_CONTRACTS_BASE It is a dictionary holding the Scrapy contracts which are enabled in Scrapy by default. Default value − { 'scrapy.contracts.default.UrlContract': 1, 'scrapy.contracts.default.ReturnsContract': 2, } |
59 | SPIDER_LOADER_CLASS It defines a class which implements SpiderLoader API to load spiders. Default value: scrapy.spiderloader.SpiderLoader |
60 | SPIDER_MIDDLEWARES It is a dictionary holding spider middlewares. Default value: {} |
61 | SPIDER_MIDDLEWARES_BASE It is a dictionary holding spider middlewares that are enabled in Scrapy by default. Default value − { 'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50, } |
62 | SPIDER_MODULES It is a list of modules containing spiders which Scrapy will look for. Default value: [] |
63 | STATS_CLASS It is a class which implements Stats Collector API to collect stats. Default value: scrapy.statscollectors.MemoryStatsCollector |
64 | STATS_DUMP This setting when set to true, dumps the stats to the log. Default value: True |
65 | STATSMAILER_RCPTS Once the spiders finish scraping, Scrapy sends the stats to the e-mail addresses listed in this setting. Default value: [] |
66 | TELNETCONSOLE_ENABLED It defines whether to enable the telnetconsole. Default value: True |
67 | TELNETCONSOLE_PORT It defines a port for telnet console. Default value: [6023, 6073] |
68 | TEMPLATES_DIR It is a directory containing templates that can be used while creating new projects. Default value: templates directory inside scrapy module |
69 | URLLENGTH_LIMIT It defines the maximum allowed URL length for crawled URLs. Default value: 2083 |
70 | USER_AGENT It defines the user agent to be used while crawling a site. Default value: "Scrapy/VERSION (+http://scrapy.org)" |
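As a minimal sketch of how several of these settings fit together in a project's settings.py, assuming a hypothetical project named demoproject (the module paths, pipeline class, and values below are illustrative only, not recommendations):

```
# settings.py -- hypothetical project configuration, values are illustrative only

BOT_NAME = "demobot"

SPIDER_MODULES = ["demoproject.spiders"]     # hypothetical spider package
NEWSPIDER_MODULE = "demoproject.spiders"

ROBOTSTXT_OBEY = True        # obey robots.txt policies
CONCURRENT_REQUESTS = 8      # lower than the default of 16
DOWNLOAD_DELAY = 1.5         # wait 1.5 s between requests to the same site
DOWNLOAD_TIMEOUT = 60        # time out slow responses sooner than the default 180 s

DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

ITEM_PIPELINES = {
    "demoproject.pipelines.DemoPipeline": 300,   # hypothetical pipeline class
}

LOG_LEVEL = "INFO"
LOG_FILE = "scrapy.log"
```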
For other Scrapy settings, refer to the official Scrapy documentation.