Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
$ scrapy -h
Scrapy 2.11.0 - active project: fz_spider

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
from scrapy import Request  # module-level import needed for Request

def start_requests(self):
    meta = {
        # 'dont_redirect': True,
        # handle_httpstatus_all=True makes Scrapy pass every HTTP status
        # code to the callback; by default only 200-300 responses are handled
        'handle_httpstatus_all': True
    }
    with open('other_202309271402', encoding='utf-8') as input_data:
        urls = input_data.readlines()
    for iurl in urls:
        # strip the trailing newline that readlines() keeps on each line
        yield Request(url='http://{url}'.format(url=iurl.strip()),
                      callback=self.parse, meta=meta)
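As a side note, Scrapy also accepts the meta key handle_httpstatus_list, which whitelists specific status codes per request instead of letting all of them through; the particular codes below are only an illustrative assumption, not something the original code uses:

    meta = {
        # only these error codes are passed through to the callback;
        # other codes outside 200-300 are still filtered out
        'handle_httpstatus_list': [403, 404, 500, 502, 503],
    }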
The Request object has a few keyword arguments worth paying attention to:
The first is url, which is self-explanatory: the URL to crawl.
The second is callback, which designates the parse method as the callback function.
The last is meta, which controls details of how the request is handled, such as whether to follow 302 redirects and whether to process every HTTP status code. Taking the latter as an example: by default Scrapy only handles successful responses with status codes between 200 and 300, while 4xx and 5xx responses are dropped and never reach the parse method. If we want to record those 4xx and 5xx URLs and analyze them afterwards, we need to add the corresponding option to meta (set handle_httpstatus_all to True) and then deal with the error responses in the callback, as in the sketch below.
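A minimal sketch of what such a parse callback might look like (the log file name failed_urls.txt and the >= 400 check are assumptions for illustration, not from the original post):

def parse(self, response):
    if response.status >= 400:
        # with handle_httpstatus_all=True, 4xx/5xx responses also reach
        # this callback, so we can persist them for later analysis
        with open('failed_urls.txt', 'a', encoding='utf-8') as f:
            f.write('{status}\t{url}\n'.format(status=response.status,
                                               url=response.url))
        return
    # normal extraction logic for successful (2xx) responses goes here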