To save a screenshot of a Scrapy response, you can write a custom downloader middleware (called ScreenshotMiddleware below). Here is a solution with code examples:

First, make sure the scrapy-splash and Pillow Python libraries are installed.
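If they are not installed yet, they can typically be added with pip, and Splash itself runs as a Docker container (scrapinghub/splash is the official image):

```shell
# Install the two Python libraries used below
pip install scrapy-splash Pillow

# Start a Splash instance listening on port 8050
docker run -p 8050:8050 scrapinghub/splash
```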
Add the following configuration to your Scrapy project's settings.py file:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'your_project_name.middlewares.ScreenshotMiddleware': 800,  # add this line
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
SPLASH_URL = 'http://localhost:8050/'
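For reference, the SPLASH_URL configured above is the base address of Splash's HTTP API, and a screenshot is produced by its render.png endpoint, which returns PNG bytes for the rendered page. A minimal sketch (assuming the default SPLASH_URL value above) of how such a request URL is composed:

```python
from urllib.parse import urlencode

SPLASH_URL = 'http://localhost:8050/'  # same value as in settings.py

def splash_png_url(page_url, width=1024, height=768):
    # Splash's render.png endpoint returns a PNG screenshot of the page
    # it renders; width and height control the viewport size.
    query = urlencode({'url': page_url, 'width': width, 'height': height})
    return f"{SPLASH_URL}render.png?{query}"
```

Fetching such a URL yields a response whose body is image data rather than HTML, which is what the middleware below expects to decode.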
Then, add the following code to the project's middlewares.py file:

from io import BytesIO

from scrapy import signals
from PIL import Image


class ScreenshotMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_opened(self, spider):
        self.spider = spider

    def spider_closed(self, spider):
        self.spider = None

    def process_response(self, request, response, spider):
        # Only try to decode the body as an image when Splash actually
        # returned one (e.g. from the render.png endpoint); a normal
        # HTML response cannot be opened by Pillow.
        content_type = response.headers.get('Content-Type', b'').decode('latin-1')
        if response.status == 200 and content_type.startswith('image/'):
            screenshot = self.take_screenshot(response)
            screenshot.save(f"{spider.name}_{response.url.replace('/', '_')}.png")
        return response

    def take_screenshot(self, response):
        # response.body is raw bytes; wrap it in BytesIO so Pillow can
        # read it as an in-memory file.
        image = Image.open(BytesIO(response.body))
        return image
The ScreenshotMiddleware above calls the take_screenshot method when a response comes back, converts the response body into an image, and saves it as a PNG file. The file name is built from the spider name and the response URL, with slashes (/) replaced by underscores (_). Note that the body can only be decoded as an image if the request actually asked Splash for a rendered picture (for example via its render.png endpoint) rather than plain HTML. This example also assumes that you have a Splash service configured and running at http://localhost:8050/; if you use a different host or port, adjust the SPLASH_URL setting accordingly.

Hope this solves your problem!
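One caveat on the file names: replacing only the slashes in a URL can still leave characters (such as ':' or '?') that are invalid in file names on some systems. A more defensive, hypothetical helper (the function name is illustrative, not part of the code above) might look like this:

```python
import re
from urllib.parse import urlparse

def screenshot_filename(spider_name, url, ext='png'):
    # Derive a filesystem-safe file name from the spider name and the
    # response URL: keep only letters, digits, dots, underscores and
    # hyphens, collapsing everything else into single underscores.
    parsed = urlparse(url)
    slug = re.sub(r'[^A-Za-z0-9._-]+', '_', parsed.netloc + parsed.path)
    return f"{spider_name}_{slug.strip('_')}.{ext}"
```

This could replace the inline f-string in process_response if you run into invalid file names.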