A Blog Crawler Example with Scrapy
Target site: http://i.csensix.com (i.e. this blog)
Project: blogSpider
Environment
- CentOS 7.5
- Python 2.7.16
- Scrapy 1.7.3
Implementation
Create the project
scrapy startproject blogSpider
Running the command above generates the following directory structure:
blogSpider/
    scrapy.cfg              # configuration file
    blogSpider/             # main code directory
        __init__.py
        items.py            # project item definitions
        middlewares.py      # project middlewares
        pipelines.py        # project pipelines
        settings.py         # project settings
        spiders/            # spiders directory
            __init__.py
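Instead of writing spiders/blog.py entirely by hand, the spider skeleton can also be generated with Scrapy's genspider command; the name and domain below are just this example's values:

cd blogSpider
scrapy genspider blog csensix.com
# creates blogSpider/spiders/blog.py with a minimal Spider skeleton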
Write the spider: blogSpider/blogSpider/spiders/blog.py
For parsing the HTML, this example mainly uses the .xpath() method. Scrapy also provides a .css() method; which one to use is largely a matter of personal preference (an equivalent .css() version of the key selectors is sketched after the spider code below).
# -*- coding: utf-8 -*-
# The next three lines prevent Chinese text from being garbled (Python 2 default encoding)
import sys
reload(sys)
sys.setdefaultencoding('utf8')

import scrapy
from blogSpider.items import BlogspiderItem


class BlogSpider(scrapy.Spider):
    name = "blog"
    allowed_domains = ['csensix.com']
    start_urls = [
        'http://i.csensix.com/',
    ]
    # Specify the pipeline for this spider; this overrides the setting in settings.py
    custom_settings = {
        'ITEM_PIPELINES': {'blogSpider.pipelines.SqlitePipeline': 300},
    }

    def parse(self, response):
        article_list = response.xpath('//article[@class="post"]')
        for article in article_list:
            # Extract the link from each post title
            href = article.xpath('./h2/a/@href').get()
            # Follow the link to the detail page
            yield scrapy.Request(href, callback=self.parse_detail)
        # Pagination: follow the "next" link if there is one
        next_url = response.xpath('//div[@id="main"]/ol/li[@class="next"]/a/@href').get()
        if next_url is not None:
            yield scrapy.Request(next_url, callback=self.parse)

    def parse_detail(self, response):
        item = BlogspiderItem()
        item['title'] = response.xpath('//article/h1[@class="post-title"]/a/text()').get()
        item['href'] = response.xpath('//article/h1[@class="post-title"]/a/@href').get()
        item['post_id'] = item['href'].split('/')[-2]
        item['author'] = response.xpath('//article/ul[@class="post-meta"]/li[1]/a/text()').get()
        item['publish_time'] = response.xpath('//article/ul[@class="post-meta"]/li[2]/time/text()').get()
        # Grab all child nodes of the post body and join them into one HTML string
        item['content'] = response.xpath('//article/div[@class="post-content"]/*').getall()
        item['content'] = ''.join(item['content'])
        yield item
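For comparison, the two title selectors used above could also be written with .css() instead of .xpath(). This is only a sketch of equivalent selectors, not part of the project code:

# XPath as used in the spider:
#   article.xpath('./h2/a/@href').get()
#   response.xpath('//article/h1[@class="post-title"]/a/text()').get()
# Equivalent CSS selectors:
href = article.css('h2 > a::attr(href)').get()
title = response.css('article > h1.post-title > a::text').get()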
Define the fields to scrape: blogSpider/blogSpider/items.py
# -*- coding: utf-8 -*-
import scrapy


class BlogspiderItem(scrapy.Item):
    # Define the fields for your item here
    title = scrapy.Field()         # post title
    href = scrapy.Field()          # post URL
    post_id = scrapy.Field()       # post ID
    author = scrapy.Field()        # author
    publish_time = scrapy.Field()  # publish time
    content = scrapy.Field()       # post body
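A BlogspiderItem behaves much like a dict, which is exactly what the pipelines below rely on when they call dict(item). A quick illustration (the values here are made up):

from blogSpider.items import BlogspiderItem

item = BlogspiderItem()
item['title'] = 'Hello'      # fields are assigned with dict-style access
item['post_id'] = '42'
print dict(item)             # {'title': 'Hello', 'post_id': '42'}
# Assigning an undeclared field, e.g. item['foo'] = 1, raises KeyError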
Write the pipelines that process the scraped data: blogSpider/blogSpider/pipelines.py
This example stores the scraped data in an SQLite database: Python 2.5 and later ship with the sqlite3 module, so it is convenient to use and no separate database software needs to be installed. Exporting straight to a file would also work, but a database is easier to query back later.
# -*- coding: utf-8 -*-
import json
import sqlite3


# Class generated automatically by startproject: dumps items to a JSON file
class BlogspiderPipeline(object):
    def __init__(self):
        self.f = open('blog.json', 'w')

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.f.write(content)
        return item

    def close_spider(self, spider):
        self.f.close()


# Custom pipeline: writes items to SQLite
class SqlitePipeline(object):
    def __init__(self, sqlite_db):
        self.sqlite_db = sqlite_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            sqlite_db=crawler.settings.get('SQLITE_DB')
        )

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.sqlite_db)
        self.conn.text_factory = str
        self.cx = self.conn.cursor()
        # Empty the table before each crawl
        self.cx.execute("delete from posts")
        self.conn.commit()  # note the parentheses: commit() is a method call

    def process_item(self, item, spider):
        data = dict(item)
        # Encode unicode values as UTF-8, otherwise they end up garbled in the database
        for key in data.keys():
            data[key] = data[key].encode('utf-8')
        sql = 'insert into posts(post_id, title, href, author, content, publish_time) values(?, ?, ?, ?, ?, ?)'
        self.cx.execute(sql, (data['post_id'], data['title'], data['href'], data['author'], data['content'], data['publish_time']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
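SqlitePipeline assumes the posts table already exists in blog.db; it is not created anywhere in the code above. A minimal one-off script to create it might look like the following, with every column simply typed as TEXT (an assumption; adjust the types and the path to taste):

# -*- coding: utf-8 -*-
# One-off helper: create the posts table used by SqlitePipeline
import sqlite3

conn = sqlite3.connect('/root/blogSpider/blogSpider/data/blog.db')  # same path as SQLITE_DB
conn.execute('''
    create table if not exists posts (
        post_id      TEXT,
        title        TEXT,
        href         TEXT,
        author       TEXT,
        content      TEXT,
        publish_time TEXT
    )
''')
conn.commit()
conn.close()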
Modify the settings: blogSpider/blogSpider/settings.py
This example already specifies the pipeline class in blog.py via custom_settings (see the spider code above); a pipeline mapping declared in the spider overrides the one configured in settings.py.
# Enable these pipeline classes; lower numbers run first
ITEM_PIPELINES = {
    'blogSpider.pipelines.SqlitePipeline': 300,
    'blogSpider.pipelines.BlogspiderPipeline': 800,
}
# Export JSON feeds as UTF-8
FEED_EXPORT_ENCODING = 'utf-8'
# Path of the SQLite database file
# Replace with the actual path on your own machine
SQLITE_DB = '/root/blogSpider/blogSpider/data/blog.db'
Run the spider
scrapy crawl blog
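Once the crawl finishes, a quick way to confirm the data actually landed in SQLite (using the same path as SQLITE_DB above):

# -*- coding: utf-8 -*-
# Quick sanity check of the scraped data
import sqlite3

conn = sqlite3.connect('/root/blogSpider/blogSpider/data/blog.db')
cur = conn.cursor()
cur.execute('select count(*) from posts')
print 'rows saved:', cur.fetchone()[0]
for title, publish_time in cur.execute('select title, publish_time from posts limit 5'):
    print publish_time, title
conn.close()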