A Blog Crawler Example with Scrapy

Target site: http://i.csensix.com (i.e. this blog)
Project: blogSpider

System environment

  1. CentOS 7.5
  2. Python 2.7.16
  3. Scrapy 1.7.3

Implementation

Create the project

scrapy startproject blogSpider
Running the command above generates the following directory structure:

blogSpider/
    scrapy.cfg              # deployment configuration file
    blogSpider/             # the project's Python module
        __init__.py
        items.py            # item definitions
        middlewares.py      # project middlewares
        pipelines.py        # project pipelines
        settings.py         # project settings
        spiders/            # directory for spider code
            __init__.py
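
The spider file itself can be scaffolded with Scrapy's genspider command (run from the project root; the spider name and domain below match this example):

cd blogSpider
scrapy genspider blog csensix.com    # creates blogSpider/spiders/blog.py with a minimal skeleton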

Write the spider: blogSpider/blogSpider/spiders/blog.py

For parsing the HTML, this example mainly uses the .xpath() method. Scrapy also provides a .css() method; which one to use is a matter of personal preference (a short comparison follows the spider code below).
# -*- coding: utf-8 -*-

# The next three lines work around garbled Chinese output (a Python 2-only hack)
import sys
reload(sys)
sys.setdefaultencoding('utf8')

import scrapy
from blogSpider.items import BlogspiderItem

class BlogSpider(scrapy.Spider):
    name = "blog"
    allowed_domains = ['csensix.com']
    start_urls = [
        'http://i.csensix.com/',
    ]

    # Pipelines specified per spider; this overrides ITEM_PIPELINES in settings.py
    custom_settings = {
        'ITEM_PIPELINES': {'blogSpider.pipelines.SqlitePipeline': 300,}
    }

    def parse(self, response):
        article_list = response.xpath('//article[@class="post"]')
        for article in article_list:
            # Extract the link from each post title
            href = article.xpath('./h2/a/@href').get()
            
            # Request the detail page
            yield scrapy.Request(
                href,
                callback = self.parse_detail
            )
        
        # Pagination: follow the next-page link
        next_url = response.xpath('//div[@id="main"]/ol/li[@class="next"]/a/@href').get()
        if next_url is not None:
            yield scrapy.Request(next_url, callback = self.parse)

    def parse_detail(self, response):
        item = BlogspiderItem()
        item['title'] = response.xpath('//article/h1[@class="post-title"]/a/text()').get()
        item['href'] = response.xpath('//article/h1[@class="post-title"]/a/@href').get()
        item['post_id'] = item['href'].split('/')[-2]
        item['author'] = response.xpath('//article/ul[@class="post-meta"]/li[1]/a/text()').get()
        item['publish_time'] = response.xpath('//article/ul[@class="post-meta"]/li[2]/time/text()').get()
        item['content'] = response.xpath('//article/div[@class="post-content"]/*').getall()
        item['content'] = ''.join(item['content'])
        yield item
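
For reference, the title link extracted in parse() with XPath could also be selected with .css(); this is just an illustrative equivalent under the same markup assumptions:

# XPath version used above
href = article.xpath('./h2/a/@href').get()
# Equivalent CSS selector version
href = article.css('h2 > a::attr(href)').get()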

Define the fields to scrape: blogSpider/blogSpider/items.py

# -*- coding: utf-8 -*-

import scrapy

class BlogspiderItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()           # post title
    href = scrapy.Field()            # post URL
    post_id = scrapy.Field()         # post ID
    author = scrapy.Field()          # author
    publish_time = scrapy.Field()    # publish time
    content = scrapy.Field()         # post body (HTML)
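
A BlogspiderItem behaves much like a dict, which is why the pipelines below can call dict(item) directly; a minimal sketch of how it is used (the sample value is made up):

item = BlogspiderItem()
item['title'] = u'Example title'   # only keys declared as Field() above are allowed
print(dict(item))                  # {'title': u'Example title'}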

Write the pipelines that process the scraped data: blogSpider/blogSpider/pipelines.py

This example stores the scraped data in a SQLite database: Python 2.5 and later ship with the sqlite3 module, so it is convenient to use and no separate database software needs to be installed. Exporting the data directly to a file would also work, but a database is more convenient to read back later.
# -*- coding: utf-8 -*-

import json
import sqlite3

# Class generated automatically by startproject; here it exports items to a JSON file
class BlogspiderPipeline(object):
    def __init__(self):
        self.f = open('blog.json', 'w')

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii = False) + ',\n'
        self.f.write(content)
        return item
    
    def close_spider(self, spider):
        self.f.close()


# Custom pipeline that stores items in SQLite
class SqlitePipeline(object):
    def __init__(self, sqlite_db):
        self.sqlite_db = sqlite_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            sqlite_db = crawler.settings.get('SQLITE_DB')
        )

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.sqlite_db)
        self.conn.text_factory = str
        self.cx = self.conn.cursor()
        # Clear the table before each crawl
        self.cx.execute("delete from posts")
        self.conn.commit()

    def process_item(self, item, spider):
        data = dict(item)
        
        # Encode unicode values as UTF-8, otherwise they end up garbled in the database
        for key in data.keys():
            data[key] = data[key].encode('utf-8')
        
        sql = 'insert into posts(post_id, title, href, author, content, publish_time) values(?, ?, ?, ?, ?, ?)'

        self.cx.execute(sql, (data['post_id'], data['title'], data['href'], data['author'], data['content'], data['publish_time']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
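
Note that open_spider() runs delete from posts, so a posts table must already exist in the database file pointed to by SQLITE_DB. The post itself does not show the schema; the sketch below is a guess based on the columns used in the INSERT statement:

# create_db.py -- hypothetical one-off script; adjust the path to your SQLITE_DB
import sqlite3

conn = sqlite3.connect('/root/blogSpider/blogSpider/data/blog.db')
conn.execute('''
    CREATE TABLE IF NOT EXISTS posts (
        post_id      TEXT,
        title        TEXT,
        href         TEXT,
        author       TEXT,
        content      TEXT,
        publish_time TEXT
    )
''')
conn.commit()
conn.close()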

Update the settings: blogSpider/blogSpider/settings.py

In this example the pipeline class is already specified in blog.py via custom_settings; a pipeline configuration in the spider overrides the one in settings.py.
# Enabled pipeline classes; lower values run earlier
ITEM_PIPELINES = {
    'blogSpider.pipelines.SqlitePipeline': 300,
    'blogSpider.pipelines.BlogspiderPipeline': 800,
}

# Export JSON feeds as UTF-8
FEED_EXPORT_ENCODING = 'utf-8'

# Path to the SQLite database file
# Replace with your own actual path
SQLITE_DB = '/root/blogSpider/blogSpider/data/blog.db'

Run the spider

scrapy crawl blog
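
After the crawl finishes, the data can be checked directly with the sqlite3 module; the snippet below is only an illustration, adjust the path to match your SQLITE_DB:

# check_results.py -- hypothetical helper to inspect the scraped posts
import sqlite3

conn = sqlite3.connect('/root/blogSpider/blogSpider/data/blog.db')
for post_id, title in conn.execute('SELECT post_id, title FROM posts'):
    print('%s  %s' % (post_id, title))
conn.close()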

Tags: python, sqlite3, Scrapy, crawler
