Python Crawler Practice: a Weibo Hot Search Crawler, a Weibo Crawler, and a Blog Post Crawler

Posted on 2022-05-23 | Updated on 2023-11-14 | Category: Python

By Long Luo

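Weibo Hot Search Crawler

A few days ago my niece asked me to help with a small assignment: writing a crawler in Python. I looked up some material online¹ and found an example crawler on GitHub², then simplified it so that it only crawls Weibo hot search data and stores it in Excel. It is written in a Jupyter notebook.

The full code is at https://github.com/longluo/spider/blob/master/weibohot.ipynb .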

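Weibo Crawler

The Weibo crawler³ has far more code and features than I need. For now I only need to:

Crawl a user's profile information;
Crawl all posts within a given time period;
Store the data in Excel and in a database.

So its code needs to be trimmed down.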
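
As a rough sketch of the "posts within a given time period" requirement, the filter below keeps only posts whose publish time falls inside a closed range. The `posts` structure, the `created_at` key, and the `filter_by_period` name are assumptions made for illustration; they are not taken from the original crawler.

```python
from datetime import datetime


def filter_by_period(posts, start, end):
    """Keep posts whose 'created_at' lies in [start, end], both ends inclusive."""
    return [p for p in posts if start <= p["created_at"] <= end]


# Hypothetical sample data; a real crawler would fill this from parsed pages.
posts = [
    {"id": "a", "created_at": datetime(2022, 4, 30, 23, 59)},
    {"id": "b", "created_at": datetime(2022, 5, 1, 8, 0)},
    {"id": "c", "created_at": datetime(2022, 5, 23, 12, 30)},
    {"id": "d", "created_at": datetime(2022, 6, 1, 0, 0)},
]

in_may = filter_by_period(
    posts, datetime(2022, 5, 1), datetime(2022, 5, 31, 23, 59, 59)
)
print([p["id"] for p in in_may])
```

Filtering after parsing keeps the crawling loop simple, at the cost of fetching pages that may contain no matching posts.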

Weibo API

Weibo API for a user's profile information: https://weibo.cn/1565668374/info

Weibo API for all of a user's posts: https://weibo.cn/1565668374?page=1

The full code is at https://github.com/longluo/spider/blob/master/weibo.py .
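
The two URLs above differ only in the user ID and the page number, so they can be built by small helpers. This is a minimal sketch; the names `info_url` and `posts_url` are hypothetical and do not come from weibo.py.

```python
from urllib.parse import urlencode

BASE_URL = "https://weibo.cn"


def info_url(user_id):
    """URL of the profile information page for a user."""
    return f"{BASE_URL}/{user_id}/info"


def posts_url(user_id, page=1):
    """URL of one page of a user's posts; pages start at 1."""
    return f"{BASE_URL}/{user_id}?{urlencode({'page': page})}"


print(info_url("1565668374"))
print(posts_url("1565668374", page=1))
```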

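Blog Crawler

After finishing the two crawlers above, I struck while the iron was hot and finally completed a feature I had long wanted to build.

Features

Fetch all posts of a given site, for example http://www.longluo.me/ .
Content to crawl: article title, publication time, category, link, body (as HTML), and so on.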

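Analysis

Send a request to the target page

The headers carry the request's metadata. The more closely they resemble what a real browser sends, the more likely the target site is to treat the request as coming from a real user. My site has no anti-crawler measures, so the headers contain only the User-Agent key-value pair, which describes the operating system, browser, and so on.

Get the title, publication time, and link of each post on the blog;
Write the data to Excel;
Store the data in the database.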
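
The request step can be sketched with the standard library alone. This is not necessarily how the `get_html` method in blog.py is written; the function names, the timeout, and the User-Agent string below are assumptions for illustration.

```python
import urllib.request

# Assumed User-Agent value for illustration; any realistic browser string works.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url):
    """Create a GET request that carries only a User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def get_html(url, timeout=10):
    """Fetch a page and return its text. This performs a real network request."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Building the request does not touch the network, so it is safe to inspect.
req = build_request("http://www.longluo.me/archives/")
print(req.get_method(), req.full_url)
print(req.get_header("User-agent"))  # urllib stores header names capitalized
```

Keeping `build_request` separate from `get_html` makes it easy to add more headers later if a site starts rejecting requests.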

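Implementation

Get the titles, publication times, and links of all posts

The full list of posts lives at http://www.longluo.me/archives/ , so the first thing to find out is how many pages there are in total.

Total number of pages

Looking at the page source, we can see the total number of pages in <span class="space">…</span><a class="page-number" href="/archives/page/48/">48</a>, so we only need to parse it out.

Following some code found online⁴, we parse the HTML with BeautifulSoup and then match with a regular expression, as shown below: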

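import re
from bs4 import BeautifulSoup

def get_total_page(self):
    # Fetch the first archive page, which carries the pagination links
    html = self.get_html(self.archivesUrl)

    page_number = re.compile(r'[0-9]+')

    soup = BeautifulSoup(html, 'html.parser')

    # The last number found in the page-number links is the last page's index
    page_item = soup.find_all('a', class_="page-number")
    total_pages = re.findall(page_number, str(page_item))[-1]
    return int(total_pages)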
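
As a cross-check, the same number can be read straight from the pagination links' `href` values with a regular expression only. This is an alternative sketch, not the author's method; `total_pages_from_html` is a hypothetical name, and the fallback of 1 page when no links are found is an assumption.

```python
import re

# The pagination fragment quoted from the page source earlier in this post.
sample = (
    '<span class="space">…</span>'
    '<a class="page-number" href="/archives/page/48/">48</a>'
)


def total_pages_from_html(html):
    """Return the largest page index found in /archives/page/N/ links."""
    numbers = [int(n) for n in re.findall(r'href="/archives/page/(\d+)/"', html)]
    # Assumption: an archive without pagination links has exactly one page.
    return max(numbers) if numbers else 1


print(total_pages_from_html(sample))
```

Taking the maximum rather than the last match means the result does not depend on the order in which the links appear.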

Post information

Looking at the source of each page, we can see the post's publication date:

<div class="post-meta-container">
  <time itemprop="dateCreated"
        datetime="2022-05-30T07:26:25+08:00"
        content="2022-05-30">
    05-30
  </time>
</div>

The post title and link:

<div class="post-title">
    <a class="post-title-link" href="/blog/2022/05/30/Leetcode-divide-two-integers-en/" itemprop="url">
      <span itemprop="name">[LeetCode][29. Divide Two Integers] 5 Approaches: BF use Long, BF use Int, Binary Search use Long, Binary Search use Int, Recursion</span>
    </a>
</div>

From this it is straightforward to write the following code:

def get_page(self):
    data_list = []  # post titles
    data_date = []  # publication dates (YYYY-MM-DD)
    data_link = []  # absolute post URLs

    # Compile the patterns once, outside the page loop
    page_post_title = re.compile('<span itemprop="name">(.*)</span>')
    page_post_date = re.compile('[0-9]{4}-[0-9]{2}-[0-9]{2}')
    page_post_link = re.compile('<a class="post-title-link" href="(.*)" itemprop="url">')

    # range() excludes its upper bound, so add 1 to include the last page
    for i in range(1, self.total_pages + 1):
        if i == 1:
            page_url = self.archivesUrl
        else:
            # rstrip('/') avoids a double slash when archivesUrl ends with '/'
            page_url = self.archivesUrl.rstrip('/') + '/page/' + str(i) + '/'

        html = self.get_html(page_url)

        soup = BeautifulSoup(html, 'html.parser')  # parse the page with html.parser

        for title_item in soup.find_all('span', itemprop="name"):
            post_title = re.findall(page_post_title, str(title_item))[0]
            data_list.append(post_title)

        for date_item in soup.find_all('time', itemprop='dateCreated'):
            post_date = re.findall(page_post_date, str(date_item))[0]
            data_date.append(post_date)

        for link_item in soup.find_all('a', class_="post-title-link"):
            post_link = re.findall(page_post_link, str(link_item))[0]
            data_link.append(self.baseUrl + post_link)

    return data_list, data_date, data_link
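
Three parallel lists only line up if every page yields the same number of titles, dates, and links. The sketch below shows one way to guard against that, using only the standard library on a fragment modelled on the HTML shown earlier (the title is shortened here). `parse_posts` and the base URL without a trailing slash are assumptions for illustration.

```python
import re
from html import unescape

sample = '''
<div class="post-meta-container">
  <time itemprop="dateCreated"
        datetime="2022-05-30T07:26:25+08:00"
        content="2022-05-30">
    05-30
  </time>
</div>
<div class="post-title">
    <a class="post-title-link" href="/blog/2022/05/30/Leetcode-divide-two-integers-en/" itemprop="url">
      <span itemprop="name">[LeetCode][29. Divide Two Integers] 5 Approaches</span>
    </a>
</div>
'''

# [^>] also matches newlines, so the multi-line <time> tag is handled.
DATE_RE = re.compile(r'<time itemprop="dateCreated"[^>]*?content="(\d{4}-\d{2}-\d{2})"')
LINK_RE = re.compile(r'<a class="post-title-link" href="([^"]+)"')
TITLE_RE = re.compile(r'<span itemprop="name">(.*?)</span>', re.S)


def parse_posts(html, base_url):
    """Return one dict per post, failing loudly if the three counts differ."""
    dates = DATE_RE.findall(html)
    links = LINK_RE.findall(html)
    titles = [unescape(t.strip()) for t in TITLE_RE.findall(html)]
    if not (len(dates) == len(links) == len(titles)):
        raise ValueError("title/date/link counts differ; the layout may have changed")
    return [
        {"title": t, "date": d, "link": base_url + l}
        for t, d, l in zip(titles, dates, links)
    ]


for post in parse_posts(sample, "http://www.longluo.me"):
    print(post)
```

Returning one record per post also makes the later Excel and database steps simpler, since each row is already assembled.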

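Saving article details

We already have the list of article links. We loop over that list, request each link, use CSS selectors to match the article content in the response, and then save each article's content to a Markdown file.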
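
The file-writing half of this step might look like the sketch below. The CSS selector step is left out because it depends on the page layout. `safe_filename`, `save_post`, and the `date-title.md` naming scheme are assumptions for illustration, not taken from blog.py.

```python
import re
import tempfile
from pathlib import Path


def safe_filename(title, max_len=80):
    """Replace whitespace and characters that are invalid in common file systems."""
    name = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")
    return name[:max_len] or "untitled"


def save_post(out_dir, title, date, body_html):
    """Write one post to <out_dir>/<date>-<title>.md and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{date}-{safe_filename(title)}.md"
    # Markdown allows inline HTML, so the body is written unchanged.
    path.write_text(f"# {title}\n\n{body_html}\n", encoding="utf-8")
    return path


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_post(tmp, "Divide Two Integers: 5 Approaches", "2022-05-30", "<p>Hello</p>")
    print(saved.name)
    print(saved.read_text(encoding="utf-8"))
```

Post titles often contain characters such as `:` or `/`, which is why the title is sanitized before it is used as a file name.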

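Storing article titles, publication times, and links in Excel and a database

For Excel, Pandas' to_csv can be used directly (it writes a CSV file, which Excel can open);
Database
- Initialize the database: init_db()
- Store the data: save_data_2_db()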
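
A minimal sketch of both storage steps follows. The function names `init_db` and `save_data_2_db` come from the list above, but their bodies, the `posts` table schema, and the choice of SQLite are assumptions, since the post does not say which database is used. The CSV part uses the standard `csv` module in place of Pandas so the example runs without extra installs, and the sample rows are made up.

```python
import csv
import io
import sqlite3


def init_db(conn):
    """Create the posts table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS posts (
               link  TEXT PRIMARY KEY,
               title TEXT NOT NULL,
               date  TEXT NOT NULL
           )"""
    )
    conn.commit()


def save_data_2_db(conn, rows):
    """Insert (title, date, link) rows; a row with an existing link is replaced."""
    conn.executemany(
        "INSERT OR REPLACE INTO posts (title, date, link) VALUES (?, ?, ?)", rows
    )
    conn.commit()


def write_csv(fileobj, rows):
    """Write a header line followed by one line per post."""
    writer = csv.writer(fileobj)
    writer.writerow(["title", "date", "link"])
    writer.writerows(rows)


# Hypothetical rows standing in for the crawler's output.
rows = [
    ("Post A", "2022-05-30", "http://www.longluo.me/blog/a/"),
    ("Post B", "2022-05-23", "http://www.longluo.me/blog/b/"),
]

conn = sqlite3.connect(":memory:")  # pass a file path instead to persist the data
init_db(conn)
save_data_2_db(conn, rows)
save_data_2_db(conn, rows)  # running the crawler twice does not duplicate rows
print(conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0])

buf = io.StringIO()
write_csv(buf, rows)
print(buf.getvalue())
```

Using the link as the primary key makes repeated crawls idempotent. When writing the CSV to a real file for Excel, opening it with `encoding="utf-8-sig"` and `newline=""` helps Excel display non-ASCII titles correctly.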

The full code is at https://github.com/longluo/spider/blob/master/blog.py .