Long Luo's Life Notes

Python爬虫练习：微博热搜爬虫，微博爬虫和主页博文爬虫

发表于 2022-05-23 更新于 2025-05-02 分类于 Python Waline：阅读次数：本文字数： 3.3k 阅读时长 ≈ 3 分钟

By Long Luo

微博热搜爬虫

前几天外甥女要我帮她完成一个小作业，用 Python 完成一个爬虫，于是在网上找了点资料¹ 和 Github 上找了个爬虫² 的例子，改简单了点，只爬取微博热搜数据并存储到 Excel 中，使用 Python Jupyter 编写。

具体代码见：https://github.com/longluo/spider/blob/master/weibohot.ipynb 。

微博爬虫

微博爬虫 ³ 的代码及功能太多太复杂，但我目前只需要爬取下列信息：

个人信息
具体时间段全部博文；
将数据存入 Excel 和数据库中。

所以需要精简其代码。

微博API

微博获取个人信息API：https://weibo.cn/1565668374/info

微博获取个人全部博文API：https://weibo.cn/1565668374?page=1

具体代码见：https://github.com/longluo/spider/blob/master/weibo.py 。

博客爬虫

在写完上述两个爬虫之后，趁热打铁，也把之前自己一直想做完的功能做完了！

功能

获取某网站的全部博文比如 https://www.longluo.me/ 。
爬取内容：文章标题、发布时间、分类、链接、正文（HTML格式）等。

分析

向目标网页发出请求

Header 是请求头信息的，添加的键值对越多，目标网站就越认为你是真实的用户，但我的网站没有反爬虫措施，所以 Header 里面就只有 User-Agent 键值对，用来描述你的操作系统、浏览器等信息。

获取网站博客的文章标题、发布时间和链接；
数据写入Excel中；
数据存入数据库中。

实现

获取全部博文标题、发布时间和链接

全部博文列表位于 https://www.longluo.me/archives/ 中，首先我们要获取的是总共有多少页呢？

总页数

查看网页源代码，我们可以看见全部页数为 <span class="space">…</span><a class="page-number" href="/archives/page/48/">48</a>，所以我们只需要解析出来即可。

参考网络代码⁴ ，利用BeautifulSoup解析HTML，然后利用正则表达式匹配，代码如下所示：

def get_total_page(self):
    html = self.get_html(self.archivesUrl)

    page_number = re.compile('[0-9]+')

    soap = BeautifulSoup(html, 'html.parser')

    page_item = soap.find_all('a', class_="page-number")
    total_pages = re.findall(page_number, str(page_item))[-1]
    return int(total_pages)

阅读全文 »

[LeetCode][125. Valid Palindrome] Simple Two Pointers Solution

发表于 2022-05-22 更新于 2025-05-02 分类于 Leetcode Waline：阅读次数：本文字数： 1.1k 阅读时长 ≈ 1 分钟

By Long Luo

This article is the solution Simple Two Pointers Solution of Problem 125. Valid Palindrome.

Very easy, we can use Two Pointers to solve this problem.

public boolean isPalindrome(String s) {
    if (s == null || s.length() <= 1) {
        return true;
    }

    int n = s.length();
    int left = 0;
    int right = n - 1;
    while (left < right) {
        while (left < n && left < right && !(Character.isDigit(s.charAt(left)) || Character.isAlphabetic(s.charAt(left)))) {
            left++;
        }

        while (right > 0 && left < right && !(Character.isDigit(s.charAt(right)) || Character.isAlphabetic(s.charAt(right)))) {
            right--;
        }

        if (Character.toUpperCase(s.charAt(left)) != Character.toUpperCase(s.charAt(right))) {
            return false;
        }

        while (left < right && Character.toUpperCase(s.charAt(left)) == Character.toUpperCase(s.charAt(right))) {
            left++;
            right--;
        }
    }

    return true;        
}

阅读全文 »

[LeetCode][3. Longest Substring Without Repeating Characters] Two Pointers: Sliding Window with HashSet

发表于 2022-05-21 更新于 2025-05-03 分类于 Leetcode Waline：阅读次数：本文字数： 924 阅读时长 ≈ 1 分钟

By Long Luo

This article is the solution Two Pointers: Sliding Window with HashSet of Problem 3. Longest Substring Without Repeating Characters.

Intuition

We can just simulate the process.

Using a Sliding Window from left to right of the string, and a HashSet to make every character is unique in the sliding window.

public static int lengthOfLongestSubstring(String s) {
    if (s == null || s.length() <= 1) {
        return s.length();
    }

    int ans = 0;
    int len = s.length();
    Set<Character> set = new HashSet<>();
    int left = 0;
    int right = 0;
    while (right < len) {
        while (right < len && !set.contains(s.charAt(right))) {
            set.add(s.charAt(right));
            right++;
        }

        ans = Math.max(ans, right - left);
        set.remove(s.charAt(left));
        left++;
    }

    return ans;
}

阅读全文 »

[LeetCode][21. Merge Two Sorted Lists] The Recursive Algorithm with Detailed Image Explanation

发表于 2022-05-20 更新于 2025-05-02 分类于 Leetcode Waline：阅读次数：本文字数： 1.4k 阅读时长 ≈ 1 分钟

By Long Luo

This article is the solution The Recursive Algorithm with Detailed Image Explanation of Problem 21. Merge Two Sorted Lists .

Intuition

We can use Recursion to solve this problem.

The key points of Recursion are \(2\).

How to terminate the recursion: Returns when either \(\texttt{l1}\) or \(\texttt{l2}\) is \(\texttt{null}\).
What to do in the process: if \(\texttt{l1.val} \le \texttt{l2.val}\), then point \(\texttt{l1.next}\) to the smaller of \(\texttt{l1}\) and \(\texttt{l2}\).

If \(\texttt{l1.val} \le \texttt{l2.val}\), we can choose the smaller node, such as \(\texttt{l1}\). However, the linked list is not reached the end, we will continue to compare.

Now we are compare \(\texttt{l1.next}\) and \(\texttt{l2}\). The \(\texttt{l1.next}\) and \(\texttt{l2}\) are processed in the recursive functions of the next layer.

We process such process and finally get the result.

阅读全文 »

[LeetCode][329. Longest Increasing Path in a Matrix] 4 Approaches: BFS, Memorization DFS, DP, Topo Sorting

发表于 2022-05-19 更新于 2025-05-02 分类于 Leetcode Waline：阅读次数：本文字数： 6.2k 阅读时长 ≈ 6 分钟

By Long Luo

This article is the solution 4 Approaches: BFS, Memorization DFS, DP, Topo Sorting of Problem 329. Longest Increasing Path in a Matrix .

Here shows 4 Approaches to slove this problem: BFS, Memorization DFS, DP, Topological Sorting.

BFS

The simplest method that comes to my mind is BFS. We start search from each node, spread out like water, to see how far it can flood, and record the maximum flood path length of all nodes.

class Solution {
    public int longestIncreasingPath(int[][] matrix) {
        if (matrix == null || matrix.length == 0 || matrix[0].length == 0) {
            return 0;
        }

        int row = matrix.length;
        int col = matrix[0].length;
        if (row == 1 && col == 1) {
            return 1;
        }

        int max = 1;
        for (int i = 0; i < row; i++) {
            for (int j = 0; j < col; j++) {
                max = Math.max(max, bfs(matrix, i, j));
            }
        }

        return max;
    }

    public int bfs(int[][] matrix, int x, int y) {
        int[][] dirs = {{-1, 0}, {1, 0}, {0, -1}, {0, 1}};

        int ans = 0;
        int row = matrix.length;
        int col = matrix[0].length;

        Queue<int[]> queue = new LinkedList<>();
        queue.offer(new int[]{x, y});

        while (!queue.isEmpty()) {
            int size = queue.size();
            ans++;

            for (int i = 0; i < size; i++) {
                int[] curPos = queue.poll();

                for (int[] dir : dirs) {
                    int nextX = curPos[0] + dir[0];
                    int nextY = curPos[1] + dir[1];

                    if (nextX < 0 || nextX >= row || nextY < 0 || nextY >= col) {
                        continue;
                    }

                    if (matrix[nextX][nextY] <= matrix[curPos[0]][curPos[1]]) {
                        continue;
                    }

                    queue.offer(new int[]{nextX, nextY});
                }
            }
        }

        return ans;
    }
}

Analysis

Time Complexity: \(O(m^2n^2)\).
Space Complexity: \(O(mn)\).

阅读全文 »