Scraping Meme Images with Python and Baidu AI

Updated: 2021-06-27 09:42:00 Author: Amo Xiang

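This article first scrapes meme images from the web, then uses Baidu AI to recognize the caption text on each image and renames the files with that text. Readers interested in the approach can use it as a reference.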

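1. Applying for a Baidu AI Open Platform Key

This example uses the Baidu AI API for text recognition, so you first need to apply for access to that API. The steps are as follows:

In a web browser (such as Chrome or Firefox), enter ai.baidu.com in the address bar to open the Baidu Cloud AI website, then click the Console button in the upper-right corner of the page.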

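On the Baidu Cloud AI login page, enter your Baidu account and password. If you do not have an account, click the Register Now link to sign up.

After logging in, you arrive at the Baidu Cloud AI console. Click Products & Services in the left navigation to expand the list. At the lower right of the list there is an Artificial Intelligence category; from there select Image Recognition, or select Text Recognition directly.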

This opens the Image Recognition - Overview page. To use the Baidu Cloud AI API you must apply for permission, and before applying you must create your own application, so click the Create Application button.

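On the Create Application page, enter the application name, choose the application type, and select the interfaces. Note that you can select more interfaces than you need right now: tick everything you might use later, so that it is already available when you build other examples. After selecting the interfaces, choose Not Needed for the text recognition package name, enter a description for the application, and click the Create Now button.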

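Once the application is created, click the Back to Application List button. The application list page shows the application you created together with the AppID, API Key, and Secret Key that Baidu Cloud assigned to it automatically. These values differ from one application to another, so store them safely for use during development.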
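
One way to keep those three values out of the source code is to read them from environment variables. The following is only a minimal sketch: the variable names BAIDU_APP_ID, BAIDU_API_KEY, and BAIDU_SECRET_KEY and the helper load_baidu_config are hypothetical, chosen for illustration. The dictionary keys match the config dictionary passed to AipOcr in section 3.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
ENV_NAMES = {
    "appId": "BAIDU_APP_ID",
    "apiKey": "BAIDU_API_KEY",
    "secretKey": "BAIDU_SECRET_KEY",
}


def load_baidu_config(env=None):
    """Build the config dict from environment variables.

    Raises KeyError naming the variables that are missing or empty.
    """
    env = os.environ if env is None else env
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in ENV_NAMES.items()}


if __name__ == "__main__":
    # Demo with a fake environment so the sketch runs without real keys.
    fake_env = {
        "BAIDU_APP_ID": "demo-app-id",
        "BAIDU_API_KEY": "demo-api-key",
        "BAIDU_SECRET_KEY": "demo-secret-key",
    }
    print(sorted(load_baidu_config(fake_env)))
```

With real keys exported in the shell, the returned dictionary can be unpacked into the client constructor in place of hard-coded strings.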

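2. Scraping Memes from Tieba

This example uses a Baidu Tieba thread that contains some homemade memes: https://tieba.baidu.com/p/5522091060
The goal is to download all of its images. The steps are as follows:

Capture the traffic in the Network tab and check whether the returned data matches what the Elements panel shows, that is, whether the response already contains the data we want rather than having it loaded afterward by JavaScript. Copy the URL of the first image and search for it in the Response pane of the Network tab.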

The Network capture for the first page shows no sign of data being loaded dynamically through Ajax.

Clicking through to the second page, however, does reveal an Ajax request.

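Searching for the first image's URL finds it in this response as well.

The request carries three parameters, and pn is presumably page_number, the page index. Simulate the request with Postman or with your own code, remembering to include the Host and X-Requested-With headers, and check whether pn=1 returns the first page's data. It does, which means the data for every page can be fetched through this endpoint.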
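
The request described above can be sketched with the standard library alone. This is an illustrative sketch, not the article's script: build_page_request is a hypothetical helper, it only constructs the request object without sending it, and the meaning of the t parameter (a Unix timestamp) is inferred from the script in this article rather than from any Tieba documentation.

```python
import time
from urllib.parse import urlencode
from urllib.request import Request

TIEZI_URL = "https://tieba.baidu.com/p/5522091060"


def build_page_request(page, timestamp=None):
    """Return an unsent Request for one page of the thread."""
    params = {
        "pn": page,  # guessed to be the page number
        "ajax": "1",
        # assumed to be a Unix timestamp, as in the article's script
        "t": int(time.time()) if timestamp is None else timestamp,
    }
    headers = {
        "Host": "tieba.baidu.com",
        "X-Requested-With": "XMLHttpRequest",
    }
    return Request(TIEZI_URL + "?" + urlencode(params), headers=headers)


if __name__ == "__main__":
    req = build_page_request(1, timestamp=0)
    print(req.full_url)
    print(req.header_items())
```

Printing the URL and headers before sending makes it easy to compare the simulated request against the one captured in the Network tab.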

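First load the thread to find out which page is the last one, then loop over the pages, parse each response to extract the image URLs, write the URLs to a file, and download the images with multiple threads. The full code is as follows.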

# Download every image in a Baidu Tieba thread
import os
import queue
import threading
import time

import chardet
import requests
from bs4 import BeautifulSoup

tiezi_url = "https://tieba.baidu.com/p/5522091060"
headers = {
    'Host': 'tieba.baidu.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KH'
                  'TML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
}
pic_save_dir = 'tiezi_pic/'
os.makedirs(pic_save_dir, exist_ok=True)  # create the output folder if it is missing

pic_urls_file = 'tiezi_pic_urls.txt'
download_q = queue.Queue()  # download queue


# Get the number of pages in the thread (None if it cannot be determined)
def get_page_count():
    try:
        resp = requests.get(tiezi_url, headers=headers, timeout=5)
        resp.raise_for_status()
        resp.encoding = chardet.detect(resp.content)['encoding']
        soup = BeautifulSoup(resp.text, 'lxml')
        a_s = soup.find("ul", attrs={'class': 'l_posts_num'}).find_all("a")
        for a in a_s:
            if a.get_text() == '尾页':  # the "last page" link in the pager
                return a['href'].split('=')[1]
    except Exception as e:
        print(str(e))
    return None


# Download thread
class PicSpider(threading.Thread):
    def __init__(self, t_name, func):
        self.func = func
        threading.Thread.__init__(self, name=t_name)

    def run(self):
        self.func()


# Collect every image URL on one page and append it to the URL list file
def get_pics(count):
    params = {
        'pn': count,
        'ajax': '1',
        't': int(time.time())
    }
    try:
        resp = requests.get(tiezi_url, headers=headers, timeout=5, params=params)
        resp.raise_for_status()
        resp.encoding = chardet.detect(resp.content)['encoding']
        soup = BeautifulSoup(resp.text, 'lxml')
        imgs = soup.find_all('img', attrs={'class': 'BDE_Image'})
        with open(pic_urls_file, 'a') as fout:
            for img in imgs:
                print(img['src'])
                fout.write(img['src'] + '\n')
    except Exception as e:
        print(e)  # report the failure instead of silently skipping the page


# Worker function run by each download thread
def down_pics():
    while True:
        try:
            # get_nowait avoids blocking forever if another thread took the last item
            data = download_q.get_nowait()
        except queue.Empty:
            break
        download_pic(data)
        download_q.task_done()


# Download a single image
def download_pic(img_url):
    try:
        resp = requests.get(img_url, headers=headers, timeout=10)
        if resp.status_code == 200:
            print("Downloading image: " + img_url)
            pic_name = img_url.split("/")[-1]
            with open(pic_save_dir + pic_name, "wb") as f:
                f.write(resp.content)
    except Exception as e:
        print(e)


if __name__ == '__main__':
    print("Checking whether the URL list file exists:")
    if not os.path.exists(pic_urls_file):
        print("Not found, parsing the thread...")
        page_count = get_page_count()
        if page_count is not None:
            headers['X-Requested-With'] = 'XMLHttpRequest'
            for page in range(1, int(page_count) + 1):
                get_pics(page)
        print("URL parsing finished!")
        headers.pop('X-Requested-With', None)  # may never have been set
    else:
        print("Found")
    if not os.path.exists(pic_urls_file):
        raise SystemExit("No image URLs were collected.")
    print("Starting image download~~~~")
    # Drop the Tieba Host header so requests derives Host from each image URL
    headers.pop('Host', None)
    with open(pic_urls_file, "r") as fo:
        # strip the trailing newline from every URL and skip blank lines
        pic_list = [line.strip() for line in fo if line.strip()]

    threads = []
    for pic in pic_list:
        download_q.put(pic)
    # a small fixed pool is enough; one thread per image would waste resources
    for i in range(min(8, len(pic_list))):
        t = PicSpider(t_name='thread-' + str(i), func=down_pics)
        t.daemon = True
        t.start()
        threads.append(t)
    download_q.join()
    for t in threads:
        t.join()
    print("Image download finished")

When the script finishes, the downloaded images are saved in the tiezi_pic/ folder.
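
The URL-extraction step of the script can be tried offline without requests or BeautifulSoup. The sketch below swaps in the standard library's html.parser for BeautifulSoup and runs against hand-written sample markup, not a real Tieba response. ImageUrlParser and extract_image_urls are hypothetical names, and the only detail taken from the script is that post images carry the BDE_Image class.

```python
from html.parser import HTMLParser


class ImageUrlParser(HTMLParser):
    """Collect src attributes of <img> tags carrying the BDE_Image class."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "BDE_Image" in classes and attr_map.get("src"):
            self.urls.append(attr_map["src"])


def extract_image_urls(html):
    """Return the image URLs found in an HTML fragment, in document order."""
    parser = ImageUrlParser()
    parser.feed(html)
    return parser.urls


if __name__ == "__main__":
    # Hand-written sample markup, not a real Tieba response.
    sample = (
        '<div><img class="BDE_Image" src="https://example.com/a.jpg">'
        '<img class="avatar" src="https://example.com/b.jpg">'
        '<img class="BDE_Image big" src="https://example.com/c.jpg"/></div>'
    )
    print(extract_image_urls(sample))
```

Keeping the parsing in a pure function like this makes it possible to check the selector against a saved response before starting a full crawl.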

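Next, OCR is used to extract the text from each meme and name the image after it, so that a file search for a keyword quickly finds the meme you need. Tesseract, the OCR engine backed by Google, is a poor fit for this kind of image, where the picture is large and the text is small: its recognition rate is too low, and sometimes it recognizes nothing at all. Baidu Cloud OCR is more suitable here, because it locates the text regions in the image automatically and returns all of the text it finds.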

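3. Using baidu-aip

After obtaining the keys for your Baidu AI application, install baidu-aip on your local system with the following command:

pip install baidu-aip

Start by recognizing a single image to see how well it works: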

from aip import AipOcr

# Create an AipOcr client
config = {
    'appId': 'your appId',
    'apiKey': 'your apiKey',
    'secretKey': 'your secretKey'
}
client = AipOcr(**config)


# Recognize the text in an image
def img_to_str(image_path):
    # Read the image as bytes
    with open(image_path, 'rb') as fp:
        image = fp.read()

    # Call general text recognition; the argument is the local image's bytes
    result = client.basicGeneral(image)
    # Join the recognized lines; returns None if nothing was recognized
    if 'words_result' in result:
        return '\n'.join([w['words'] for w in result['words_result']])


if __name__ == '__main__':
    print(img_to_str('tiezi_pic/5c0ddb1e4134970aebd593e29ecad1c8a5865dbd.jpg'))

Run the program to see the recognized text.

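Baidu AI returns JSON data, which the client exposes as a dictionary like the one shown below. It has three keys: log_id, words_result_num, and words_result. words_result_num is the number of text lines recognized, and words_result is a list in which each item is a dictionary whose words key holds one line of recognized text.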

{'words_result': [{'words': 'o。o'}, {'words': '6226-16:59'}, {'words': '絕望jpg'}], 'log_id': 1393611954748129280, 'words_result_num': 3}
o。o
6226-16:59
絕望jpg
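
To make the structure concrete, the sketch below walks the sample dictionary shown above without calling the API. summarize_ocr_result is a hypothetical helper written for this illustration; it assumes only the three keys described in this section.

```python
def summarize_ocr_result(result):
    """Return (line_count, lines) from a basicGeneral-style result dict."""
    lines = [item["words"] for item in result.get("words_result", [])]
    # fall back to counting the lines if words_result_num is absent
    return result.get("words_result_num", len(lines)), lines


if __name__ == "__main__":
    # The sample response shown in this section.
    sample = {
        'words_result': [{'words': 'o。o'}, {'words': '6226-16:59'}, {'words': '絕望jpg'}],
        'log_id': 1393611954748129280,
        'words_result_num': 3,
    }
    count, lines = summarize_ocr_result(sample)
    print(count)
    print("\n".join(lines))
```

Reading the keys with get and a default means a response without words_result yields an empty list instead of raising a KeyError.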

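An image may contain several pieces of text, such as a watermark date or special symbols that were misrecognized, whereas only the Chinese characters and letters are wanted. An image may also yield several lines of Chinese text, so this example picks the line with the most Chinese characters, keeps only its Chinese characters and letters, and uses the result as the file name. The complete example code is as follows: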

# Recognize the text in each image and rename the images in bulk

import datetime
import os
import re

from aip import AipOcr

# Create an AipOcr client
config = {
    'appId': 'your appId',
    'apiKey': 'your apiKey',
    'secretKey': 'your secretKey'
}
client = AipOcr(**config)

pic_dir = r"tiezi_pic/"


# Read an image file as bytes
def get_file_content(file_path):
    with open(file_path, 'rb') as fp:
        return fp.read()


# Recognize the text in an image and rename the file after it
def img_to_str(image_path):
    image = get_file_content(image_path)
    # Call general text recognition; the argument is the local image's bytes
    result = client.basicGeneral(image)
    # Collect the recognized lines
    words_list = []
    if 'words_result' in result:
        if len(result['words_result']) > 0:
            for w in result['words_result']:
                words_list.append(w['words'])
            file_name = str(get_longest_str(words_list)).replace("/", "")
            print(file_name)
            if not file_name:  # no usable text; keep the original file name
                return
            file_dir_name = os.path.join(pic_dir, file_name + '.jpg')
            if os.path.exists(file_dir_name):  # handle duplicate file names
                micro = datetime.datetime.now().microsecond  # microsecond part of the current time
                file_dir_name = os.path.join(pic_dir, file_name + str(micro) + '.jpg')
            try:
                os.rename(image_path, file_dir_name)
            except Exception:
                print("Rename failed:", image_path, "=>", file_name)


# Pick the string with the most Chinese characters and keep only its
# Chinese characters and letters
def get_longest_str(str_list):
    pat = re.compile(r'[\u4e00-\u9fa5A-Za-z]+')
    longest = max(str_list, key=hanzi_len)
    result = pat.findall(longest)
    return ''.join(result)


# Count the Chinese characters in a string
def hanzi_len(item):
    pat = re.compile(r'[\u4e00-\u9fa5]+')
    count = 0
    for ch in item:
        if pat.search(ch):
            count += 1
    return count


# List the paths of all images in a folder
def query_picture(dir_path):
    pic_path_list = []
    for filename in os.listdir(dir_path):
        pic_path_list.append(os.path.join(dir_path, filename))
    return pic_path_list


if __name__ == '__main__':
    pic_list = query_picture(pic_dir)
    if len(pic_list) > 0:
        for i in pic_list:
            img_to_str(i)

After the program runs, the images in tiezi_pic/ are renamed after the text recognized in them.
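
The naming rule can be checked without calling the OCR service or touching any files. The sketch below isolates it into two pure functions. choose_caption follows the same rule as the script (most Chinese characters wins, then only Chinese characters and letters are kept). unique_name is a deliberate variation, not the script's behavior: it resolves name clashes with a counter instead of a microsecond suffix. Both function names are hypothetical.

```python
import re

KEEP = re.compile(r"[\u4e00-\u9fa5A-Za-z]+")  # Chinese characters and letters
HANZI = re.compile(r"[\u4e00-\u9fa5]")        # a single Chinese character


def choose_caption(lines):
    """Pick the line with the most Chinese characters, keeping only hanzi and letters."""
    if not lines:
        return ""
    best = max(lines, key=lambda s: len(HANZI.findall(s)))
    return "".join(KEEP.findall(best))


def unique_name(caption, taken, ext=".jpg"):
    """Append a counter to the caption until the name is not in `taken`."""
    name = caption + ext
    n = 1
    while name in taken:
        name = f"{caption}_{n}{ext}"
        n += 1
    return name


if __name__ == "__main__":
    caption = choose_caption(["o。o", "6226-16:59", "絕望jpg"])
    print(caption)
    print(unique_name(caption, {caption + ".jpg"}))
```

A counter gives predictable names such as caption_1.jpg and cannot collide twice, whereas a microsecond suffix could in principle repeat.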