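A General-Purpose Small-Scale Search Engine Implemented in Python

Updated: 2025-01-12 15:58:29 Author: 神仙別鬧

This article describes a general-purpose small-scale search engine implemented in Python. It explains the design in detail with code samples and illustrations, and may be a useful reference for study or work.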

1. Project overview

1.1 Background

Course project for the network information content acquisition part of the "Information Content Security" course. The requirements are:

A crawler that supports at least 10 websites and incremental data collection, and that collects at least 10,000 real web pages;
Classify the text of the collected web pages;
Filter out duplicate or redundant web pages;
Build an inverted index over the deduplicated content;
Rank search results with the PageRank algorithm;
Support fuzzy natural-language search;
Present search results visually;
Log every search online, and support statistical analysis and association mining on the log data.

1.2 Runtime environment

Platform: all platforms
JDK 1.8.0
ElasticSearch 7.4.0
Python 3.6 or later

Installing the dependencies

PageRank computation, AI text classification, and upload:

> pip install paddlepaddle numpy elasticsearch

Data crawling and preprocessing:

> pip install requests bs4

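1.3 Running steps

Install, configure, and start ElasticSearch

Download and extract Elasticsearch (look up the detailed steps for your platform).

- It can also be installed from the apt or yum repositories, or with the Windows MSI installer.

Install the IK Chinese analyzer (again, look up the detailed steps).
Create the index: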

PUT http://127.0.0.1:9200/page
{
    "settings": {
        "number_of_shards": "5",
        "number_of_replicas": "0"
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "ik_max_word"
            },
            "weight": {
                "type": "double"
            },
            "content" : {
                "type" : "text",
                "analyzer": "ik_max_word"
            },
            "content_type": {
                "type": "text"
            },
            "url": {
                "type": "text",
                "analyzer": "ik_max_word"
            },
            "update_date": {
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
            }
        }
    }
}
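The same request can also be sent from Python. The following is a minimal sketch, assuming the `elasticsearch` package (7.x) listed in section 1.2 and a node listening on the default port 9200. The helper names `build_page_index_body` and `create_page_index` are hypothetical and not part of the project.

```python
def build_page_index_body():
    """Return the settings and mappings shown above as a Python dict."""
    text_ik = {"type": "text", "analyzer": "ik_max_word"}
    return {
        "settings": {"number_of_shards": "5", "number_of_replicas": "0"},
        "mappings": {
            "properties": {
                "title": dict(text_ik),
                "weight": {"type": "double"},
                "content": dict(text_ik),
                "content_type": {"type": "text"},
                "url": dict(text_ik),
                "update_date": {
                    "type": "date",
                    "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis",
                },
            }
        },
    }


def create_page_index(host="127.0.0.1:9200", index="page"):
    """Create the index if it does not exist yet (needs a running node)."""
    # Imported lazily so that building the body works without the client installed.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(hosts=host)
    if not es.indices.exists(index=index):
        es.indices.create(index=index, body=build_page_index_body())
```

Calling `create_page_index()` once after the node is up should be enough. The existence check makes the call safe to repeat.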

Start ElasticSearch by running bin/elasticsearch in bash, or bin\elasticsearch.bat in cmd or PowerShell on Windows. The node must be running before the index-creation request above can be sent.

Start the web service

> cd WebApp
> java -jar *.jar

Crawl and preprocess the data

> cd DataCrawler
> python crawler.py

Compute the PageRank values

> cd DataProcess
> python PageRank.py

Classify the text with AI and upload it to ES

> cd DataProcess/Text_Classification
> python Classify.py

2. Requirements analysis

2.1 Data description

2.1.1 Static data

Variable	Description
`thread_account`	Number of threads
`initial_url`	Seed pages

2.1.2 Dynamic data

Variable	Description	Type
`restricted_domain`	Allowed domains	list
`banned_domain`	Banned domains	list
`thread_account`	Number of threads	integer
`total_pages`	Maximum number of pages	integer

2.1.3 Index data dictionary

Page (`page`) information index:

Field	Meaning	Alias	Type	Notes
`title`	Page title		`text`	Tokenized with `ik_max_word`
`weight`	PageRank value	pr value, PR value	`double`
`content`	Page content		`text`	Tokenized with `ik_max_word`
`content_type`	Category of the page content		`text`	文化 (culture), 娛樂 (entertainment), 體育 (sports), 財經 (finance), 房產 (real estate), 汽車 (automobiles), 教育 (education), 科技 (technology), 國際 (international), 證券 (securities)
`url`	Page URL		`text`	Tokenized with `ik_max_word`
`update_date`	Time the data was last updated		`date`	`yyyy-MM-dd HH:mm:ss` \|\| `yyyy-MM-dd` \|\| `epoch_millis`

2.2 Data collection

Seed URLs are taken from the `init_url` list in order, and each one is used in turn as the starting point of a recursive crawl.

Only URLs that fall within the `restricted_url` list are crawled.
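A restriction of this kind can be expressed as a small URL filter. The sketch below is only an illustration under assumptions: it borrows the list names `restricted_domain` and `banned_domain` from section 2.1.2, treats a subdomain as belonging to its parent domain, and uses the hypothetical helper names `host_matches` and `is_allowed`. The project's crawler may apply a different rule.

```python
from urllib.parse import urlparse


def host_matches(host, domain):
    """True if host equals domain or is a subdomain of it."""
    return host == domain or host.endswith("." + domain)


def is_allowed(url, restricted_domain, banned_domain):
    """Accept a URL only if it is inside the whitelist and outside the blacklist."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname.lower()
    if any(host_matches(host, d) for d in banned_domain):
        return False
    return any(host_matches(host, d) for d in restricted_domain)


restricted_domain = ["example.com"]
banned_domain = ["ads.example.com"]
print(is_allowed("https://news.example.com/a.html", restricted_domain, banned_domain))  # True
print(is_allowed("https://ads.example.com/x", restricted_domain, banned_domain))        # False
```

Checking the blacklist first means a banned subdomain stays excluded even when its parent domain is whitelisted.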

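2.3 Functional requirements

2.3.1 Data crawling and preprocessing

The Python crawler performs the following steps:

1. Start.
2. Pick a link as the starting point.
3. If the total number of crawled pages has reached the limit, stop; otherwise go to step 4.
4. Crawl the information for the given link and collect all links on the current page.
5. For each link collected in step 4, repeat from step 3.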
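The steps above can be sketched as a short loop. This is a simplified illustration, not the project's `crawler.py`: it uses an explicit queue and a visited set instead of recursion, runs in a single thread, and takes the page-fetching function as a parameter so that it can run without network access. The names `crawl`, `fetch`, and `fake_web` are hypothetical; `initial_url` and `total_pages` follow section 2.1.

```python
from collections import deque


def crawl(initial_url, total_pages, fetch):
    """Breadth-first crawl that stops once total_pages pages are collected.

    fetch(url) is a caller-supplied function returning (page_info, links),
    or None when the page cannot be retrieved.
    """
    queue = deque(initial_url)
    seen = set(initial_url)
    pages = {}
    while queue and len(pages) < total_pages:   # step 3: stop at the limit
        url = queue.popleft()
        result = fetch(url)                      # step 4: fetch info and links
        if result is None:
            continue
        page_info, links = result
        pages[url] = dict(page_info, link=list(links))
        for link in links:                       # step 5: repeat for each new link
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Tiny in-memory "web" so the sketch runs without network access.
fake_web = {
    "a": ({"title": "A"}, ["b", "c"]),
    "b": ({"title": "B"}, ["a", "d"]),
    "c": ({"title": "C"}, []),
    "d": ({"title": "D"}, ["a"]),
}
pages = crawl(["a"], total_pages=3, fetch=fake_web.get)
print(sorted(pages))  # ['a', 'b', 'c']
```

The visited set keeps a page from being fetched twice even when several pages link to it. It also prevents the endless loop that a naive recursive crawl would enter on cyclic links.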

The following information is collected for each page:

title
content
content_type
update_date
url
link (all links contained in the current page, used to compute the PR value)

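2.3.2 PageRank computation

The PageRank value of every crawled page is computed from its `link` list over 50 iterations. The computation also has to handle PR values that oscillate periodically instead of converging. The resulting PR value serves as the measure of a page's importance and is added to the page information.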
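The periodic behaviour is easy to reproduce with two pages that only link to each other: a plain iteration keeps swapping their values forever. The code in section 3 blends each new estimate into a running average of the previous ones. The sketch below illustrates that idea on plain Python lists. It is a simplified illustration with hypothetical function names, without a damping factor, and is not the project's exact implementation.

```python
def step(value, links):
    """One plain PageRank step: each page splits its value among its links."""
    new = [0.0] * len(value)
    for i, targets in enumerate(links):
        if not targets:            # a page without out-links keeps its value
            new[i] += value[i]
            continue
        for j in targets:
            new[j] += value[i] / len(targets)
    return new


def pagerank_plain(value, links, epoch):
    """Repeat the plain step epoch times."""
    for _ in range(epoch):
        value = step(value, links)
    return value


def pagerank_averaged(value, links, epoch):
    """Blend every new estimate into a running average of the previous ones."""
    for it in range(1, epoch + 1):
        new = step(value, links)
        value = [n / (it + 1) + v * it / (it + 1) for n, v in zip(new, value)]
    return value


links = [[1], [0]]                 # two pages that only link to each other
start = [1.0, 0.0]
print(pagerank_plain(start, links, 50))   # [1.0, 0.0]
print(pagerank_plain(start, links, 51))   # [0.0, 1.0]
print([round(v, 6) for v in pagerank_averaged(start, links, 50)])  # [0.5, 0.5]
```

With the running average, the result no longer depends on whether the iteration count is odd or even, and the total value is still conserved.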

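2.3.3 AI text classification and upload to ES

Deep learning is used to determine the category of each page's `content`. The category is added to the page information, the `link` field, which is no longer needed, is removed, and the resulting final record is uploaded to ES for use by the user-interaction feature.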
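As a small illustration of how the final record is formed, the sketch below takes one crawled page and one vector of class probabilities, picks the most likely category, and drops the `link` field. The function name `finalize_page` and the sample values are hypothetical; the category list follows the index data dictionary in section 2.1.3.

```python
CATEGORIES = ["文化", "娛樂", "體育", "財經", "房產", "汽車", "教育", "科技", "國際", "證券"]


def finalize_page(page, probabilities):
    """Return the document to upload: add content_type, drop the link list."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    doc = {key: value for key, value in page.items() if key != "link"}
    doc["content_type"] = CATEGORIES[best]
    return doc


page = {
    "title": "t",
    "content": "c",
    "url": "http://example.com/",
    "weight": 0.1,
    "update_date": "2025-01-12 15:58:29",
    "link": ["a", "b"],
}
probs = [0.01, 0.02, 0.70, 0.05, 0.02, 0.05, 0.05, 0.05, 0.03, 0.02]
doc = finalize_page(page, probs)
print(doc["content_type"])  # 體育
print("link" in doc)        # False
```

The function builds a new dictionary and leaves the crawled record untouched, so the same page can be classified again without re-reading it.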

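2.3.4 User interaction

A WebApp is provided that users access through a browser. When a user submits a search, the input is validated; if it is invalid, an ERROR page is returned. If it is valid, the back end queries the local ES instance, processes the data, and shows the results item by item on the front end. In addition, the number of requests per minute from a single IP is limited as a simple defense against abusive searching.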
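The per-IP limit can be pictured as a sliding-window counter. The WebApp itself is a Spring Boot application, so the Python sketch below only illustrates the logic and is not the project's code. The class name `IpRateLimiter`, the default of 30 requests per minute, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time
from collections import defaultdict, deque


class IpRateLimiter:
    """Allow at most max_requests per window_seconds for each client IP."""

    def __init__(self, max_requests=30, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self.hits = defaultdict(deque)   # ip -> timestamps of recent requests

    def allow(self, ip):
        now = self.clock()
        recent = self.hits[ip]
        # Forget requests that have left the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True


limiter = IpRateLimiter(max_requests=3, window_seconds=60.0)
print([limiter.allow("203.0.113.7") for _ in range(4)])  # [True, True, True, False]
print(limiter.allow("198.51.100.2"))                      # True
```

Rejected requests are not recorded, so a client that keeps hammering the server regains access as soon as its older requests leave the window.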

2.4 Performance requirements

2.4.1 Data accuracy

Accuracy requirements are modest. The main data items are:

Item	Constraint
Total amount of crawled data	Checked once per hour
Number of query results	Count of all matching results
Data update date	Accurate to the minute

2.4.2 Timing

Item	Constraint
Crawling 10,000 pages	Within 30 minutes
Computing the PR values of 10,000 pages	Within 10 minutes
Classifying 10,000 pages with AI and uploading them to ES	Within 10 minutes
Opening the Web home page	Within 5 seconds
Opening the query results page	Within 5 seconds

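2.5 Operational requirements

2.5.1 User interface

Users access the system through a browser. There are two pages: the home page, which has only a simple input box for searching, and the general page, which offers advanced search and displays the results.

2.5.2 Home page

Control	Purpose	Layout
Icon	Shows the logo	Centered

2.5.3 Search results page

This page has three parts: the navigation bar, the search results, and the information panel. They are laid out as follows:

Part	Position	height	width
Navigation bar	Top	50px	100%
Search results	Below the navigation bar, left	auto	70%
Information panel	Below the navigation bar, right	auto	30%

Navigation bar

The following controls are arranged from left to right in the navigation bar (their order may vary):

Control	Purpose
Input box	Receives the keywords entered by the user
Input box	Accepts a domain name; results are restricted to that domain
Number input	Results are paginated; this box jumps to the given results page
Select box	Lets the user choose the match mode: title and content (default), title only, or content only
Select box	Chooses the sort order of the results: inverted index (default) or PageRank
Button	Submits all user input and returns the search results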
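One way to picture how these inputs reach ElasticSearch is as a function that turns them into a search request body. The sketch below is an assumption-laden illustration, not the WebApp's actual query: the function name `build_search_body`, the page size of 10, and the choice of `multi_match` plus a `match_phrase` filter on `url` are hypothetical, while the field names come from the index defined in section 1.3.

```python
def build_search_body(keyword, domain="", page=1, match="both",
                      sort_by="relevance", page_size=10):
    """Translate the navigation-bar inputs into an Elasticsearch request body."""
    fields = {
        "both": ["title", "content"],
        "title": ["title"],
        "content": ["content"],
    }[match]
    query = {"bool": {"must": [{"multi_match": {"query": keyword, "fields": fields}}]}}
    if domain:
        # Assumption: restrict results by matching the domain as a phrase
        # in the analyzed url field.
        query["bool"]["filter"] = [{"match_phrase": {"url": domain}}]
    body = {
        "query": query,
        "from": (max(page, 1) - 1) * page_size,   # offset of the first hit
        "size": page_size,
    }
    if sort_by == "pagerank":
        # The weight field stores the PageRank value.
        body["sort"] = [{"weight": {"order": "desc"}}]
    return body


body = build_search_body("python", domain="example.com", page=3, sort_by="pagerank")
print(body["from"], body["size"])  # 20 10
```

When no sort is given, ElasticSearch orders hits by its own relevance score, which corresponds to the default option in the sort select box.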

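Search results

Results are shown as a list. Each list item shows the following data for a matching page:

Title
Content
url
Category
PageRank value
Update time

At the end of the list, a pagination component lets the user click to jump to another page. [Figure: style of the pagination component]

Information panel

Shows some essential information, such as:

Time taken by the current query
Number of query results
Total number of records in the database
And so on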

2.5.4 Software interfaces

Interface	Description	Module	Invocation
`init_first_time()`	Called on first start-up	`crawler.py`	Internal
`get_result(url)`	Fetches the page at the target url	`crawler.py`	Internal
`spider_thread()`	Crawler thread	`crawler.py`	Internal
`main()`	Main task thread	`crawler.py`	`crawler.main()`
`init()`	Removes all links that do not appear among the crawled urls, as well as invalid files	`PageRank.py`	Internal
`Rank(Value, start)`	Computes PageRank	`PageRank.py`	Internal
`run()`	Program entry point	`PageRank.py`	`PageRank.run()`
`get_data(sentence)`	Gets the crawled data (encodes the text as dictionary IDs for the model)	`Classify.py`	Internal
`batch_reader(json_list,json_path)`	Classifies text with AI	`Classify.py`	`Classify.batch_reader()`

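2.5.5 Fault handling

If a functional module fails, the following symptoms appear:

Module	Symptom	Quick check
Crawler	Data stops updating	Check the network; check whether memory is exhausted
PageRank computation	Data stops updating	Check whether memory or CPU is exhausted
AI text classification	Data stops updating	Check whether memory or CPU is exhausted
ElasticSearch	Front end cannot get query results	The problem is more complex
WebApp	Site cannot be reached	The problem is more complex

Failures in the last two modules are serious. If a restart does not solve the problem, take the following measures:

Module	Troubleshooting	Last resort
ElasticSearch	① Is the Java environment correct? ② Is port 9200 open? ③ Is port 9200 already in use? ④ Is a plugin failing? ⑤ Is the machine out of resources?	Deploy on another machine and change the WebApp to use the service there
WebApp	① Is the port already in use? ② Is the Java environment correct? ③ Is ElasticSearch running normally? ④ Is the machine out of resources?	Deploy on another machine and point the domain's DNS record at the new machine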
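The port checks in the table can be automated with a few lines of standard-library Python. The helper name `port_is_open` is hypothetical, and the sketch only tests whether something accepts TCP connections on the port; it does not tell which process is listening.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 9200 is the default HTTP port of ElasticSearch.
print(port_is_open("127.0.0.1", 9200))
```

A result of `False` while ElasticSearch is supposed to be running points to checks ① and ② in the table. A result of `True` while ElasticSearch is stopped suggests that another process occupies the port (check ③).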

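2.6 Other requirements

2.6.1 Maintainability

The crawler has a blacklist and a whitelist that limit the crawl scope.
The functional modules are separated and cooperate through shared data. As long as the data format is unchanged, changing one module does not affect the others.

2.6.2 Portability

The WebApp is built with the Spring Boot framework and is packaged as a single jar, which can be deployed on any machine with a Java environment.
The other functions are implemented in Python and can be deployed on any machine with a Python environment.
ElasticSearch supports distributed deployment and runs on any platform.

2.6.3 Data integrity

ElasticSearch supports distributed deployment and, when replica shards are configured, automatically keeps copies of the data on different nodes, so the failure of a single node neither destroys data nor affects query results. The index settings in section 1.3 set `number_of_replicas` to 0, so this protection applies only after the replica count is raised on a multi-node cluster.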

3. Code

PageRank.py:

import os
import sys
import json
import numpy as np
import time
import codecs
 
dir_path = os.path.split(os.path.realpath(sys.argv[0]))[0] + '/../RawData'
 
print(dir_path)
Vexname = list(os.listdir(dir_path))
Vexnum = len(Vexname)
epoch = 50
 
# Initialization: delete unreadable files and drop every link that does not
# point to a crawled page
def init():
    global Vexnum
    falsefiles = []
    start = time.perf_counter()
    # Pass 1: find the files that cannot be parsed as JSON objects
    for idx, file in enumerate(Vexname):
        if idx % 100 == 0:
            c = int(idx / Vexnum * 100)
            a = '=' * c
            b = ' ' * (100 - c)
            dur = time.perf_counter() - start
            sys.stdout.write("\r{:^3.0f}%[{}=>{}]{:.2f}s".format(c, a, b, dur))
            sys.stdout.flush()
        try:
            with codecs.open(os.path.join(dir_path, file), 'r', encoding='utf-8') as load_f:
                text = json.load(load_f)
            if not isinstance(text, dict):
                raise ValueError('not a JSON object')
        except (OSError, ValueError):
            falsefiles.append(file)
    print('\nDeleting invalid files and links...')
    for file in falsefiles:
        os.remove(os.path.join(dir_path, file))
        Vexname.remove(file)
    Vexnum = len(Vexname)
    # Pass 2: keep only the links whose target page was crawled and is valid
    valid = set(Vexname)
    for file in Vexname:
        with codecs.open(os.path.join(dir_path, file), 'r', encoding='utf-8') as load_f:
            text = json.load(load_f)
        links = text.get('link')
        if not isinstance(links, list):
            links = []
        text['link'] = [link for link in links
                        if isinstance(link, str) and link + '.json' in valid]
        with codecs.open(os.path.join(dir_path, file), 'w', encoding='utf-8') as dump_f:
            json.dump(text, dump_f, ensure_ascii=False, indent=4)
 
# Compute PageRank
def Rank(Value, start):
    # Read the link structure once, as lists of target indices
    index = {name: i for i, name in enumerate(Vexname)}
    targets = []
    for name in Vexname:
        with open(os.path.join(dir_path, name), 'r', encoding='utf-8') as load_f:
            text = json.load(load_f)
        targets.append([index[link + '.json'] for link in text['link']
                        if link + '.json' in index])
    for it in range(1, epoch + 1):
        c = int(it / epoch * 100)
        a = '=' * c
        b = ' ' * (100 - c)
        dur = time.perf_counter() - start
        sys.stdout.write("\r{:^3.0f}%[{}=>{}]{:.2f}s".format(c, a, b, dur))
        sys.stdout.flush()
        # Start every iteration from zero so that contributions do not pile up
        NewValue = np.zeros(Vexnum, dtype=np.double)
        for i in range(Vexnum):
            count = len(targets[i])
            if count == 0:
                # A page without outgoing links keeps its own value
                NewValue[i] += Value[i]
                continue
            for j in targets[i]:
                NewValue[j] += Value[i] / count
        # Average the new estimate into the previous ones so that values
        # which would otherwise oscillate periodically settle down
        Value = NewValue / (it + 1) + Value * (it / (it + 1))
    return Value
 
 
def run():
    print('Starting PageRank computation...')
    print('Initializing data...')
    init()
    Value = np.ones(len(Vexname), dtype=np.double) * (1000.0 / Vexnum)
    print('Invalid files removed!')
    print('Computing PageRank ({} iterations)...'.format(epoch))
    start = time.perf_counter()
    Value = Rank(Value, start)
    a = '=' * 100
    b = ' ' * 0
    c = 100
    dur = time.perf_counter() - start
    sys.stdout.write("\r{:^3.0f}%[{}=>{}]{:.2f}s".format(c, a, b, dur))
    sys.stdout.flush()
    print('\nPageRank computed, writing the values to the JSON files...')
    for i, file in enumerate(Vexname):  # write the PageRank value into each JSON file
        with open(os.path.join(dir_path, file), 'r', encoding='utf-8') as load_f:
            text = json.load(load_f)
        # Store a plain float in the weight field
        text['weight'] = float(Value[i])
        with open(os.path.join(dir_path, file), 'w', encoding='utf-8') as dump_f:
            json.dump(text, dump_f, ensure_ascii=False, indent=4)
    print('Data written.')
 
 
if __name__ == '__main__':
    run()

Classify.py:

# Import the required packages
import ast
import json
import os
import sys
import time
import math
import gc

import elasticsearch
import numpy as np
import paddle.fluid as fluid

dir_path = os.path.dirname(os.path.realpath(__file__))
# Run predictions with the trained model and output the results
# Create the executor
place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

save_path = os.path.join(dir_path, 'infer_model/')

# Load the inference program, the list of input variable names, and the classifier from the model
[infer_program, feeded_var_names, target_var] = fluid.io.load_inference_model(dirname=save_path, executor=exe)

# Host
host = "py7hon.com:9200"

# Connect to elasticsearch
try:
    es = elasticsearch.Elasticsearch(hosts=host)
    # The constructor does not connect by itself, so check the node explicitly
    if not es.ping():
        raise ConnectionError('cannot reach ' + host)
except Exception as e:
    print(e)
    sys.exit(1)


# Read the data dictionary (character -> ID) once
with open(os.path.join(dir_path, 'dict_txt.txt'), 'r', encoding='utf-8') as f_data:
    # literal_eval parses the dictionary literal without executing arbitrary code
    dict_txt = dict(ast.literal_eval(f_data.readlines()[0]))


# Get the data
def get_data(sentence):
    # Convert the string into a list of IDs
    data = []
    for s in sentence:
        # Replace characters missing from the dictionary with the <unk> token
        if s not in dict_txt:
            s = '<unk>'
        data.append(np.int64(dict_txt[s]))
    return data
 
def batch_reader(Json_list, json_path):
    gc.collect()
    datas = []
    json_files = []
    falsefiles = []
    start = time.perf_counter()
    i = 0
    scale = 100
    for file in Json_list:
        if i % 100 == 0:
            a = '=' * int(i / len(Json_list) * 100)
            b = ' ' * (scale - int(i / len(Json_list) * 100))
            c = int(i / len(Json_list) * 100)
            dur = time.perf_counter() - start
            sys.stdout.write("\r{:^3.0f}%[{}=>{}]{:.2f}s".format(c, a, b, dur))
            sys.stdout.flush()
        i += 1
        with open(os.path.join(json_path, file), "r", encoding='utf-8') as f:
            try:
                text = json.load(f)
                json_text = text['content']
            except (ValueError, KeyError, TypeError):
                falsefiles.append(file)
                continue
        json_files.append(os.path.join(json_path, file))
        datas.append(get_data(json_text))
    for file in falsefiles:
        # The invalid files live in json_path, next to the valid ones
        os.remove(os.path.join(json_path, file))
    file_count = len(Json_list) - len(falsefiles)
    a = '=' * 100
    b = ' ' * 0
    c = 100
    dur = time.perf_counter() - start
    sys.stdout.write("\r{:^3.0f}%[{}=>{}]{:.2f}s".format(c, a, b, dur))
    sys.stdout.flush()
    print('\nText data loaded: {0} items in total, {2} valid, {1} invalid (deleted)!'.format(len(Json_list), len(falsefiles), file_count))
    if not datas:
        # Nothing valid in this batch, so there is nothing to classify
        return
    print('AI is loading the classification model...')
    # Number of tokens in each text
    base_shape = [[len(c) for c in datas]]

    # Build the input tensor
    tensor_words = fluid.create_lod_tensor(datas, base_shape, place)

    # Run the prediction
    result = exe.run(program=infer_program,
                     feed={feeded_var_names[0]: tensor_words},
                     fetch_list=target_var)
    print('Model loaded!')
    # Category names: culture, entertainment, sports, finance, real estate,
    # automobiles, education, technology, international, securities
    names = ['文化', '娛樂', '體育', '財經', '房產', '汽車', '教育', '科技', '國際', '證券']
    count = np.zeros(10)
    print('AI is classifying the text data and uploading it to ES:')
    start = time.perf_counter()
    for i in range(file_count):
        if i % 100 == 0:
            a = '=' * int(i / file_count * 100)
            b = ' ' * (scale - int(i / file_count * 100))
            c = int(i / file_count * 100)
            dur = time.perf_counter() - start
            sys.stdout.write("\r{:^3.0f}%[{}=>{}]{:.2f}s".format(c, a, b, dur))
            sys.stdout.flush()
        # Pick the label with the highest probability
        lab = int(np.argmax(result[0][i]))
        count[lab] += 1
        with open(json_files[i], 'r', encoding='utf-8') as load_f:
            try:
                text = json.load(load_f)
            except ValueError:
                continue
        text['content_type'] = names[lab]

        # The file name up to its first dot is the document id (works on every platform)
        doc_id = os.path.basename(json_files[i]).split('.')[0]
        # The link list is only needed for PageRank and is not indexed
        text.pop('link', None)
        es.index(index='page', doc_type='_doc', id=doc_id, body=text)
    a = '=' * 100
    b = ' ' * 0
    c = 100
    dur = time.perf_counter() - start
    sys.stdout.write("\r{:^3.0f}%[{}=>{}]{:.2f}s".format(c, a, b, dur))
    sys.stdout.flush()
    print("\n" + "%d text items classified! All uploaded to ES" % (file_count))
 
 
def run():
    # Load the text data
    print('AI is loading the text data...')
    # RawData sits two levels above this script's directory
    json_path = os.path.normpath(os.path.join(dir_path, '..', '..', 'RawData'))
    Json_list = os.listdir(json_path)
    batch_size = 500
    if len(Json_list) > batch_size:
        Json_batch = 0
        print('{0} texts found, processing in batches...'.format(len(Json_list)))
        for batch_id in range(math.ceil(len(Json_list) / batch_size)):
            a = min(batch_size, len(Json_list) - Json_batch)
            print('Processing batch {0} ({1} items)...'.format(batch_id + 1, a))
            batch_reader(Json_list[Json_batch:Json_batch + a], json_path)
            Json_batch += a
    else:
        batch_reader(Json_list, json_path)
 
 
if __name__ == '__main__':
    run()