Analyzing your browsing behavior with Python: see what you actually do online

Updated: 2019-08-13 09:05:36 Author: 孤鳥

This article shows how to analyze your own browsing behavior with Python and walks through the example code in detail. It may be a useful reference for study or work.

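Introduction

Want to see what you have been doing for the past year? Want to know whether your time online goes to slacking off or to real work? Need to write an annual report but have no data to back it up? Here is a tool for that.

This is a Chrome history analyzer that helps you understand your own browsing history. It only works with Chrome or with browsers built on the Chrome engine.

On its page you can see the top-ten rankings of the domains and URLs you visited and of your busiest days, along with the related charts.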

Code walkthrough

1. Directory structure

First, let's look at the overall directory structure.

Code
├─ app_callback.py             Callback functions that implement the back-end logic
├─ app_configuration.py        Web server configuration
├─ app_layout.py               Front-end page layout
├─ app_plot.py                 Chart drawing
├─ app.py                      Web server entry point
├─ assets                      Static resources needed by the web page
│  ├─ css                      Stylesheets for the front-end layout
│  │  ├─ custum-styles_phyloapp.css
│  │  └─ stylesheet.css
│  ├─ image                    Front-end logo icon
│  │  └─ GitHub-Mark-Light.png
│  └─ static                   Front-end help page
│     ├─ help.html
│     └─ help.md
├─ history_data.py             Parses the Chrome history file
└─ requirement.txt             Dependencies required by the program

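app_callback.py

The program is written in Python and served with Dash, a lightweight web framework. app_callback.py holds the callbacks, which you can think of as the back-end logic.

app_configuration.py

As the name suggests, this file configures the web server.

app_layout.py

The front-end page layout, including the HTML and CSS elements.

app_plot.py

Builds the chart data shown on the front end.

app.py

Starts the web server.

assets

The static resource directory, which stores the static files the page needs.

history_data.py

Connects to the SQLite database and parses the Chrome history file.

requirement.txt

The libraries required to run the program.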

2. Parsing the history file

The file responsible for parsing the history data is history_data.py. Let's go through it function by function.

import sqlite3

# Query the database
def query_sqlite_db(history_db, query):
  # Open the SQLite database
  # Note: History is a file without an extension, not a directory.
  conn = sqlite3.connect(history_db)
  cursor = conn.cursor()
  # In a SQLite viewer you can see that visits.url refers to urls.id,
  # so the query passed in can join the urls and visits tables on those columns
  select_statement = query
  # Execute the query
  cursor.execute(select_statement)
  # Fetch every row; fetchall() returns a list of tuples
  results = cursor.fetchall()
  # Close the cursor and the connection
  cursor.close()
  conn.close()
  return results

The flow of this function is:

Connect to the SQLite database, execute the query, fetch the query results, close the database connection, and return the results.
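The connect, query, close sequence described above can also be written with `contextlib.closing`, so that the connection is closed even when the query raises. This is a minimal sketch, not the project's code: the two-table schema and the `build_demo_db` helper are hypothetical stand-ins for a real Chrome History file, which has many more columns.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def query_sqlite_db(history_db, query):
    """Run one query and return all rows as a list of tuples."""
    # closing() calls close() on exit, even if execute() raises.
    with closing(sqlite3.connect(history_db)) as conn:
        with closing(conn.cursor()) as cursor:
            cursor.execute(query)
            return cursor.fetchall()


def build_demo_db(path):
    """Create a tiny stand-in for a History file (hypothetical, simplified schema)."""
    with closing(sqlite3.connect(path)) as conn:
        conn.execute("CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT, title TEXT)")
        conn.execute("CREATE TABLE visits (id INTEGER PRIMARY KEY, url INTEGER, visit_time INTEGER)")
        conn.execute("INSERT INTO urls VALUES (1, 'https://example.com/', 'Example')")
        conn.executemany(
            "INSERT INTO visits (url, visit_time) VALUES (?, ?)",
            [(1, 100), (1, 200)],
        )
        conn.commit()


demo_path = os.path.join(tempfile.mkdtemp(), "History")
build_demo_db(demo_path)
rows = query_sqlite_db(
    demo_path,
    "SELECT urls.id, urls.url, visits.visit_time "
    "FROM urls, visits WHERE urls.id = visits.url "
    "ORDER BY visits.visit_time;",
)
print(rows)
```

Each visit produces one row, so a URL that was opened twice appears twice in the joined result.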

# Get the sorted history data
def get_history_data(history_file_path):
  try:
    # Join urls and visits; each row comes back as a tuple
    select_statement = "SELECT urls.id, urls.url, urls.title, urls.last_visit_time, urls.visit_count, visits.visit_time, visits.from_visit, visits.transition, visits.visit_duration FROM urls, visits WHERE urls.id = visits.url;"
    result = query_sqlite_db(history_file_path, select_statement)
    # Sort by the first column, then the second, and so on:
    # sorted() compares the key tuples element by element
    result_sort = sorted(result, key=lambda x: (x[0], x[1], x[2], x[3], x[4], x[5], x[6], x[7], x[8]))

    # Return the sorted data
    return result_sort
  except Exception:
    # Signal failure to the caller, which checks for the string 'error'
    return 'error'

The flow of this function is:

Build the query select_statement, call query_sqlite_db() to fetch the parsed history records, and sort the returned rows column by column. The result is the parsed, sorted history data.
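The `last_visit_time` and `visit_time` columns selected above are not Unix timestamps. Chrome stores them as microseconds since 1601-01-01 UTC, so they need converting before visits can be grouped by day or hour. The sketch below shows one way to do that; the helper names are hypothetical, and it reports UTC, whereas a real chart would probably want local time.

```python
from datetime import datetime, timedelta, timezone

# Chrome/WebKit timestamps count microseconds from this instant.
CHROME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def chrome_time_to_datetime(microseconds):
    """Convert a Chrome timestamp to a timezone-aware UTC datetime."""
    return CHROME_EPOCH + timedelta(microseconds=microseconds)


def visit_day(row):
    """Return the UTC calendar day of one joined row as 'YYYY-MM-DD'.

    Index 5 is visits.visit_time in the SELECT statement above.
    """
    return chrome_time_to_datetime(row[5]).date().isoformat()


# A made-up row in the same column order as the query.
sample_row = (1, "https://example.com/", "Example", 0, 1, 13210000000000000, 0, 0, 0)
print(visit_day(sample_row))
```

Grouping rows by `visit_day` gives the per-day visit counts that the daily scatter plot needs.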

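3. Basic web server configuration

The files responsible for the basic web server configuration are app_configuration.py and app.py. They set the server's port number, access permissions, static resource directory, and so on.

4. Front-end page deployment

The files related to the front end are app_layout.py, app_plot.py, and the assets directory.

The front-end layout consists mainly of the following elements:

History file upload component
Page visit count ranking component
Total time spent per page ranking component
Daily page visit count scatter plot component
Visit count by time of day scatter plot component (for a given day)
Top 10 most visited URLs component
Search keyword ranking component
Search engine usage component

In app_layout.py these components are configured in much the same way, and the configuration resembles ordinary HTML and CSS, so we only use the page visit count ranking component as an example.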

# Page visit count ranking
html.Div(
  style={'margin-bottom':'150px'},
  children=[
    html.Div(
      style={'border-top-style':'solid','border-bottom-style':'solid'},
      className='row',
      children=[
        html.Span(
          children='Page visit count ranking, ',
          style={'font-weight': 'bold', 'color':'red'}
        ),

        html.Span(
          children='Number to show:',
        ),
        dcc.Input(
          id='input_website_count_rank',
          type='text',
          value=10,
          style={'margin-top':'10px', 'margin-bottom':'10px'}
        ),
      ]
    ),


    html.Div(
      style={'position': 'relative', 'margin': '0 auto', 'width': '100%', 'padding-bottom': '50%', },
      children=[
        dcc.Loading(
          children=[
            dcc.Graph(
              id='graph_website_count_rank',
              style={'position': 'absolute', 'width': '100%', 'height': '100%', 'top': '0',
                  'left': '0', 'bottom': '0', 'right': '0'},
              config={'displayModeBar': False},
            ),
          ],
          type='dot',
          style={'position': 'absolute', 'top': '50%', 'left': '50%', 'transform': 'translate(-50%,-50%)'}
        ),
      ],
    )
  ]
)

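As you can see, although the layout is written in Python, anyone with front-end experience can easily add or remove elements from it, so we will not go into how to use HTML and CSS.

app_plot.py is mainly about drawing charts. It uses plotly, a plotting library for interactive, web-based charts.
Here we use the page visit frequency bar chart as an example of how to draw with plotly.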

# Draw the bar chart of the page visit frequency ranking
def plot_bar_website_count_rank(value, history_data):
  # Visit counts keyed by simplified URL
  dict_data = {}
  # Walk through every history record
  for data in history_data:
    url = data[1]
    # Simplify the URL
    key = url_simplification(url)
    # Count this visit; the first occurrence of a key starts at 1
    dict_data[key] = dict_data.get(key, 0) + 1
  # Keep the k most frequent entries
  k = convert_to_number(value)
  top_10_dict = get_top_k_from_dict(dict_data, k)

  figure = go.Figure(
    data=[
      go.Bar(
        x=[i for i in top_10_dict.keys()],
        y=[i for i in top_10_dict.values()],
        name='bar',
        marker=go.bar.Marker(
          color='rgb(55, 83, 109)'
        )
      )
    ],
    layout=go.Layout(
      showlegend=False,
      margin=go.layout.Margin(l=40, r=0, t=40, b=30),
      paper_bgcolor='rgba(0,0,0,0)',
      plot_bgcolor='rgba(0,0,0,0)',
      xaxis=dict(title='Website'),
      yaxis=dict(title='Count')
    )
  )

  return figure

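The flow of this function is:

First, iterate over history_data, the rows returned after parsing the database file, take the URL of each row, and call url_simplification(url) to simplify it. Each simplified URL is then counted in a dictionary.
Call get_top_k_from_dict(dict_data, k) to get the k entries with the largest counts from dict_data.
Next, draw the bar chart with go.Bar(). Here x and y are the categories and their corresponding values, both given as lists, while xaxis and yaxis set the titles of the two axes.
Finally, return a figure object so that it can be passed to the front end.
The assets directory contains the image and css files, all of which are used for the front-end layout.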
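The article does not show the three helpers that the plotting function calls. The sketch below is one plausible way to write them with the standard library only. The function names come from the code above, but the bodies are assumptions: in particular, reducing a URL to its host name and falling back to 10 for invalid input are guesses at the intended behavior.

```python
from collections import Counter
from urllib.parse import urlsplit


def url_simplification(url):
    """Reduce a full URL to its host name (assumed behavior)."""
    host = urlsplit(url).netloc
    # Fall back to the raw string when no host can be parsed.
    return host or url


def convert_to_number(value, default=10):
    """Turn the text-box value into a positive int, else return default."""
    try:
        number = int(value)
    except (TypeError, ValueError):
        return default
    return number if number > 0 else default


def get_top_k_from_dict(dict_data, k):
    """Return a dict holding the k largest counts, highest first."""
    return dict(Counter(dict_data).most_common(k))


urls = [
    "https://a.com/x",
    "https://a.com/y",
    "https://b.org/",
    "https://c.net/",
    "https://b.org/z",
    "https://a.com/",
]
counts = Counter(url_simplification(u) for u in urls)
top = get_top_k_from_dict(counts, convert_to_number("2"))
print(top)
```

Because the returned dict keeps the ranking order, its keys and values can be passed directly as the x and y lists of the bar chart.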

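5. Back-end deployment

The file responsible for the back end is app_callback.py. It uses callbacks to update the front-end page layout.

First, let's look at the callback for the page visit frequency ranking:

# Page visit frequency ranking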
@app.callback(
  dash.dependencies.Output('graph_website_count_rank', 'figure'),
  [
    dash.dependencies.Input('input_website_count_rank', 'value'),
    dash.dependencies.Input('store_memory_history_data', 'data')
  ]
)
def update(value, store_memory_history_data):

  # The history data was loaded successfully
  if store_memory_history_data:
    history_data = store_memory_history_data['history_data']
    figure = plot_bar_website_count_rank(value, history_data)
    return figure
  else:
    # Leave the page unchanged
    raise dash.exceptions.PreventUpdate("cancel the callback")

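The flow of this function is:

First decide what the inputs are (the data that triggers the callback) and what the output is (the data the callback produces). dash.dependencies.Input marks the data that triggers the callback: dash.dependencies.Input('input_website_count_rank', 'value') means the callback fires whenever the value of the component whose id is input_website_count_rank changes, and the second Input does the same for the data of store_memory_history_data. The result of update(value, store_memory_history_data) is written to the figure property of the component whose id is graph_website_count_rank. Put simply, the callback replaces the chart.

As for update(value, store_memory_history_data) itself: it first checks that store_memory_history_data is not empty, then reads history_data from it, then calls plot_bar_website_count_rank() from app_plot.py, which returns a figure object, and returns that object to the front end. At that point the page shows the page visit frequency ranking chart.
One more thing worth covering is the file upload process. Here is the code first: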

# File upload callback
import base64
import random
import time
from os import makedirs
from os.path import exists

@app.callback(
  dash.dependencies.Output('store_memory_history_data', 'data'),
  [
    dash.dependencies.Input('dcc_upload_file', 'contents')
  ]
)
def update(contents):
  if contents is not None:
    # The upload arrives as "<content type>,<base64 payload>"
    content_type, content_string = contents.split(',')
    # Decode the uploaded file from base64
    decoded = base64.b64decode(content_string)
    # Add a suffix so that uploads do not overwrite each other.
    # Random digits plus the current timestamp make a collision very unlikely.
    suffix = [str(random.randint(0,100)) for i in range(10)]
    suffix = "".join(suffix)
    suffix = suffix + str(int(time.time()))
    # Final file name
    file_name = 'History_' + suffix
    # print(file_name)
    # Create the directory that stores uploaded files
    if not exists('data'):
      makedirs('data')

    # Path of the file to write
    path = 'data' + '/' + file_name

    # Write the file to the local disk
    with open(file=path, mode='wb+') as f:
      f.write(decoded)

    # Read the file from disk with sqlite
    # Get the history data
    history_data = get_history_data(path)

    # Get the search keyword data
    search_word = get_search_word(path)

    # Check whether the data was read correctly
    if (history_data != 'error'):
      # Parsed successfully
      date_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
      print('Received new client data, data valid, time: {}'.format(date_time))
      store_data = {'history_data': history_data, 'search_word': search_word}
      return store_data
    else:
      # Parsing failed
      date_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
      print('Received new client data, data invalid, time: {}'.format(date_time))
      return None
  return None

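The flow of this function is:

First check that the uploaded contents is not empty, then base64-decode the uploaded file. A suffix is added to the file name to keep uploads from overwriting each other, and the file is written to the local disk.
Once the file is written, it is read back with sqlite. If it is read correctly the parsed data is returned, otherwise None is returned.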
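The decode-and-save step can be isolated from Dash and tried on its own. The sketch below assumes the upload string has the usual data URL shape, `data:<type>;base64,<payload>`, and swaps the random-digit suffix for `uuid.uuid4()`, which is a different technique from the one in the callback above. The names `decode_upload` and `save_upload` are hypothetical.

```python
import base64
import os
import tempfile
import uuid


def decode_upload(contents):
    """Split a 'data:<type>;base64,<payload>' string and decode the payload."""
    header, _, payload = contents.partition(",")
    if not header.endswith(";base64") or not payload:
        raise ValueError("expected a base64 data URL")
    return base64.b64decode(payload)


def save_upload(contents, directory):
    """Write the decoded upload to a uniquely named file and return its path."""
    os.makedirs(directory, exist_ok=True)
    # uuid4 gives a random 128-bit name, so no manual suffix logic is needed.
    path = os.path.join(directory, "History_" + uuid.uuid4().hex)
    with open(path, "wb") as f:
        f.write(decode_upload(contents))
    return path


# Build a fake upload whose payload is the 16-byte SQLite file header.
payload = base64.b64encode(b"SQLite format 3\x00").decode("ascii")
sample = "data:application/octet-stream;base64," + payload

saved_path = save_upload(sample, tempfile.mkdtemp())
with open(saved_path, "rb") as f:
    saved_bytes = f.read()
print(saved_path, len(saved_bytes))
```

Checking that the saved file starts with the SQLite header is a cheap way to reject uploads that are not database files before handing them to sqlite3.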

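How to run

Online demo: http://39.106.118.77:8090 (an ordinary server, please do not load-test it)

Running the program is simple. Just run the following commands:

# Change into the project directory
cd directory_name
# Uninstall the dependencies first
pip uninstall -y -r requirement.txt
# Then reinstall the dependencies
pip install -r requirement.txt
# Start the program
python app.py
# Once it is running, open http://localhost:8090 in a browser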

Additional notes

The complete source code is hosted on GitHub and can be downloaded from there.

The project is updated regularly. Stars are welcome.
