
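Scraping TT purchase records from JD with selenium and analyzing the trends

Updated: 2019-08-15 09:06:09 Author: alunbar

This article walks through scraping TT purchase records from JD with selenium and analyzing the sales trends. It includes detailed sample code and may be a useful reference for study or work.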

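I recently learned some web-scraping techniques and wanted a small project to test what I had learned. While I was browsing JD, the site recommended a TT (condom) product to me. After clicking through and looking around, I decided to scrape the TT product data and analyze it to see which brand sells best.

This article uses selenium to scrape the TT data and stores it in a MongoDB database.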

Scraping TT product information

The URL of the TT product list page is

https://list.jd.com/list.html?cat=9192,9196,1502&page=1&sort=sort_totalsales15_desc&trans=1&JL=6_0_0#J_main

The URL has a page parameter that gives the page number. Changing this parameter lets you crawl the different pages of TT products.
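
As a small illustration of the paging scheme, the sketch below rebuilds the list URL for a given page number. The helper name `build_list_url` is hypothetical; the parameter values are copied from the URL above.

```python
from urllib.parse import urlencode

LIST_URL = "https://list.jd.com/list.html"


def build_list_url(page):
    """Return the list-page URL for the given 1-based page number."""
    params = {
        "cat": "9192,9196,1502",
        "page": page,
        "sort": "sort_totalsales15_desc",
        "trans": 1,
        "JL": "6_0_0",
    }
    # safe="," keeps the commas in `cat` unescaped, matching the original URL
    return LIST_URL + "?" + urlencode(params, safe=",") + "#J_main"


# Print the URLs of the first three pages
for page in range(1, 4):
    print(build_list_url(page))
```

Only the page value changes between requests, so a crawl loop can simply iterate over page numbers.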

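Use the browser's developer tools to work out how to scrape the TT product information, such as name, brand, price and number of comments.

As the figure above shows, each TT product corresponds in the page source to an li node with class gl-item (<li class='gl-item'>). The data-sku attribute inside the li node is the product ID, which is needed later to fetch the product's comments, and brand_id is the brand ID. The div with class p-price holds the price of the product, and the div with class p-commit holds the total number of comments.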

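At first I used requests, but it never managed to get the TT price and comment information. Switching to selenium finally solved the problem. If anyone knows how to solve this with requests alone, I would be glad to hear it.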
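
One plausible explanation, offered here as an assumption rather than something the article confirms, is that the price and comment counts are filled in by JavaScript after the page loads, so the raw HTML that requests receives does not contain them. The sketch below shows how a rendered page could be fetched with selenium. The function name `fetch_rendered_html`, the choice of Chrome, and the CSS selector used for the wait are all illustrative assumptions, not the author's code.

```python
def fetch_rendered_html(url, timeout=10):
    """Load `url` in a real browser and return the HTML after scripts have run."""
    # Imported inside the function so this sketch can be loaded
    # even on a machine where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Chrome and a matching driver are available
    try:
        driver.get(url)
        # Wait until at least one price element is present in the DOM.
        # The selector is an assumption based on the class names described above.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".gl-item .p-price i"))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out
        driver.quit()
```

The returned HTML string can then be passed to a parser such as pyquery, which is what the `parse_product(page, html)` function further down expects.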

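Next, scraping the TT product comments.

Clicking a TT product opens its detail page. Click "Product Reviews" and tick the "Only show reviews for the current product" option (without it, you see the reviews for the whole product series). The reviews for that product then appear. Again, use the developer tools to see how to scrape them.

As the figure above shows, selecting the Network tab in the developer tools reveals the link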

https://club.jd.com/discussion/getSkuProductPageImageCommentList.action?productId=3521615&isShadowSku=0&callback=jQuery6014001&page=2&pageSize=10&_=1547042223100

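This link returns JSON data. productId is the value of the data-sku attribute from the TT product page, and page is the page number of the comments. In the returned JSON, content is the comment text, and the time field (read as creationTime in the code below) is treated as the order time.

The code is as follows: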
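
Because the URL carries a callback parameter, the response body is wrapped in a function call (JSONP) rather than being bare JSON. A minimal sketch of unwrapping it without depending on the length of the callback name follows. The helper `parse_jsonp` and the sample payload are made up for illustration.

```python
import json
import re

# Matches  callbackName( ...payload... )  with an optional trailing semicolon
JSONP_RE = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def parse_jsonp(text):
    """Return the JSON payload of a JSONP response as Python objects."""
    match = JSONP_RE.match(text)
    if match is None:
        raise ValueError("response is not in callback(...) form")
    return json.loads(match.group(1))


# Made-up sample shaped like the response described above
sample = 'jQuery6014001({"comments": [{"content": "ok", "creationTime": "2019-01-09 21:57:03"}]});'
data = parse_jsonp(sample)
print(data["comments"][0]["content"])  # prints: ok
```

Slicing a fixed number of characters off the front also works, but only as long as the callback name keeps the same length.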

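import json
import time

import requests
from pyquery import PyQuery as pq

# headers, get_random_time(), collection and collection_comment
# are defined elsewhere in the author's script


def parse_product(page, html):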
  doc = pq(html)
  li_list = doc('.gl-item').items()
  for li in li_list:
    product_id = li('.gl-i-wrap').attr('data-sku')
    brand_id = li('.gl-i-wrap').attr('brand_id')
    time.sleep(get_random_time())
    title = li('.p-name').find('em').text()
    price_items = li('.p-price').find('.J_price').find('i').items()
    price = 0
    for price_item in price_items:
      price = price_item.text()
      break
    total_comment_num = li('.p-commit').find('strong a').text()
    # Counts are shown like "1.2万+" (万 = 10,000); both character forms are accepted
    if total_comment_num.endswith(("万+", "萬+")):
      print('Total comment count: ' + total_comment_num)
      total_comment_num = str(int(float(total_comment_num[:-2]) * 10000))
      print('Converted total comment count: ' + total_comment_num)
    elif total_comment_num.endswith("+"):
      total_comment_num = total_comment_num[:-1]
    condom = {}
    condom["product_id"] = product_id
    condom["brand_id"] = brand_id
    condom["condom_name"] = title
    condom["total_comment_num"] = total_comment_num
    condom["price"] = price
    comment_url = 'https://club.jd.com/comment/skuProductPageComments.action?callback=fetchJSON_comment98vv117396&productId=%s&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1'
    comment_url = comment_url %(product_id)
    response = requests.get(comment_url,headers = headers)
    if response.text == '':
      for i in range(0,10):
        time.sleep(get_random_time())
        try:
          response = requests.get(comment_url, headers=headers)
        except requests.exceptions.ProxyError:
          time.sleep(get_random_time())
          response = requests.get(comment_url, headers=headers)
        if response.text:
          break
        else:
          continue
    text = response.text
    text = text[28:len(text) - 2]
    jsons = json.loads(text)
    productCommentSummary = jsons.get('productCommentSummary')
    # productCommentSummary = response.json().get('productCommentSummary')
    poor_count = productCommentSummary.get('poorCount')
    general_count = productCommentSummary.get('generalCount')
    good_count = productCommentSummary.get('goodCount')
    comment_count = productCommentSummary.get('commentCount')
    poor_rate = productCommentSummary.get('poorRate')
    good_rate = productCommentSummary.get('goodRate')
    general_rate = productCommentSummary.get('generalRate')
    default_good_count = productCommentSummary.get('defaultGoodCount')
    condom["poor_count"] = poor_count
    condom["general_count"] = general_count
    condom["good_count"] = good_count
    condom["comment_count"] = comment_count
    condom["poor_rate"] = poor_rate
    condom["good_rate"] = good_rate
    condom["general_rate"] = general_rate
    condom["default_good_count"] = default_good_count
    collection.insert_one(condom)
    comments = jsons.get('comments')
    if comments:
      for comment in comments:
        print('Parsing comment')
        condom_comment = {}
        reference_time = comment.get('referenceTime')
        content = comment.get('content')
        product_color = comment.get('productColor')
        user_client_show = comment.get('userClientShow')
        user_level_name = comment.get('userLevelName')
        is_mobile = comment.get('isMobile')
        creation_time = comment.get('creationTime')
        guid = comment.get("guid")
        condom_comment["reference_time"] = reference_time
        condom_comment["content"] = content
        condom_comment["product_color"] = product_color
        condom_comment["user_client_show"] = user_client_show
        condom_comment["user_level_name"] = user_level_name
        condom_comment["is_mobile"] = is_mobile
        condom_comment["creation_time"] = creation_time
        condom_comment["guid"] = guid
        collection_comment.insert_one(condom_comment)
    parse_comment(product_id)
def parse_comment(product_id):
  comment_url = 'https://club.jd.com/comment/skuProductPageComments.action?callback=fetchJSON_comment98vv117396&productId=%s&score=0&sortType=5&page=%d&pageSize=10&isShadowSku=0&fold=1'
  for i in range(1,200):
    time.sleep(get_random_time())
    time.sleep(get_random_time())
    print('Fetching comment page ' + str(i))
    url = comment_url%(product_id,i)
    response = requests.get(url, headers=headers,timeout=10)
    print(response.status_code)
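    if response.text == '':
      # Retry the same page; use the formatted url (not the template) and
      # a throwaway loop variable so the page counter i is not overwritten
      for _ in range(10):
        print('No data returned, retrying')
        time.sleep(get_random_time())
        response = requests.get(url, headers=headers, timeout=10)
        if response.text:
          break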
    text = response.text
    print(text)
    text = text[28:len(text) - 2]
    print(text)
    jsons = json.loads(text)
    comments = jsons.get('comments')
    if comments:
      for comment in comments:
        print('Parsing comment')
        condom_comment = {}
        reference_time = comment.get('referenceTime')
        content = comment.get('content')
        product_color = comment.get('productColor')
        user_client_show = comment.get('userClientShow')
        user_level_name = comment.get('userLevelName')
        is_mobile = comment.get('isMobile')
        creation_time = comment.get('creationTime')
        guid = comment.get("guid")
        id = comment.get("id")
        condom_comment["reference_time"] = reference_time
        condom_comment["content"] = content
        condom_comment["product_color"] = product_color
        condom_comment["user_client_show"] = user_client_show
        condom_comment["user_level_name"] = user_level_name
        condom_comment["is_mobile"] = is_mobile
        condom_comment["creation_time"] = creation_time
        condom_comment["guid"] = guid
        condom_comment["id"] = id
        collection_comment.insert_one(condom_comment)
    else:
      break
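
The code above stores total_comment_num as a string after trimming the "+" suffix. As a hedged alternative, the sketch below converts the displayed count straight to an integer. The helper name `parse_comment_count` is hypothetical, and the input formats ("1.2万+", "3000+", "57") are assumptions based on the suffixes the code above checks for.

```python
def parse_comment_count(text):
    """Convert strings such as '1.2万+', '3000+' or '57' to an integer."""
    text = text.strip().rstrip("+")
    if text.endswith(("万", "萬")):
        # 万 means ten thousand; round() guards against float artifacts
        return int(round(float(text[:-1]) * 10000))
    return int(text) if text else 0


for raw in ["1.2万+", "118万+", "3000+", "57"]:
    print(raw, "->", parse_comment_count(raw))
```

Storing the count as a number also makes it possible to sort on the field numerically later on.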

To get the full code for scraping the TT data and comments, follow my official account "python_ai_bigdata" and reply "TT".

In total I scraped 8,934 product records and 170,000 comment (purchase) records.

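Brands with the most products

First, analyze the 8,934 products to see which brand has the most TT products on JD. There are too many brands to show them all (299 brands sell TT on JD), so only the top 10 are kept.

The figure above shows that Durex (nicknamed 杜杜) ranks first, Okamoto second and Jissbon (nicknamed 邦邦) third. The top 10 brands are Durex (杜蕾斯), Okamoto (岡本), Jissbon (杰士邦), 倍力樂, 名流, 第六感, 尚牌, 赤尾, 諾絲 and 米奧. Five of these I had never seen before: 倍力樂, 名流, 尚牌, 赤尾 and 米奧. The others are familiar, especially Durex and Jissbon, which occupy prime spots next to supermarket checkout counters all year round.

Of these 10 brands, Durex is from the UK and Okamoto is from Japan. Jissbon, 第六感, 赤尾, 米奧 and 名流 are Chinese brands, and 第六感 is a condom brand owned by Jissbon. 倍力樂 is a Sino-US joint venture brand, 尚牌 is from Thailand and 諾絲 is from the US.

Code: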

import pymongo
import pandas as pd
import matplotlib.pyplot as plt

# Load every product document from MongoDB into a DataFrame
client = pymongo.MongoClient(host='localhost', port=27017)
db = client.condomdb
condom_new = db.condom_new
cursor = condom_new.find()
condom_df = pd.DataFrame(list(cursor))

# Count products per brand and keep the 10 brands with the most products
brand_name_df = condom_df['brand_name'].to_frame()
brand_name_df['condom_num'] = 1
brand_name_group = brand_name_df.groupby('brand_name').sum()
brand_name_sort = brand_name_group.sort_values(by='condom_num', ascending=False)
brand_name_top10 = brand_name_sort.head(10)

index_list = []
labels = []
value_list = []
for index, row in brand_name_top10.iterrows():
  index_list.append(index)
  labels.append(index)
  value_list.append(int(row['condom_num']))
plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly

# Draw the pie chart of the top 10 brands
series_condom = pd.Series(value_list, index=index_list, name='')
series_condom.plot.pie(labels=labels,
         autopct='%.2f', fontsize=10, figsize=(10, 10))
plt.show()

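Best-selling products

The number of reviews is a reasonable proxy for how well a product sells: the product with the most reviews is usually also the best seller.

Each product record has a total-comment-count field, so we sort on that field to find the 10 products with the most comments (that is, the best sellers).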
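
The article does not show the code for this ranking, so the following is only a plain-Python sketch of the idea. It assumes records shaped like the documents written by parse_product, where total_comment_num is stored as a string and therefore has to be converted before sorting. The helper `top_products` and the sample records are made up.

```python
def top_products(products, n=10):
    """Return the n products with the highest total_comment_num."""
    return sorted(
        products,
        key=lambda p: int(p["total_comment_num"]),  # stored as a string, so convert
        reverse=True,
    )[:n]


# Made-up records using the field names from parse_product
products = [
    {"condom_name": "A", "total_comment_num": "9000"},
    {"condom_name": "B", "total_comment_num": "1180000"},
    {"condom_name": "C", "total_comment_num": "250000"},
]

for product in top_products(products, n=2):
    print(product["condom_name"], product["total_comment_num"])
```

Sorting the raw strings instead would compare them character by character, which would place "9000" ahead of "1180000".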

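The figure above shows that Durex products still sell best, taking 6 of the 10 places. Durex's 情愛四合一 ranks first with sales of 1.18 million.

Ultra-thin TT are the most popular, taking 8 places, and long-lasting types are also fairly popular. A spiked sleeve (狼牙套) even made the list, which really surprised me.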

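Sales analysis

The figure below shows the 10 days with the highest TT sales.

These 10 days fall in June, November and December, which is probably related to the well-known 618, Double 11 and Double 12 shopping festivals.

Many e-commerce platforms now have their own shopping festivals, such as 618, Double 11 and Double 12. A product shows at most 100 pages of comments with 10 comments per page, so at most 1,000 comments can be crawled per product. For 情愛四合一, with sales of 1.18 million, 1,000 comments are not representative. Overall, though, the analysis above suggests that sales are generally higher in the months when platforms run promotions.

The figure below is a bar chart of TT sales by month, which further supports this.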
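
The article does not include the code behind these two charts. As a rough sketch of how the per-day and per-month counts could be derived, the example below treats each stored comment as one purchase and groups by the creation_time field, following the article's reading of that field as the order time (reference_time could be swapped in). The "YYYY-MM-DD HH:MM:SS" timestamp format, the helper `sales_by_period` and the sample records are assumptions.

```python
from collections import Counter
from datetime import datetime


def sales_by_period(comments, fmt):
    """Count comments per period; fmt is a strftime pattern such as '%Y-%m-%d' or '%m'."""
    counts = Counter()
    for comment in comments:
        # Assumed timestamp format: "YYYY-MM-DD HH:MM:SS"
        created = datetime.strptime(comment["creation_time"], "%Y-%m-%d %H:%M:%S")
        counts[created.strftime(fmt)] += 1
    return counts


# Made-up records using the field name stored by parse_comment
comments = [
    {"creation_time": "2018-06-18 10:00:00"},
    {"creation_time": "2018-06-18 21:30:00"},
    {"creation_time": "2018-11-11 00:05:00"},
    {"creation_time": "2018-12-12 09:15:00"},
]

by_day = sales_by_period(comments, "%Y-%m-%d")
by_month = sales_by_period(comments, "%m")
print(by_day.most_common(10))  # the 10 busiest days
print(sorted(by_month.items()))  # totals per calendar month
```

Keep in mind the cap mentioned above: with at most 1,000 comments per product, these counts are a sample of purchases, not full sales figures.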