
Using Python to Analyze Douban Reviews of the Viral Hit Squid Game

Updated: 2021-10-08 10:04:35  Author: 小張Python

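This article shows how to scrape reviews with Python, specifically some of the Douban reviews of Squid Game, run a simple analysis on the data, and take a fresh look at the show from a data perspective. Readers who need something similar can use it as a reference.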

Tools and Technologies

Before getting started, here is the tech stack and tooling used in this article, grouped into four areas:

Languages: Python, Vue, JavaScript
Storage: MongoDB
Libraries: echarts, Pymongo, WordArt…
Software: Photoshop

Data Collection

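The target site for this data collection is Douban. Douban has anti-scraping measures in place: viewing more than 10 pages of comments requires a logged-in user. Because my own account had been banned earlier, I could only collect about two hundred records.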

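As for why the account was banned: back when I was learning web scraping, I picked up some "Douban simulated login" code from somewhere. I had no idea whether the code was sound, and I recklessly tried it with my own account. The account was banned right after that test, and permanently.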

Figure 1

A reminder for future scraping work: when simulating a login, use a test account wherever possible and avoid using your own.

The collection itself is straightforward: change the start parameter in the URL shown in Figure 2, increasing the offset by 20 each time to build the URL of the next page.

Figure 2
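The pagination rule described above can be sketched as a small helper. This is only an illustration: the function name `comment_page_urls` and its parameters are hypothetical, while the base URL and query parameters are taken from the article's scraping code.

```python
from urllib.parse import urlencode

# Base comments URL, taken from the article's scraping code.
BASE_URL = "https://movie.douban.com/subject/34812928/comments"


def comment_page_urls(total, page_size=20):
    """Return one URL per page, advancing `start` by `page_size` each time."""
    urls = []
    for offset in range(0, total, page_size):
        query = urlencode({
            "start": offset,
            "limit": page_size,
            "status": "P",
            "sort": "new_score",
        })
        urls.append(BASE_URL + "?" + query)
    return urls


# 220 comments at 20 per page gives start=0, 20, ..., 200.
for url in comment_page_urls(220):
    print(url)
```

Building the query with `urlencode` instead of string formatting keeps the parameters in one place, which makes it easier to change the sort order or page size later.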

With the request URL in hand, send a GET request with requests and parse the returned HTML to extract the data we need. The core scraping code is below:

import time

import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

# Assumed setup, not shown in the original: fill in your own Cookie and User-Agent.
headers = {"User-Agent": "<your User-Agent>", "Cookie": "<your Douban cookie>"}
# The connection string, database, collection, and file names are placeholders.
collection = MongoClient("mongodb://localhost:27017")["douban"]["squid_game_comments"]

with open("comments.tsv", "a", encoding="utf-8") as f:
    for offset in range(0, 220, 20):  # 11 pages, 20 comments per page
        url = "https://movie.douban.com/subject/34812928/comments?start={}&limit=20&status=P&sort=new_score".format(offset)
        res = requests.get(url, headers=headers)
        soup = BeautifulSoup(res.text, "lxml")
        time.sleep(2)  # pause between pages to keep the request rate low
        for comment_item in soup.select("#comments > .comment-item"):
            try:
                avatar = comment_item.select(".avatar a img")[0].get("src")
                name = comment_item.select(".comment h3 .comment-info a")[0]
                rate = comment_item.select(".comment h3 .comment-info span:nth-child(3)")[0]
                date = comment_item.select(".comment h3 .comment-info span:nth-child(4)")[0]
                comment = comment_item.select(".comment .comment-content span")[0]
                data_json = {
                    "avatar": avatar,
                    "name": str(name.string).strip(),
                    # The first class looks like "allstar50"; keep only the digits.
                    "rate": str(rate.get("class")[0]).replace("allstar", "").strip(),
                    "date": str(date.string).strip(),
                    "comment": str(comment.string).strip(),
                }
                # Copy the values first: insert_one adds an _id key to data_json.
                data_item = list(data_json.values())
                # Skip users already stored (the avatar URL serves as the dedup key).
                if not collection.find_one({"avatar": avatar}):
                    print("data_json is {}".format(data_json))
                    collection.insert_one(data_json)
                f.write("\t".join(data_item))
                f.write("\n")
            except Exception as e:
                # Items that do not match the expected markup are reported and skipped.
                print(e)
                continue

When scraping Douban, remember to include a cookie and a User-Agent in the request headers; otherwise the response comes back without any data.
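As a rough sketch of the headers requirement, the helper below builds the dictionary that would be passed as `headers=` to `requests.get`. The function name `build_headers` is hypothetical, and the placeholder values must be replaced with a cookie and User-Agent string copied from your own browser session.

```python
def build_headers(cookie, user_agent):
    """Return request headers carrying the login cookie and a browser User-Agent."""
    if not cookie or not user_agent:
        # Fail early rather than send a request that would return no data.
        raise ValueError("both cookie and user_agent are required")
    return {"Cookie": cookie, "User-Agent": user_agent}


# Placeholder values for illustration only.
headers = build_headers("<your Douban cookie>", "<your User-Agent>")
print(sorted(headers))
```

Failing early on a missing value makes an empty result easier to diagnose than silently sending a request without credentials.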

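To make the data easier to pull out for visualization later, this article uses MongoDB for storage. There are 211 records in total, with the fields avatar, name, rate, date, and comment, which hold the user's avatar, user name, star rating, date, and comment text respectively. See Figure 3 for the result.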
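To show how the stored rate field might feed a later analysis, here is a minimal sketch that tallies records by star level. It assumes rate is stored as a string such as "50" (what remains of a class like allstar50 after the prefix is removed), so dividing by 10 gives the star count. The function names and the sample records are made up for illustration; real records would be read from the MongoDB collection.

```python
from collections import Counter


def rate_to_stars(rate):
    """Convert a stored rate string such as "50" to a star count (5)."""
    return int(rate) // 10


def star_distribution(records):
    """Count how many records fall on each star level."""
    return Counter(rate_to_stars(r["rate"]) for r in records)


# Made-up records for illustration only; real ones come from the collection.
sample = [
    {"name": "user_a", "rate": "50"},
    {"name": "user_b", "rate": "40"},
    {"name": "user_c", "rate": "50"},
]
print(star_distribution(sample))
```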