Scraping Maoyan Movie Data with Python to Analyze 《無名之輩》

Updated: 2020-07-24 17:29:27 Author: 有趣的Python

This article walks through scraping Maoyan movie data with Python and analyzing it for 《無名之輩》. It includes example code throughout and may be a useful reference for study or work.

Preface

Author: 羅昭成


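Fetching data from the Maoyan API

For a programmer who spends most of his time at home, capturing network traffic comes naturally. Viewing the page source in Chrome shows the endpoint clearly. Its address is: http://m.maoyan.com/mmdb/comments/movie/1208282.json?_v_=yes&offset=15

In Python, the requests library makes it easy to send the network request and get the response back: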

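import requests

def getMoveinfo(url):
    # Reuse one session and send a mobile User-Agent to match the m.maoyan.com site.
    session = requests.Session()
    headers = {
        "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X)"
    }
    response = session.get(url, headers=headers)
    # Return the raw JSON text on success, otherwise None.
    if response.status_code == 200:
        return response.text
    return None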
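
The endpoint above takes an `offset` query parameter, so further comments can presumably be fetched by stepping that value. Below is a minimal sketch of building the page URLs. It assumes, without confirmation from the API, that each page holds 15 comments and that `offset` advances by that amount; `build_comment_urls` and `page_size` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "http://m.maoyan.com/mmdb/comments/movie/{movie_id}.json"


def build_comment_urls(movie_id, pages, page_size=15):
    """Return one URL per page, stepping `offset` by `page_size`.

    `page_size=15` is an assumption taken from the offset seen in the
    example URL, not a documented API value.
    """
    urls = []
    for page in range(pages):
        query = urlencode({"_v_": "yes", "offset": page * page_size})
        urls.append(BASE_URL.format(movie_id=movie_id) + "?" + query)
    return urls


urls = build_comment_urls(1208282, pages=3)
for url in urls:
    print(url)
```

Each URL can then be passed to the fetch function shown above, ideally with a short pause between requests.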

The request above returns the endpoint's data. The response carries a lot of information, much of which we do not need. First, an overall look at the returned data:

{
 "cmts":[
  {
   "approve":0,
   "approved":false,
   "assistAwardInfo":{
    "avatar":"",
    "celebrityId":0,
    "celebrityName":"",
    "rank":0,
    "title":""
   },
   "authInfo":"",
   "cityName":"貴陽",
   "content":"必須十分，借錢都要看的一部電影。",
   "filmView":false,
   "id":1045570589,
   "isMajor":false,
   "juryLevel":0,
   "majorType":0,
   "movieId":1208282,
   "nick":"nick",
   "nickName":"nickName",
   "oppose":0,
   "pro":false,
   "reply":0,
   "score":5,
   "spoiler":0,
   "startTime":"2018-11-22 23:52:58",
   "supportComment":true,
   "supportLike":true,
   "sureViewed":1,
   "tagList":{
    "fixed":[
     {
      "id":1,
      "name":"好評"
     },
     {
      "id":4,
      "name":"購票"
     }
    ]
   },
   "time":"2018-11-22 23:52",
   "userId":1871534544,
   "userLevel":2,
   "videoDuration":0,
   "vipInfo":"",
   "vipType":0
  }
 ]
}

Out of all this data, we are only interested in the following fields:

nickName, cityName, content, startTime, score

Next comes the important data-processing step: parsing the required fields out of the JSON data:

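import json

def parseInfo(data):
    # Decode the response text and keep only the comment list under 'cmts'.
    data = json.loads(data)['cmts']
    for item in data:
        # Yield just the five fields used in the analysis.
        yield {
            'date': item['startTime'],
            'nickname': item['nickName'],
            'city': item['cityName'],
            'rate': item['score'],
            'comment': item['content']
        }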
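
To see the parsing step in isolation, here is a small self-contained sketch that runs the same field selection over a trimmed copy of the sample response. `parse_comments` and `SAMPLE` are illustrative names, and the `.get("cmts", [])` fallback is an added assumption for responses that lack the key.

```python
import json

# A trimmed copy of the sample response shown earlier.
SAMPLE = """
{"cmts": [{"nickName": "nickName", "cityName": "貴陽",
           "content": "必須十分，借錢都要看的一部電影。",
           "score": 5, "startTime": "2018-11-22 23:52:58", "userLevel": 2}]}
"""


def parse_comments(text):
    """Yield only the five fields of interest from a raw JSON response."""
    for item in json.loads(text).get("cmts", []):
        yield {
            "date": item["startTime"],
            "nickname": item["nickName"],
            "city": item["cityName"],
            "rate": item["score"],
            "comment": item["content"],
        }


rows = list(parse_comments(SAMPLE))
print(rows[0]["city"], rows[0]["rate"])
```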

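Once we have the data, we can start analyzing it. To avoid hitting Maoyan repeatedly, though, the data should be stored first. Here the author uses SQLite3; keeping the data in a database makes later processing more convenient. The code for storing the data is as follows: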

import sqlite3

def saveCommentInfo(moveId, nikename, comment, rate, city, start_time):
    # Assumes unknow_name.db already has a six-column `comments` table.
    conn = sqlite3.connect('unknow_name.db')
    conn.text_factory = str
    cursor = conn.cursor()
    ins = "insert into comments values (?,?,?,?,?,?)"
    v = (moveId, nikename, comment, rate, city, start_time)
    cursor.execute(ins, v)
    cursor.close()
    conn.commit()
    conn.close()
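
The insert statement needs a `comments` table with six columns, which the article does not show. The sketch below is one possible schema plus a batched insert. The column names and types are assumptions chosen to line up with the insert order and with the `city` column used in the later query, and `save_comments` is a hypothetical helper.

```python
import sqlite3

# Hypothetical schema: six columns in the same order as the insert's values.
SCHEMA = """
CREATE TABLE IF NOT EXISTS comments (
    movie_id   INTEGER,
    nickname   TEXT,
    comment    TEXT,
    rate       REAL,
    city       TEXT,
    start_time TEXT
)
"""


def save_comments(conn, rows):
    """Insert many (movie_id, nickname, comment, rate, city, start_time) rows."""
    conn.execute(SCHEMA)
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO comments VALUES (?,?,?,?,?,?)", rows)


conn = sqlite3.connect(":memory:")  # in-memory DB so the sketch leaves no file
save_comments(conn, [
    (1208282, "nickName", "a comment", 5, "Guiyang", "2018-11-22 23:52:58"),
    (1208282, "nick", "another comment", 4.5, "Beijing", "2018-11-23 08:00:00"),
])
saved = conn.execute("SELECT count(*) FROM comments").fetchone()[0]
print(saved)
```

Reusing one connection and inserting in batches avoids opening and closing the database file once per comment.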

Data processing

Because the data is stored in a database, we can query for results directly with SQL, for example the five cities with the most comments:

SELECT city, count(*) rate_count FROM comments GROUP BY city ORDER BY rate_count DESC LIMIT 5

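The result is as follows:

From the data above, we can see that Beijing has the most comments.

Beyond that, more SQL statements can query for other results, such as the number of people who gave each rating and the share each rating accounts for. Interested readers can try querying the data themselves; it is that simple.

To present the data better, we use the Pyecharts library for visualization.

Based on the data from Maoyan, we plot the comment counts by geographic location on a map of China directly with Pyecharts: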
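
As one way to run the per-rating query just mentioned, the sketch below counts the comments for each rating and computes each rating's share of the total. It uses an in-memory table with made-up rows and only the `city` and `rate` columns, so the numbers are illustrative rather than Maoyan results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (city TEXT, rate REAL)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [("A", 5), ("A", 5), ("B", 4), ("C", 5), ("C", 3)],  # made-up rows
)

# Count per rating, plus that count as a percentage of all comments.
QUERY = """
SELECT rate,
       count(*) AS rate_count,
       round(count(*) * 100.0 / (SELECT count(*) FROM comments), 1) AS percent
FROM comments
GROUP BY rate
ORDER BY rate DESC
"""
rate_rows = conn.execute(QUERY).fetchall()
for rate, rate_count, percent in rate_rows:
    print(rate, rate_count, percent)
```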

import pandas as pd
from pyecharts import Geo  # legacy pyecharts 0.5.x API, as the keyword arguments suggest

# f is assumed to be the path of the exported comment file, with fields separated by '{'.
data = pd.read_csv(f, sep='{', header=None, encoding='utf-8',
                   names=['date', 'nickname', 'city', 'rate', 'comment'])
# Per-city mean rating and comment count.
city = data.groupby(['city'])
city_com = city['rate'].agg(['mean', 'count'])
city_com.reset_index(inplace=True)
data_map = [(city_com['city'][i], city_com['count'][i]) for i in range(0, city_com.shape[0])]
geo = Geo("GEO location analysis", title_pos="center", width=1200, height=800)
while True:
    try:
        attr, val = geo.cast(data_map)
        geo.add("", attr, val, visual_range=[0, 300], visual_text_color="#fff",
                symbol_size=10, is_visualmap=True, maptype='china')
    except ValueError as e:
        # Drop the city that has no coordinates, then retry.
        missing = str(e).split("No coordinate is specified for ")[1]
        data_map = [item for item in data_map if item[0] != missing]
    else:
        break
geo.render('geo_city_location.html')

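Note: some cities in the Maoyan data have no matching coordinates in the map data shipped with Pyecharts. In the code, any city that causes an error when added to the GEO chart is simply removed, which filters out a fair amount of data.

With Python, producing the following map is that simple:

The visualization shows that people who both watch the film and comment are concentrated in eastern China, with Beijing, Shanghai, Chengdu, and Shenzhen leading. The map conveys a lot, but it is still not intuitive enough; to see the distribution by province or municipality, the data needs further processing.

The city field in the Maoyan data also includes county-level places, so the data needs one conversion pass: map every county to its province or municipality, then sum the comment counts within each province or municipality to get the final result.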

import json
import pandas as pd

def getRealName(name, jsonObj):
    # Map a city or county name to its province-level name via prefix match.
    for item in jsonObj:
        if item.startswith(name):
            return jsonObj[item]
    return name

def realKeys(name):
    # Strip administrative suffixes so names match the map's region names.
    return (name.replace(u"省", "").replace(u"市", "")
            .replace(u"回族自治區", "").replace(u"維吾爾自治區", "")
            .replace(u"壯族自治區", "").replace(u"自治區", ""))

data = pd.read_csv(f, sep='{', header=None, encoding='utf-8',
                   names=['date', 'nickname', 'city', 'rate', 'comment'])
city = data.groupby(['city'])
city_com = city['rate'].agg(['mean', 'count'])
city_com.reset_index(inplace=True)
# citys.json holds the name mapping as a JSON object on its first line (UTF-8 assumed).
with open("citys.json", 'r', encoding='utf-8') as fo:
    citysJson = json.loads(fo.readline())
data_map_all = [(getRealName(city_com['city'][i], citysJson), city_com['count'][i])
                for i in range(0, city_com.shape[0])]
# Sum the comment counts of all places that fall in the same province.
data_map_list = {}
for item in data_map_all:
    data_map_list[item[0]] = data_map_list.get(item[0], 0) + item[1]
data_map = [(realKeys(key), data_map_list[key]) for key in data_map_list.keys()]
After the processing above, we use the map chart provided by Pyecharts to generate a map displayed by province or municipality:

from pyecharts import Map  # legacy pyecharts 0.5.x API, as the keyword arguments suggest

def generateMap(data_map):
    # Named province_map to avoid shadowing the built-in map().
    province_map = Map("Comment count by city", width=1200, height=800, title_pos="center")
    while True:
        try:
            attr, val = province_map.cast(data_map)
            province_map.add("", attr, val, visual_range=[0, 800],
                             visual_text_color="#fff", symbol_size=5,
                             is_visualmap=True, maptype='china',
                             is_map_symbol_show=False, is_label_show=True, is_roam=False)
        except ValueError as e:
            # Drop the entry the chart cannot place, then retry.
            missing = str(e).split("No coordinate is specified for ")[1]
            data_map = [item for item in data_map if item[0] != missing]
        else:
            break
    province_map.render('city_rate_count.html')

Of course, we can also visualize the number of people who gave each rating; a bar chart is used here:

import pandas as pd
from pyecharts import Bar  # legacy pyecharts 0.5.x API, as the keyword arguments suggest

data = pd.read_csv(f, sep='{', header=None, encoding='utf-8',
                   names=['date', 'nickname', 'city', 'rate', 'comment'])
# Group by rating and count the comments in each group.
rateData = data.groupby(['rate'])
rateDataCount = rateData["date"].agg(["count"])
rateDataCount.reset_index(inplace=True)
# Build both axes from the highest rating down to the lowest.
count = rateDataCount.shape[0] - 1
attr = [rateDataCount["rate"][count - i] for i in range(0, rateDataCount.shape[0])]
v1 = [rateDataCount["count"][count - i] for i in range(0, rateDataCount.shape[0])]
bar = Bar("Rating counts")
bar.add("Count", attr, v1, is_stack=True, xaxis_rotate=30, yaxis_min=4.2,
        xaxis_interval=0, is_splitline_show=True)
bar.render("html/rate_count.html")

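The resulting chart is shown below. In the Maoyan data, five-star ratings account for more than 50%, well above the 34.8% five-star share on Douban.

The audience distribution and rating data above show that viewers like this film a lot. Earlier, we collected the audience comments from Maoyan. Now the author segments the comments with jieba and builds a word cloud with Wordcloud to see viewers' overall impression of 《無名之輩》: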

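import jieba
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

data = pd.read_csv(f, sep='{', header=None, encoding='utf-8',
                   names=['date', 'nickname', 'city', 'rate', 'comment'])
# Join every comment into one string (str() on the column would truncate it), then segment.
comment = jieba.cut(" ".join(data['comment'].astype(str)), cut_all=False)
wl_space_split = " ".join(comment)
# The image acts as a mask that shapes the cloud.
backgroudImage = np.array(Image.open(r"./unknow_3.png"))
stopword = STOPWORDS.copy()
wc = WordCloud(width=1920, height=1080, background_color='white',
               mask=backgroudImage,
               font_path="./Deng.ttf",
               stopwords=stopword, max_font_size=400,
               random_state=50)
wc.generate_from_text(wl_space_split)
plt.imshow(wc)
plt.axis("off")
wc.to_file('unknow_word_cloud.png')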
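
A word cloud sizes each word by how often it occurs, so the underlying step is a frequency count. The rough sketch below shows that count on a hand-written token list standing in for jieba output; the tokens and the stop-word set are made up for illustration.

```python
from collections import Counter

# Made-up tokens standing in for the output of jieba.cut on real comments.
tokens = ["好看", "的", "演技", "好看", "感動", "的", "演技", "好看"]
stopwords = {"的"}  # hypothetical stop-word set

# Count every token that is not a stop word.
freq = Counter(t for t in tokens if t not in stopwords)
top_words = freq.most_common(2)
print(top_words)
```

Checking the most frequent words this way is a quick sanity test before rendering the image, for example to spot filler words that belong in the stop-word set.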

The exported word cloud:
