Python利用lxml模塊爬取豆瓣讀書排行榜的方法與分析

更新時間：2019年04月15日 09:35:42 作者：Berryguo

這篇文章主要給大家介紹了關于Python爬蟲利用lxml模塊爬取豆瓣讀書排行榜的相關資料，文中通過示例代碼介紹的非常詳細，對大家學習或者使用Python具有一定的參考學習價值，需要的朋友們下面來一起學習學習吧

前言

上次使用了BeautifulSoup庫爬取電影排行榜，爬取相對來說有點麻煩，爬取的速度也較慢。本次使用的lxml庫，我個人是最喜歡的，爬取的語法很簡單，爬取速度也快。

本次爬取的豆瓣書籍排行榜的首頁地址是：

https://www.douban.com/doulist/1264675/?start=0&sort=time&playable=0&sub_type=

該排行榜一共有22頁，且發(fā)現(xiàn)更改網(wǎng)址的 start=0 的 0 為25、50就可以跳到排行榜的第二、第三頁，所以后面只需更改這個數(shù)字然后通過遍歷就可以爬取整個排行榜的書籍信息。

本次爬取的內容有書名、評分、評價數(shù)、出版社、出版年份以及書籍封面圖，封面圖保存為圖片，其他數(shù)據(jù)存為csv文件，方面后面讀取分析。

本次的項目步驟：一、分析網(wǎng)頁，確定爬取數(shù)據(jù)

　　　　　　　　二、使用lxml庫爬取內容并保存

　　　　　　　　三、讀取數(shù)據(jù)并選擇部分內容進行分析

步驟一：

分析網(wǎng)頁源代碼可以看到，書籍信息在屬性為的div標簽中,打開發(fā)現(xiàn)，我們需要爬取的信息都在標簽內部，通過xpath語法我們可以很簡便的爬取所需內容。

(書籍各類信息所在標簽）

所需爬取的內容在 class為post、title、rating、abstract的div標簽中。

步驟二：

先定義爬取函數(shù)，爬取所需內容執(zhí)行函數(shù)，并存入csv文件

具體代碼如下：　　

import requests
from lxml import etree
import time
import csv

#信息頭
headers = {
 'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}

#定義爬取函數(shù)
def douban_booksrank(url):
 res = requests.get(url, headers=headers)
 selector = etree.HTML(res.text)
 contents = selector.xpath('//div[@class="article"]/div[contains(@class,"doulist-item")]') #循環(huán)點
 for content in contents:
 try:
 title = content.xpath('div/div[2]/div[3]/a/text()')[0] #書名
 scores = content.xpath('div/div[2]/div[4]/span[2]/text()') #評分
 scores.append('9.0') #因為有一些書沒有評分，導致列表為空，此處添加一個默認評分，若無評分則默認為9.0
 score = scores[0]
 comments = content.xpath('div/div[2]/div[4]/span[3]/text()')[0] #評論數(shù)量
 author = content.xpath('div/div[2]/div[5]/text()[1]')[0] #作者
 publishment = content.xpath('div/div[2]/div[5]/text()[2]')[0] #出版社
 pub_year = content.xpath('div/div[2]/div[5]/text()[3]')[0] #出版時間
 img_url = content.xpath('div/div[2]/div[2]/a/img/@src')[0] #書本圖片的網(wǎng)址
 img = requests.get(img_url) #解析圖片網(wǎng)址，為下面下載圖片
 img_name_file = 'C:/Users/lenovo/Desktop/douban_books/{}.png'.format((title.strip())[:3]) #圖片存儲位置，圖片名只取前3
 #寫入csv
 with open('C:\\Users\lenovo\Desktop\\douban_books.csv', 'a+', newline='', encoding='utf-8')as fp: #newline 使不隔行
 writer = csv.writer(fp)
 writer.writerow((title, score, comments, author, publishment, pub_year, img_url))
 #下載圖片，為防止圖片名導致格式錯誤，加入try...except
 try:
 with open(img_name_file, 'wb')as imgf:
  imgf.write(img.content)
 except FileNotFoundError or OSError:
 pass
 time.sleep(0.5) #睡眠0.5s
 except IndexError:
 pass
#執(zhí)行程序
if __name__=='__main__':
 #爬取所有書本，共22頁的內容
 urls = ['https://www.douban.com/doulist/1264675/?start={}&sort=time&playable=0&sub_type='.format(str(i))for i in range(0,550,25)]
 #寫csv首行
 with open('C:\\Users\lenovo\Desktop\\douban_books.csv', 'a+', newline='', encoding='utf-8')as f:
 writer = csv.writer(f)
 writer.writerow(('title', 'score', 'comment', 'author', 'publishment', 'pub_year', 'img_url'))
 #遍歷所有網(wǎng)頁，執(zhí)行爬取程序
 for url in urls:
 douban_booksrank(url)

爬取結果截圖如下：

步驟三：

本次使用Python常用的數(shù)據(jù)分析庫pandas來提取所需內容。pandas的read_csv()函數(shù)可以讀取csv文件并根據(jù)文件格式轉換為Series、DataFrame或面板對象。

此處我們提取的數(shù)據(jù)轉變?yōu)镈ataFrame（數(shù)據(jù)幀）對象，然后通過Matplotlib繪圖庫來進行繪圖。

具體代碼如下：

from matplotlib import pyplot as plt
import pandas as pd
import re

plt.rcParams['font.sans-serif']=['SimHei'] #用來正常顯示中文標簽
plt.rcParams['axes.unicode_minus']=False #用來正常顯示負號
plt.subplots_adjust(wsapce=0.5, hspace=0.5) #調整subplot子圖間的距離

pd.set_option('display.max_rows', None) #設置使dataframe 所有行都顯示

df = pd.read_csv('C:\\Users\lenovo\Desktop\\douban_books.csv') #讀取csv文件，并賦為dataframe對象

comment = re.findall('\((.*?)人評價', str(df.comment), re.S) #使用正則表達式獲取評論人數(shù)
#將comment的元素化為整型
new_comment = []
for i in comment:
 new_comment.append(int(i))

pub_year = re.findall(r'\d{4}', str(df.pub_year),re.S) #獲取書籍出版年份
#同上
new_pubyear = []
for n in pub_year:
 new_pubyear.append(int(n))

#繪圖
#1、繪制書籍評分范圍的直方圖
plt.subplot(2,2,1)
plt.hist(df.score, bins=16, edgecolor='black')
plt.title('豆瓣書籍排行榜評分分布', fontweight=700)
plt.xlabel('scores')
plt.ylabel('numbers')

#繪制書籍評論數(shù)量的直方分布圖
plt.subplot(222)
plt.hist(new_comment, bins=16, color='green', edgecolor='yellow')
plt.title('豆瓣書籍排行榜評價分布', fontweight=700)
plt.xlabel('評價數(shù)')
plt.ylabel('書籍數(shù)量（單位/本）')

#繪制書籍出版年份分布圖
plt.subplot(2,2,3)
plt.hist(new_pubyear, bins=30, color='indigo',edgecolor='blue')
plt.title('書籍出版年份分布', fontweight=700)
plt.xlabel('出版年份/year')
plt.ylabel('書籍數(shù)量/本')

#尋找關系
plt.subplot(224)
plt.bar(new_pubyear,new_comment, color='red', edgecolor='white')
plt.title('書籍出版年份與評論數(shù)量的關系', fontweight=700)
plt.xlabel('出版年份/year')
plt.ylabel('評論數(shù)')

plt.savefig('C:\\Users\lenovo\Desktop\\douban_books_analysis.png') #保存圖片
plt.show()

這里需要注意的是，使用了正則表達式來提取評論數(shù)和出版年份，將其中的符號和文字等剔除。

分析結果如下：