A Python book-information scraper example
A Python example of scraping book information, shared for your reference; the details are as follows.
Background
We need to collect some book information. Using Douban book entries as the source, the scraper extracts the useful fields of each book and saves them to a local database.
Getting the book category tags
The full tag list can be found at:
https://book.douban.com/tag/?view=type
Save these tag links to a local file, one per line, with content like this:
https://book.douban.com/tag/小說
https://book.douban.com/tag/外國(guó)文學(xué)
https://book.douban.com/tag/文學(xué)
https://book.douban.com/tag/隨筆
https://book.douban.com/tag/中國(guó)文學(xué)
https://book.douban.com/tag/經(jīng)典
https://book.douban.com/tag/日本文學(xué)
https://book.douban.com/tag/散文
https://book.douban.com/tag/村上春樹
https://book.douban.com/tag/詩(shī)歌
https://book.douban.com/tag/童話
......
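The tag file itself can be generated by pulling every `/tag/...` link out of the tag-overview page. A minimal sketch of that extraction, using only the standard library's `re` module and a made-up HTML snippet standing in for the real response:

```python
import re

# A stand-in for the HTML returned by https://book.douban.com/tag/?view=type
sample_html = (
    '<a href="/tag/小說">小說</a>'
    '<a href="/tag/外國(guó)文學(xué)">外國(guó)文學(xué)</a>'
    '<a href="/about">about</a>'
)

def extract_tag_links(html, base="https://book.douban.com"):
    # Collect every href that points under /tag/ and make it absolute
    return [base + h for h in re.findall(r'href="(/tag/[^"]+)"', html)]

for link in extract_tag_links(sample_html):
    print(link)
```

In practice you would fetch the page with `requests.get` and write the returned list to `book_tags.txt`, one URL per line, which is the format `main()` below expects.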
Fetching book information and saving it to the local database
Assume the following MySQL table has already been created:
CREATE TABLE `book_info` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `bookid` varchar(64) NOT NULL COMMENT 'book ID',
  `tag` varchar(32) DEFAULT '' COMMENT 'category tag',
  `bookname` varchar(256) NOT NULL COMMENT 'title',
  `subname` varchar(256) NOT NULL COMMENT 'subtitle',
  `author` varchar(256) DEFAULT '' COMMENT 'author',
  `translator` varchar(256) DEFAULT '' COMMENT 'translator',
  `press` varchar(128) DEFAULT '' COMMENT 'publisher',
  `publishAt` date DEFAULT '0000-00-00' COMMENT 'publication date',
  `stars` float DEFAULT '0' COMMENT 'rating',
  `price_str` varchar(32) DEFAULT '' COMMENT 'price string',
  `hotcnt` int(11) DEFAULT '0' COMMENT 'rating count',
  `bookdesc` varchar(8192) DEFAULT NULL COMMENT 'description',
  `updateAt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'last modified',
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_bookid` (`bookid`),
  KEY `idx_bookname` (`bookname`),
  KEY `hotcnt` (`hotcnt`),
  KEY `stars` (`stars`),
  KEY `idx_tag` (`tag`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='book information';
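The scraper below batches all rows for a page into a single `insert ignore` statement: one group of 12 `%s` placeholders per book (matching the 12 inserted columns of the table above), joined with commas. That string construction can be sketched on its own:

```python
def build_insert_sql(nrows, ncols=12):
    # One "(%s,%s,...,%s)" group per row, comma-joined into a single statement
    placeholders = ",".join(["(%s)" % ",".join(["%s"] * ncols)] * nrows)
    return ("insert ignore into book_info "
            "(`bookid`,`tag`,`author`,`translator`,`bookname`,`subname`,`press`,"
            "`publishAt`,`price_str`,`stars`,`hotcnt`,`bookdesc`) values "
            + placeholders)

print(build_insert_sql(2))
```

`insert ignore` quietly skips rows whose `bookid` collides with the unique key `idx_bookid`, so re-crawling a tag does not create duplicates.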
The scraper logic, built mainly on the BeautifulSoup package, is implemented as follows:
#!/usr/bin/python
# coding: utf-8
import re
import logging
import requests
import pymysql
import random
import time
import datetime
from hashlib import md5
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO,
                    format='[%(levelname)s][%(name)s][%(asctime)s]%(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S')
class DestDB:
    Host  = "192.168.1.10"
    DB    = "spider"
    Table = "book_info"
    User  = "test"
    Pwd   = "123456"
def connect_db(host, db, user, pwd):
    conn = pymysql.connect(
        host=host,
        user=user,
        passwd=pwd,
        db=db,
        charset='utf8',
        connect_timeout=3600)  # cursorclass=pymysql.cursors.DictCursor
    conn.autocommit(True)
    return conn

def disconnect_db(conn, cursor):
    cursor.close()
    conn.close()
# Extract the rating count; if fewer than 10 people rated the book, treat it as 10
def hotratings(person):
    try:
        ptext = person.get_text().split()[0]
        pc = int(ptext[1:len(ptext)-4])
    except ValueError:
        pc = 10
    return pc
# Persist a batch of books to the database
def save_to_db(tag, book_reslist):
    dest_conn = connect_db(DestDB.Host, DestDB.DB, DestDB.User, DestDB.Pwd)
    dest_cursor = dest_conn.cursor()
    isql = "insert ignore into book_info "
    isql += "(`bookid`,`tag`,`author`,`translator`,`bookname`,`subname`,`press`,"
    isql += "`publishAt`,`price_str`,`stars`,`hotcnt`,`bookdesc`) values "
    isql += ",".join(["(%s)" % ",".join(['%s']*12)]*len(book_reslist))
    values = []
    for row in book_reslist:
        # For now, use md5(author_bookname) as the unique bookid
        bookid = md5(("%s_%s" % (row[0], row[2])).encode('utf-8')).hexdigest()
        values.extend([bookid, tag] + row[:10])
    dest_cursor.execute(isql, tuple(values))
    disconnect_db(dest_conn, dest_cursor)
# Parse one result page
def do_parse(tag, url):
    page_data = requests.get(url)
    soup = BeautifulSoup(page_data.text, "lxml")
    # Extract the tag name from the URL
    tag = url.split("?")[0].split("/")[-1]
    # Author / publisher info
    details = soup.select("#subject_list > ul > li > div.info > div.pub")
    # Ratings
    scores = soup.select("#subject_list > ul > li > div.info > div.star.clearfix > span.rating_nums")
    # Rating counts
    persons = soup.select("#subject_list > ul > li > div.info > div.star.clearfix > span.pl")
    # Titles
    booknames = soup.select("#subject_list > ul > li > div.info > h2 > a")
    # Descriptions
    descs = soup.select("#subject_list > ul > li > div.info > p")
    # Pull the individual fields out of each entry
    book_reslist = []
    for detail, score, personCnt, bookname, desc in zip(details, scores, persons, booknames, descs):
        try:
            subtitle = ""
            title_strs = [s.replace('\n', '').strip() for s in bookname.strings]
            title_strs = [s for s in title_strs if s]
            # Some books carry a subtitle
            if not title_strs:
                continue
            elif len(title_strs) >= 2:
                bookname, subtitle = title_strs[:2]
            else:
                bookname = title_strs[0]
            # Rating count
            hotcnt = hotratings(personCnt)
            desc = desc.get_text()
            stars = float('%.1f' % float(score.get_text() if score.get_text() else "-1"))
            author, translator, press, publishAt, price = [""]*5
            detail_texts = detail.get_text().replace('\n', '').split("/")
            detail_texts = [s.strip() for s in detail_texts]
            # Some books have no translator
            if len(detail_texts) == 4:
                author, press, publishAt, price = detail_texts[:4]
            elif len(detail_texts) >= 5:
                author, translator, press, publishAt, price = detail_texts[:5]
            else:
                continue
            # Convert the publication date to a date object
            if re.match(r'^\d{4}-\d{1,2}', publishAt):
                dts = publishAt.split('-')
                publishAt = datetime.date(int(dts[0]), int(dts[1]), 1)
            else:
                publishAt = datetime.date(1000, 1, 1)
            book_reslist.append([author, translator, bookname, subtitle, press,
                                 publishAt, price, stars, hotcnt, desc])
        except Exception as e:
            logging.error(e)
    logging.info("insert count: %d" % len(book_reslist))
    if len(book_reslist) > 0:
        save_to_db(tag, book_reslist)
    return len(details)
def main():
    with open("book_tags.txt") as fd:
        tags = fd.readlines()
    for tag in tags:
        tag = tag.strip()
        logging.info("current tag url: %s" % tag)
        for idx in range(0, 1000000, 20):
            try:
                url = "%s?start=%d&type=T" % (tag, idx)
                cnt = do_parse(tag.split('/')[-1], url)
                if cnt < 10:
                    break
                # Sleep a few seconds to keep the request rate low
                time.sleep(random.randint(10, 15))
            except Exception as e:
                logging.warning("outer_err: %s" % e)
                time.sleep(300)

if __name__ == "__main__":
    main()
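The trickiest part of `do_parse` is splitting the `div.pub` text, which follows an "author / translator / publisher / date / price" layout with the translator sometimes missing. That splitting and date conversion can be exercised in isolation on sample strings (the sample entries here are made up for illustration):

```python
import re
import datetime

def parse_pub(text):
    # Split on "/" and strip whitespace, mirroring do_parse
    parts = [s.strip() for s in text.replace('\n', '').split('/')]
    author = translator = press = publish = price = ""
    if len(parts) == 4:        # no translator field
        author, press, publish, price = parts
    elif len(parts) >= 5:
        author, translator, press, publish, price = parts[:5]
    # Convert "YYYY-M" style dates; anything else falls back to a sentinel date
    if re.match(r'^\d{4}-\d{1,2}', publish):
        y, m = publish.split('-')[:2]
        publish = datetime.date(int(y), int(m), 1)
    else:
        publish = datetime.date(1000, 1, 1)
    return author, translator, press, publish, price

print(parse_pub("[日] 村上春樹 / 林少華 / 上海譯文出版社 / 2001-8 / 18.80元"))
print(parse_pub("錢鍾書 / 人民文學(xué)出版社 / 1991-2 / 19.00"))
```

Entries with fewer than four fields are skipped entirely in `do_parse`, since too little information survives to be worth storing.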
Summary
The code above targets a Python 3 environment.
BeautifulSoup must be installed first: pip install bs4
Keep the request rate under control while crawling.
Some fields need exception handling, such as the translator field and the rating count.
That is all for this article. I hope it helps with your studies, and please continue to support 腳本之家.