Crawling Duzhe magazine with Python and turning it into a PDF
After learning BeautifulSoup, I built a small web crawler that scrapes Duzhe (讀者) magazine and typesets the articles into a PDF with reportlab.
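For context, the crawler below leans on just two BeautifulSoup calls: find (first match) and find_all (all matches). A minimal warm-up sketch, using a made-up HTML snippet:

#coding=utf-8
# BeautifulSoup warm-up; the HTML snippet here is made up for illustration.
from bs4 import BeautifulSoup

html = "<html><h1>Title</h1><div id='pub_date'> 2015-01 </div></html>"
soup = BeautifulSoup(html)
print soup.find("h1").string                   # u'Title'
print soup.find(id="pub_date").string.strip()  # u'2015-01'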
crawler.py
#!/usr/bin/env python
#coding=utf-8
"""
Author: Anemone
Filename: getmain.py
Last modified: 2015-02-19 16:47
E-mail: anemone@82flex.com
"""
import urllib2
from bs4 import BeautifulSoup
import re
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

def getEachArticle(url):
    # e.g. url = 'http://www.52duzhe.com/2015_01/duzh20150104.html'
    response = urllib2.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html)
    title = soup.find("h1").string
    writer = soup.find(id="pub_date").string.strip()
    _from = soup.find(id="media_name").string.strip()
    # the article body sits right after the BAIDU_CLB share-widget script
    text = soup.get_text()
    main = re.split("BAIDU_CLB.*;", text)
    result = {"title": title, "writer": writer, "from": _from, "context": main[1]}
    return result

def getCatalog(issue):
    # base URL reconstructed from the sample article URL above:
    # issue "201501" lives under the directory "2015_01"
    url = "http://www.52duzhe.com/{0}_{1}/".format(issue[:4], issue[4:])
    firstUrl = url + "index.html"
    duzhe = dict()
    response = urllib2.urlopen(firstUrl)
    html = response.read()
    soup = BeautifulSoup(html)
    # the cover page links to the first article page, which carries the catalog
    firstUrl = url + soup.table.a.get("href")
    response = urllib2.urlopen(firstUrl)
    html = response.read()
    soup = BeautifulSoup(html)
    all = soup.find_all("h2")
    for i in all:
        print i.string
        duzhe[i.string] = list()
        for link in i.parent.find_all("a"):
            href = url + link.get("href")
            print href
            while 1:
                try:
                    article = getEachArticle(href)
                    break
                except:
                    continue
            duzhe[i.string].append(article)
    return duzhe

def readDuZhe(duzhe):
    # print every article title, grouped by column
    for eachColumn in duzhe:
        for eachArticle in duzhe[eachColumn]:
            print eachArticle["title"]

if __name__ == '__main__':
    # issue = raw_input("issue(201501):")
    readDuZhe(getCatalog("201424"))
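One caveat in getCatalog: the while 1/try/except loop retries forever on any error, so a single dead link can hang the whole crawl. A minimal sketch of a bounded alternative, with a hypothetical helper name fetchWithRetry, that drops an article after a few failed attempts:

# Bounded-retry sketch (fetchWithRetry is not part of the original script):
# give up and return None after a few attempts instead of looping forever.
def fetchWithRetry(href, attempts=3):
    for _ in range(attempts):
        try:
            return getEachArticle(href)
        except Exception:
            continue
    return None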
getpdf.py
#!/usr/bin/env python
#coding=utf-8
"""
Author: Anemone
Filename: writetopdf.py
Last modified: 2015-02-20 19:19
E-mail: anemone@82flex.com
"""
import copy
import reportlab.rl_config
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.lib import fonts
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import Paragraph, SimpleDocTemplate, flowables
import crawler

def writePDF(issue, duzhe):
    reportlab.rl_config.warnOnMissingFontGlyphs = 0
    # register CJK fonts: SimSun for body text, Microsoft YaHei for bold
    pdfmetrics.registerFont(TTFont('song', "simsun.ttc"))
    pdfmetrics.registerFont(TTFont('hei', "msyh.ttc"))
    fonts.addMapping('song', 0, 0, 'song')
    fonts.addMapping('song', 0, 1, 'song')
    fonts.addMapping('song', 1, 0, 'hei')
    fonts.addMapping('song', 1, 1, 'hei')
    stylesheet = getSampleStyleSheet()
    normalStyle = copy.deepcopy(stylesheet['Normal'])
    normalStyle.fontName = 'song'
    normalStyle.fontSize = 11
    normalStyle.leading = 11
    normalStyle.firstLineIndent = 20
    titleStyle = copy.deepcopy(stylesheet['Normal'])
    titleStyle.fontName = 'song'
    titleStyle.fontSize = 15
    titleStyle.leading = 20
    firstTitleStyle = copy.deepcopy(stylesheet['Normal'])
    firstTitleStyle.fontName = 'song'
    firstTitleStyle.fontSize = 20
    firstTitleStyle.leading = 20
    firstTitleStyle.firstLineIndent = 50
    smallStyle = copy.deepcopy(stylesheet['Normal'])
    smallStyle.fontName = 'song'
    smallStyle.fontSize = 8
    smallStyle.leading = 8
    # first the cover line and a table of contents, one block per column
    story = []
    story.append(Paragraph("<b>讀者{0}期</b>".format(issue), firstTitleStyle))
    for eachColumn in duzhe:
        story.append(Paragraph('__'*28, titleStyle))
        story.append(Paragraph('<b>{0}</b>'.format(eachColumn), titleStyle))
        for eachArticle in duzhe[eachColumn]:
            story.append(Paragraph(eachArticle["title"], normalStyle))
    story.append(flowables.PageBreak())
    # then the articles themselves, one page break after each
    for eachColumn in duzhe:
        for eachArticle in duzhe[eachColumn]:
            story.append(Paragraph("<b>{0}</b>".format(eachArticle["title"]), titleStyle))
            story.append(Paragraph(" {0} {1}".format(eachArticle["writer"], eachArticle["from"]), smallStyle))
            para = eachArticle["context"].split(" ")
            for eachPara in para:
                story.append(Paragraph(eachPara, normalStyle))
            story.append(flowables.PageBreak())
    doc = SimpleDocTemplate("duzhe" + issue + ".pdf")
    print "Writing PDF..."
    doc.build(story)

def main(issue):
    duzhe = crawler.getCatalog(issue)
    writePDF(issue, duzhe)

if __name__ == '__main__':
    issue = raw_input("Enter issue(201501):")
    main(issue)
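A note on the addMapping calls above: they are what let <b> markup inside a Paragraph switch fonts, so regular text renders in SimSun while bold text falls through to Microsoft YaHei. A minimal standalone sketch of the same trick, assuming the same simsun.ttc and msyh.ttc files are available (the output name font_demo.pdf is made up):

#coding=utf-8
# Standalone sketch of the bold-to-hei font mapping; font_demo.pdf is a
# made-up output name for this demo.
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.lib import fonts
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import Paragraph, SimpleDocTemplate

pdfmetrics.registerFont(TTFont('song', "simsun.ttc"))
pdfmetrics.registerFont(TTFont('hei', "msyh.ttc"))
fonts.addMapping('song', 0, 0, 'song')  # regular weight stays SimSun
fonts.addMapping('song', 1, 0, 'hei')   # <b> text switches to YaHei

style = getSampleStyleSheet()['Normal']
style.fontName = 'song'
SimpleDocTemplate("font_demo.pdf").build(
    [Paragraph(u"正文用宋體,<b>粗體映射到微軟雅黑</b>", style)])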
That's all for this article; I hope you find it useful.