Using a Python Crawler to Tally the Gender Ratio on a School BBS: Data Processing (Part 3)
This article covers the data-processing stage of the project.
1. Data Analysis

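The crawler left us with text files whose names start with a handful of fixed prefixes (correct, errTime, notexist, unkownsex and httperror all appear in the code below), and these files now need to be processed.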

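2. Rollback
The records marked httperror need to be processed a second time.
Because of the way the crawler was written (see Part 2 of this series), the same id can show up several times in a row as httperror records in a text file: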
//httperror265001_266001.txt
265002 httperror
265002 httperror
265002 httperror
265002 httperror
265003 httperror
265003 httperror
265003 httperror
265003 httperror
The code therefore has to allow for this case: instead of processing the id on every line, it first checks whether the id repeats the previous one.
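The duplicate check described above can also be sketched with `itertools.groupby`, which collapses runs of equal adjacent values. This is a minimal Python 3 illustration, not the author's code (the article itself targets Python 2); the helper name `unique_ids` and the sample list are assumptions made for the example.

```python
from itertools import groupby


def unique_ids(lines):
    """Yield each id once per run of consecutive records that share it."""
    # Skip blank lines and keep only the first field (the id) of each record.
    ids = (line.split()[0] for line in lines if line.strip())
    # groupby merges adjacent equal ids into one group; we only need the key.
    for user_id, _run in groupby(ids):
        yield int(user_id)


# Hypothetical sample in the same shape as httperror265001_266001.txt
sample = [
    "265002 httperror\n",
    "265002 httperror\n",
    "265003 httperror\n",
    "265003 httperror\n",
]
print(list(unique_ids(sample)))  # [265002, 265003]
```

Like the line-by-line comparison, this only removes adjacent repeats; an id that reappears later in the file would be yielded again.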
Java has buffered readers that avoid hitting the disk for every read; Python has an equivalent as well (the original post linked to an article about it).
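As a rough illustration of buffered reading in Python 3: the built-in `open()` is already buffered by default, and its `buffering` argument sets the size of the underlying chunk buffer. The helper name `iter_lines` and the chosen buffer size are assumptions for this sketch, not something taken from the article.

```python
import io


def iter_lines(path, buffer_size=io.DEFAULT_BUFFER_SIZE * 8):
    """Yield the lines of a text file, reading from disk in large chunks."""
    # A buffering value greater than 1 is the size in bytes of the read buffer,
    # so the disk is touched once per chunk rather than once per line.
    with open(path, "r", encoding="utf-8", buffering=buffer_size) as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Iterating over the file object keeps memory use flat, because only the current buffer is held in memory rather than the whole file.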
import os, re, sys

# searchWeb() is not defined in this snippet; it is assumed to be the
# fetch-and-classify function from Part 2 of this series.
def main():
    # Python 2 only: make utf-8 the default codec for implicit str/unicode conversion
    reload(sys)
    sys.setdefaultencoding('utf-8')
    global sexRe,timeRe,notexistRe,url1,url2,file1,file2,file3,file4,startNum,endNum,file5
    sexRe = re.compile(u'em>\u6027\u522b</em>(.*?)</li')
    timeRe = re.compile(u'em>\u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4</em>(.*?)</li')
    notexistRe = re.compile(u'(p>)\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728<')
    url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
    url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'
    file1 = 'ruisi\\correct_re.txt'
    file2 = 'ruisi\\errTime_re.txt'
    file3 = 'ruisi\\notexist_re.txt'
    file4 = 'ruisi\\unkownsex_re.txt'
    file5 = 'ruisi\\httperror_re.txt'
    # Walk the folder and pick out the text files whose names start with httperror
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        if filename.startswith('httperror'):
            count = 0
            newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
            readFile = open(newName,'r')
            oldLine = '0'
            for line in readFile:
                # newLine is compared with the previous line to detect a repeated id
                newLine = line
                if (newLine != oldLine):
                    nu = newLine.split()[0]
                    oldLine = newLine
                    count += 1
                    searchWeb((int(nu),))
            readFile.close()
            print "%s deal %s lines" %(filename, count)
To keep things simple, this code does not split the httperror ids into categories again; the results are written straight into the following five files:
file1 = 'ruisi\\correct_re.txt'
file2 = 'ruisi\\errTime_re.txt'
file3 = 'ruisi\\notexist_re.txt'
file4 = 'ruisi\\unkownsex_re.txt'
file5 = 'ruisi\\httperror_re.txt'
The output log shows how many httperror records were processed in total.
"D:\Program Files\Python27\python.exe" E:/pythonProject/webCrawler/reload.py
httperror132001-133001.txt deal 21 lines
httperror2001-3001.txt deal 4 lines
httperror251001-252001.txt deal 5 lines
httperror254001-255001.txt deal 1 lines
3. Counting the unkownsex Data with a Single Thread
The code is simple: a single thread counts the unkownsex users, meaning users whose gender could not be fetched for permission reasons or who never filled it in. Our checks also showed that users without a gender have no activity time either.
The data format is as follows:
253042 unkownsex
253087 unkownsex
253102 unkownsex
253118 unkownsex
253125 unkownsex
253136 unkownsex
253161 unkownsex
import os,time
sumCount = 0
startTime = time.clock()
for filename in os.listdir(r'E:\pythonProject\ruisi'):
    if filename.startswith('unkownsex'):
        count = 0
        newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
        # Open each file once and close it when the count is done
        readFile = open(newName,'r')
        for line in readFile:
            count += 1
            sumCount +=1
        readFile.close()
        print "%s deal %s lines" %(filename, count)
print '%s unkowns sex' %(sumCount)
endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"
Processing is fast. The output is:
unkownsex1-1001.txt deal 204 lines
unkownsex100001-101001.txt deal 50 lines
unkownsex10001-11001.txt deal 206 lines
#...intermediate output omitted
unkownsex99001-100001.txt deal 56 lines
unkownsex_re.txt deal 1085 lines
14223 unkowns sex
cost time 0.0813142301261 s
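For readers on Python 3, the same per-file line count can be sketched with `pathlib`. This is an illustrative variant rather than the author's script; the function name `count_lines` is hypothetical, and the directory is passed in as a parameter instead of being hard-coded.

```python
from pathlib import Path


def count_lines(directory, prefix):
    """Return {filename: line_count} for files in directory whose names start with prefix."""
    counts = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.name.startswith(prefix):
            with path.open("r", encoding="utf-8") as handle:
                # Count lines lazily so large files are never loaded whole.
                counts[path.name] = sum(1 for _ in handle)
    return counts
```

A call such as `sum(count_lines(some_dir, "unkownsex").values())` would then give the overall total, corresponding to sumCount in the script above.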
4. Counting the correct Data with a Single Thread
The data format is as follows:
31024 男 2014-11-11 13:20
31283 男 2013-3-25 19:41
31340 保密 2015-2-2 15:17
31427 保密 2014-8-10 09:17
31475 保密 2013-7-2 08:59
31554 保密 2014-10-17 17:02
31621 男 2015-5-16 19:27
31872 保密 2015-1-11 16:49
31915 保密 2014-5-4 11:01
31997 保密 2015-5-16 20:14
The code is below. The idea is to read the files line by line and use line.split() to pull out the gender field. sumCount is the total number of users; boycount, girlcount and secretcount count the users marked male (男), female (女) and secret (保密). The gender field is compared against unicode literals, as in the earlier parts.
import os,sys,time
# Python 2 only: make utf-8 the default codec so the byte strings read from
# the files compare equal to the unicode literals below
reload(sys)
sys.setdefaultencoding('utf-8')
startTime = time.clock()
sumCount = 0
boycount = 0
girlcount = 0
secretcount = 0
for filename in os.listdir(r'E:\pythonProject\ruisi'):
    if filename.startswith('correct'):
        newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
        readFile = open(newName,'r')
        for line in readFile:
            # The second whitespace-separated field is the gender
            sexInfo = line.split()[1]
            sumCount +=1
            if sexInfo == u'\u7537' :          # male
                boycount += 1
            elif sexInfo == u'\u5973':         # female
                girlcount +=1
            elif sexInfo == u'\u4fdd\u5bc6':   # secret
                secretcount +=1
        readFile.close()
        print "until %s, sum is %s boys; %s girls; %s secret;" %(filename, boycount,girlcount,secretcount)
print "total is %s; %s boys; %s girls; %s secret;" %(sumCount, boycount,girlcount,secretcount)
endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"
Note that each line of output is the running total up to and including that file, not the count for that file alone. The output is:
until correct1-1001.txt, sum is 110 boys; 7 girls; 414 secret;
until correct100001-101001.txt, sum is 125 boys; 13 girls; 542 secret;
#...omitted
until correct99001-100001.txt, sum is 11070 boys; 3113 girls; 26636 secret;
until correct_re.txt, sum is 13937 boys; 4007 girls; 28941 secret;
total is 46885; 13937 boys; 4007 girls; 28941 secret;
cost time 3.60047888495 s
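The if/elif chain used for the tally can also be expressed with `collections.Counter`, which counts whatever values appear in the gender field. The sketch below is in Python 3, where strings are unicode and no `setdefaultencoding` workaround is needed; `tally_gender` and the three sample lines are illustrative assumptions.

```python
from collections import Counter


def tally_gender(lines):
    """Count the second whitespace-separated field (the gender) of each line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip blank or malformed lines instead of raising IndexError.
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts


sample = [
    "31024 男 2014-11-11 13:20",
    "31283 男 2013-3-25 19:41",
    "31340 保密 2015-2-2 15:17",
]
counts = tally_gender(sample)
print(counts["男"], counts["女"], counts["保密"])  # 2 0 1
```

A Counter returns 0 for keys it has never seen, so a missing category needs no special handling.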
5. Counting with Multiple Threads
To speed up the counting, we can try multiple threads.
For comparison, we also time the single-threaded version.
# encoding: UTF-8
import threading
import time,os,sys
# Global counters
SUM = 0
BOY = 0
GIRL = 0
SECRET = 0
NUM =0
# Subclass threading.Thread, override run(), and launch the thread with start()
# This is very similar to Java
class StaFileList(threading.Thread):
    # List of text file names
    fileList = []
    def __init__(self, fileList):
        threading.Thread.__init__(self)
        self.fileList = fileList
    def run(self):
        global SUM, BOY, GIRL, SECRET
        # A delay here makes the interleaving more visible, instead of thread-1,2,3 running in order
        #time.sleep(1)
        # acquire takes the lock
        if mutex.acquire(1):
            self.staFiles(self.fileList)
            # release frees the lock
            mutex.release()
    # Process the given list of files and count male and female users
    # Mind the synchronization issue here; global gives access to the module-level counters
    def staFiles(self, files):
        global SUM, BOY, GIRL, SECRET
        for name in files:
            newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
            readFile = open(newName,'r')
            for line in readFile:
                sexInfo = line.split()[1]
                SUM +=1
                if sexInfo == u'\u7537' :
                    BOY += 1
                elif sexInfo == u'\u5973':
                    GIRL +=1
                elif sexInfo == u'\u4fdd\u5bc6':
                    SECRET +=1
            # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
            # " %s secret;" %(self.name, name, SUM, BOY,GIRL,SECRET)
def test():
    # files holds one batch of file names; the batch size sets how many files a thread handles
    files = []
    # Keep every thread so the main thread can wait for all of them at the end
    staThreads = []
    i = 0
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        # Create one thread for every 20 files collected
        if filename.startswith('correct'):
            files.append(filename)
            i+=1
            # One thread handles 20 files
            if i == 20 :
                staThreads.append(StaFileList(files))
                files = []
                i = 0
    # The leftover files, most likely fewer than 20
    if files:
        staThreads.append(StaFileList(files))
    for t in staThreads:
        t.start()
    # Wait in the main thread for every child thread to exit; without this
    # the totals could be printed before the threads have finished
    for t in staThreads:
        t.join()
if __name__ == '__main__':
    reload(sys)
    sys.setdefaultencoding('utf-8')
    startTime = time.clock()
    mutex = threading.Lock()
    test()
    print "Multi Thread, total is %s; %s boys; %s girls; %s secret;" %(SUM, BOY,GIRL,SECRET)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"
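Output:
Multi Thread, total is 46885; 13937 boys; 4007 girls; 28941 secret;
cost time 0.132137192794 s
The time turns out to be about the same as for a single thread. Thread synchronization is involved here: acquiring and releasing the lock both cost time, and so does saving and restoring state when switching between threads. In addition, each thread in this code holds the lock for its entire run(), so the threads effectively execute one after another.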
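One way to shrink the lock overhead described above is to let every thread count into its own private Counter and take the lock only once, when merging. The Python 3 sketch below works on in-memory chunks of lines to stay self-contained; `count_in_threads` and its inner `worker` are hypothetical names, not part of the article's code. CPython's global interpreter lock still lets only one thread execute Python bytecode at a time, so this shortens the time spent holding the lock but does not make the counting itself run in parallel.

```python
import threading
from collections import Counter


def count_in_threads(chunks):
    """Count the gender field of every line, using one thread per chunk of lines."""
    totals = Counter()
    lock = threading.Lock()

    def worker(lines):
        local = Counter()  # private to this thread, so no lock is needed here
        for line in lines:
            fields = line.split()
            if len(fields) >= 2:
                local[fields[1]] += 1
        with lock:  # held only for the short merge step
            totals.update(local)

    threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait for every worker before reading totals
    return totals
```

Using `with lock:` also guarantees the lock is released if the merge raises, which a bare acquire/release pair does not.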
6. Single Thread versus Multiple Threads on More Data
We can process the correct, errTime and unkownsex files together.
Single-threaded code
# coding=utf-8
import os,sys,time
reload(sys)
sys.setdefaultencoding('utf-8')
startTime = time.clock()
sumCount = 0
boycount = 0
girlcount = 0
secretcount = 0
unkowncount = 0
for filename in os.listdir(r'E:\pythonProject\ruisi'):
    # Gender and activity time are both present
    if filename.startswith('correct') :
        newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
        readFile = open(newName,'r')
        for line in readFile:
            sexInfo =line.split()[1]
            sumCount +=1
            if sexInfo == u'\u7537' :
                boycount += 1
            elif sexInfo == u'\u5973':
                girlcount +=1
            elif sexInfo == u'\u4fdd\u5bc6':
                secretcount +=1
        # print "until %s, sum is %s boys; %s girls; %s secret;" %(filename, boycount,girlcount,secretcount)
    # No activity time, but the gender is present
    elif filename.startswith("errTime"):
        newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
        readFile = open(newName,'r')
        for line in readFile:
            sexInfo =line.split()[1]
            sumCount +=1
            if sexInfo == u'\u7537' :
                boycount += 1
            elif sexInfo == u'\u5973':
                girlcount +=1
            elif sexInfo == u'\u4fdd\u5bc6':
                secretcount +=1
        # print "until %s, sum is %s boys; %s girls; %s secret;" %(filename, boycount,girlcount,secretcount)
    # Neither gender nor time: just count the lines
    elif filename.startswith("unkownsex"):
        newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
        # count = len(open(newName,'rU').readlines())
        # For large files use a loop; count starts at -1 to cover an empty file,
        # so the +1 below yields 0 lines
        count = -1
        for count, line in enumerate(open(newName, 'rU')):
            pass
        count += 1
        unkowncount += count
        sumCount += count
        # print "until %s, sum is %s unkownsex" %(filename, unkowncount)
print "Single Thread, total is %s; %s boys; %s girls; %s secret; %s unkownsex;" %(sumCount, boycount,girlcount,secretcount,unkowncount)
endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"
Output:
Single Thread, total is 61111; 13937 boys; 4009 girls; 28942 secret; 14223 unkownsex;
cost time 1.37444645628 s
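Since the series is ultimately after the male-to-female ratio, the totals above can be turned into shares with a few lines of arithmetic. This Python 3 sketch only considers users who disclosed a gender, which is an assumption about how the ratio should be defined (most accounts are marked secret or unknown); `gender_share` is a hypothetical helper name.

```python
def gender_share(boys, girls):
    """Return (boy_share, girl_share) among users who disclosed a gender."""
    disclosed = boys + girls
    if disclosed == 0:
        raise ValueError("no users with a disclosed gender")
    return boys / disclosed, girls / disclosed


# Totals taken from the single-threaded output above
boy_share, girl_share = gender_share(13937, 4009)
print("boys: {:.1%}, girls: {:.1%}".format(boy_share, girl_share))
```

The 28942 secret and 14223 unkownsex accounts are left out of the denominator here, so the result says nothing about their actual split.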
Multithreaded code
__author__ = 'admin'
# encoding: UTF-8
# Multithreaded version of the counting program
import threading
import time,os,sys
# Global counters
SUM = 0
BOY = 0
GIRL = 0
SECRET = 0
UNKOWN = 0
class StaFileList(threading.Thread):
    # List of text file names
    fileList = []
    def __init__(self, fileList):
        threading.Thread.__init__(self)
        self.fileList = fileList
    def run(self):
        global SUM, BOY, GIRL, SECRET
        if mutex.acquire(1):
            self.staManyFiles(self.fileList)
            mutex.release()
    # Process the given list of files and count male and female users
    # Mind the synchronization issue here
    # (run() calls staManyFiles below; this method is not used in this version)
    def staCorrectFiles(self, files):
        global SUM, BOY, GIRL, SECRET
        for name in files:
            newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
            readFile = open(newName,'r')
            for line in readFile:
                sexInfo = line.split()[1]
                SUM +=1
                if sexInfo == u'\u7537' :
                    BOY += 1
                elif sexInfo == u'\u5973':
                    GIRL +=1
                elif sexInfo == u'\u4fdd\u5bc6':
                    SECRET +=1
            # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
            # " %s secret;" %(self.name, name, SUM, BOY,GIRL,SECRET)
    def staManyFiles(self, files):
        global SUM, BOY, GIRL, SECRET,UNKOWN
        for name in files:
            if name.startswith('correct') :
                newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                readFile = open(newName,'r')
                for line in readFile:
                    sexInfo = line.split()[1]
                    SUM +=1
                    if sexInfo == u'\u7537' :
                        BOY += 1
                    elif sexInfo == u'\u5973':
                        GIRL +=1
                    elif sexInfo == u'\u4fdd\u5bc6':
                        SECRET +=1
                # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
                # " %s secret;" %(self.name, name, SUM, BOY,GIRL,SECRET)
            # No activity time, but the gender is present
            elif name.startswith("errTime"):
                newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                readFile = open(newName,'r')
                for line in readFile:
                    sexInfo = line.split()[1]
                    SUM +=1
                    if sexInfo == u'\u7537' :
                        BOY += 1
                    elif sexInfo == u'\u5973':
                        GIRL +=1
                    elif sexInfo == u'\u4fdd\u5bc6':
                        SECRET +=1
                # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
                # " %s secret;" %(self.name, name, SUM, BOY,GIRL,SECRET)
            # Neither gender nor time: just count the lines
            elif name.startswith("unkownsex"):
                newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                # count = len(open(newName,'rU').readlines())
                # For large files use a loop; count starts at -1 to cover an empty file,
                # so the +1 below yields 0 lines
                count = -1
                for count, line in enumerate(open(newName, 'rU')):
                    pass
                count += 1
                UNKOWN += count
                SUM += count
                # print "thread %s, until %s, total is %s; %s unkownsex" %(self.name, name, SUM, UNKOWN)
def test():
    files = []
    # Keep every thread so the main thread can wait for all of them at the end
    staThreads = []
    i = 0
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        # Create one thread for every 20 files collected
        if filename.startswith("correct") or filename.startswith("errTime") or filename.startswith("unkownsex"):
            files.append(filename)
            i+=1
            if i == 20 :
                staThreads.append(StaFileList(files))
                files = []
                i = 0
    # The leftover files, most likely fewer than 20
    if files:
        staThreads.append(StaFileList(files))
    for t in staThreads:
        t.start()
    # Wait in the main thread for every child thread to exit
    for t in staThreads:
        t.join()
if __name__ == '__main__':
    reload(sys)
    sys.setdefaultencoding('utf-8')
    startTime = time.clock()
    mutex = threading.Lock()
    test()
    print "Multi Thread, total is %s; %s boys; %s girls; %s secret; %s unkownsex" %(SUM, BOY,GIRL,SECRET,UNKOWN)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"
Output:
Multi Thread, total is 61111; 13937 boys; 4009 girls; 28942 secret;
cost time 1.23049112201 s
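In this run the multithreaded version (1.23 s) is slightly faster than the single-threaded one (1.37 s), and because access to the counters is synchronized, the statistics are identical.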
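Consistent totals can also be obtained without any shared globals or explicit lock, by having each worker return its own partial count and merging the results in the calling thread. The Python 3 sketch below uses `concurrent.futures.ThreadPoolExecutor`; `count_file`, `count_files` and the `max_workers` default are illustrative choices, and it assumes the second field of every line is the value to count (in the unkownsex files that field is the literal word unkownsex, so those lines land under their own key).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_file(path):
    """Return a Counter of the second field of each line in one file."""
    counts = Counter()
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                counts[fields[1]] += 1
    return counts


def count_files(paths, max_workers=4):
    """Count all files on a thread pool and merge the per-file results."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each path is handled by a worker thread; the merge below runs only
        # in the calling thread, so no lock is required.
        for partial in pool.map(count_file, paths):
            total.update(partial)
    return total
```

Because no state is shared between workers, the result does not depend on the order in which the threads happen to run.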
Note that inside a Python class you constantly have to write self explicitly, which is quite different from Java.
def __init__(self, fileList):
    threading.Thread.__init__(self)
    self.fileList = fileList
def run(self):
    global SUM, BOY, GIRL, SECRET
    if mutex.acquire(1):
        # Calling another method of the same class requires self
        self.staFiles(self.fileList)
        mutex.release()
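A stripped-down illustration of the same point, independent of the threading code: instance attributes and sibling methods are always reached through self. The class `Tally` is a made-up example, written in Python 3.

```python
class Tally(object):
    """Tiny counter class used only to illustrate explicit self."""

    def __init__(self):
        # Instance attributes are created on self, not declared in the class body.
        self.count = 0

    def add(self, amount):
        self.count += amount

    def add_many(self, amounts):
        for amount in amounts:
            # A bare add(amount) would raise NameError; the method lives on self.
            self.add(amount)
        return self.count


print(Tally().add_many([1, 2, 3]))  # 6
```

In Java the equivalent `this.` prefix is optional, which is why the explicit self tends to stand out to Java programmers.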
total is 61111; 13937 boys; 4009 girls; 28942 secret; 14223 unkownsex;
cost time 1.25413238673 s
That is all for this article; we hope it helps with your studies.

