
Python Crawler Series: Targeted Scraping of Hupu Basketball Pictures with Selenium

Updated: 2017-11-15 15:01:02 Author: Eastmount

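This article walks through a targeted crawler that downloads basketball pictures from Hupu using Python and Selenium. It should be a useful reference for anyone interested in the topic.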

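Preface:

As a fan who has watched basketball since childhood, I often browse forums such as Hupu Basketball and Shihuhu. They are full of great pictures: NBA teams, CBA stars, gossip, sneakers, models and so on. Saving them one at a time with right-click and "Save as" wears out the hand, so as a programmer I would rather write a program to do it.

The crawler therefore combines Python, Selenium, regular expressions and urllib to download pictures in bulk.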

Sample runs (tag pages that were crawled):

http://photo.hupu.com/nba/tag/馬刺

http://photo.hupu.com/nba/tag/陳露

Source code:

# -*- coding: utf-8 -*- 
""" 
Crawling pictures by selenium and urllib
url: Hupu, Spurs tag http://photo.hupu.com/nba/tag/%E9%A9%AC%E5%88%BA
url: Hupu, Chen Lu tag http://photo.hupu.com/nba/tag/%E9%99%88%E9%9C%B2
Created on 2015-10-24
@author: Eastmount CSDN 
""" 
 
import time   
import re   
import os 
import sys 
import urllib 
import shutil 
import datetime 
from selenium import webdriver  
from selenium.webdriver.common.keys import Keys  
import selenium.webdriver.support.ui as ui  
from selenium.webdriver.common.action_chains import ActionChains 
 
#Open PhantomJS 
driver = webdriver.PhantomJS(executable_path=r"G:\phantomjs-1.9.1-windows\phantomjs.exe")
#driver = webdriver.Firefox() 
wait = ui.WebDriverWait(driver,10) 
 
#Download one Picture By urllib 
def loadPicture(pic_url, pic_path): 
 pic_name = os.path.basename(pic_url) #strip the path, keep only the file name
 pic_name = pic_name.replace('*','') #drop '*' to avoid the error: invalid mode ('wb') or filename
 urllib.urlretrieve(pic_url, pic_path + pic_name)
 
 
#Crawl every picture page of one gallery, one after another
def getScript(elem_url, path, nums):
 try:
  #Picture pages look like http://photo.hupu.com/nba/p29556-1.html
  #so building http://..../p29556-N.html replaces clicking "next picture"
  count = 1
  t = elem_url.find(r'.html')
  while (count <= nums):
   html_url = elem_url[:t] + '-' + str(count) + '.html'
   #print html_url
   '''
   driver_pic.get(html_url)
   elem = driver_pic.find_element_by_xpath("//div[@class='pic_bg']/div/img")
   url = elem.get_attribute("src")
   '''
   #Take the third <div></div> with a regular expression, then read the picture URL from it
   content = urllib.urlopen(html_url).read()
   start = content.find(r'<div class="flTab">')
   end = content.find(r'<div class="comMark" style>')
   content = content[start:end]
   div_pat = r'<div.*?>(.*?)<\/div>'
   div_m = re.findall(div_pat, content, re.S|re.M)
   #print div_m[2]
   link_list = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", div_m[2])
   #print link_list
   url = link_list[0] #only one url is expected here
   loadPicture(url, path)
   count = count + 1

 except Exception,e: 
  print 'Error:',e 
 finally: 
  print 'Download ' + str(count - 1) + ' pictures\n' 
 
  
#Crawl the gallery URLs and titles from the listing page 
def getTitle(url): 
 try: 
  #Crawl URLs and titles 
  count = 0 
  print 'Function getTitle(url)' 
  driver.get(url) 
  wait.until(lambda driver: driver.find_element_by_xpath("//div[@class='piclist3']"))
  print 'Title: ' + driver.title + '\n'
  
  #Thumbnail url (unused here), picture count, title (folder name); mind the order
  elem_url = driver.find_elements_by_xpath("//a[@class='ku']/img")
  elem_num = driver.find_elements_by_xpath("//div[@class='piclist3']/table/tbody/tr/td/dl/dd[1]")
  elem_title = driver.find_elements_by_xpath("//div[@class='piclist3']/table/tbody/tr/td/dl/dt/a")
  for url in elem_url: 
   pic_url = url.get_attribute("src")
   html_url = elem_title[count].get_attribute("href")
   print elem_title[count].text
   print html_url 
   print pic_url
   print elem_num[count].text
   
   #Create the folder for this gallery
   path = "E:\\Picture_HP\\" + elem_title[count].text + "\\"
   m = re.findall(r'(\w*[0-9]+)\w*', elem_num[count].text) #number of pictures in the gallery
   nums = int(m[0])
   count = count + 1 
   if os.path.isfile(path):   #Delete file 
    os.remove(path) 
   elif os.path.isdir(path):  #Delete dir 
    shutil.rmtree(path, True) 
   os.makedirs(path)    #create the file directory 
   getScript(html_url, path, nums) #visit pages
     
 except Exception,e: 
  print 'Error:',e 
 finally: 
  print 'Find ' + str(count) + ' pages with key\n' 
  
#Enter Function 
def main(): 
 #Create Folder 
 basePathDirectory = "E:\\Picture_HP" 
 if not os.path.exists(basePathDirectory): 
  os.makedirs(basePathDirectory) 
 
 #Input the Key for search str=>unicode=>utf-8 
 key = raw_input("Please input a key: ").decode(sys.stdin.encoding) 
 print 'The key is : ' + key 
 
 #Set URL List Sum:1-2 Pages 
 print 'Ready to start the Download!!!\n\n' 
 starttime = datetime.datetime.now() 
 num=1 
 while num<=1:
  #url = 'http://photo.hupu.com/nba/tag/%E9%99%88%E9%9C%B2?p=2&o=1'
  url = 'http://photo.hupu.com/nba/tag/%E9%A9%AC%E5%88%BA'  
  print 'Page '+str(num),'url:'+url 
  #Determine whether the title contains key 
  getTitle(url) 
  time.sleep(2) 
  num = num + 1 
 else: 
  print 'Download Over!!!' 
 
 #get the runtime 
 endtime = datetime.datetime.now() 
 print 'The Running time : ',(endtime - starttime).seconds 
   
main()

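Code walkthrough:

The program works in the following main steps:

1. The entry function main creates the picture folder Picture_HP on drive E and then sets the URL of the gallery listing. I originally planned to take a tag as input and build the address from it, because the URL looks like this: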

http://photo.hupu.com/nba/tag/馬刺

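However, handling the Chinese characters in the URL kept failing, so the script uses a hard-coded URL instead, which does not affect the overall result. You may also have noticed that the while loop condition is num<=1, so it runs only once; simply assign the URL of whichever listing page you want to download. Other listing pages on Hupu have links like the one below, so looping over all pages by building the URLs is also possible.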

http://photo.hupu.com/nba/tag/%E9%99%88%E9%9C%B2?p=2&o=1
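The Chinese-tag problem described above is usually a percent-encoding issue. As a minimal sketch (written for Python 3, whereas the article's script targets Python 2), the tag can be encoded with `urllib.parse.quote`, and the page number appended in the `?p=N&o=1` form seen in the example link. The helper name `build_tag_url` is hypothetical, and the meaning of `o=1` is not explained in the article; it is copied as-is.

```python
from urllib.parse import quote

# Tag prefix taken from the URLs quoted in the article.
BASE_URL = "http://photo.hupu.com/nba/tag/"


def build_tag_url(tag, page=1):
    """Return the listing URL for a tag (hypothetical helper)."""
    # quote() percent-encodes the UTF-8 bytes of the tag.
    url = BASE_URL + quote(tag)
    if page > 1:
        # "o=1" is copied from the article's example link; its meaning is an unknown.
        url += "?p={}&o=1".format(page)
    return url


print(build_tag_url("马刺"))     # the Spurs tag
print(build_tag_url("陈露", 2))  # second page of the Chen Lu tag
```

With a helper like this, the `while` loop in `main` could iterate over page numbers instead of being pinned to a single URL.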

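2. Call getTitle(url), which uses Selenium and PhantomJS to analyze the DOM of the listing page and calls find_elements_by_xpath to collect each gallery's link, title and picture count.

[Figure: DOM structure of the gallery listing page]

This function yields the title, URL and picture count of every gallery and creates a folder named after the gallery title. The code uses a regular expression to extract the picture count, turning the label "共19張" (19 pictures in total) into the number 19.

[Figure: gallery entry showing the title and picture count]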
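The count extraction can be sketched on its own. This version is written for Python 3 and uses a plain `\d+` search rather than the script's `(\w*[0-9]+)\w*` pattern, because in Python 3 `\w` also matches Chinese characters and would pull "共" into the match. The function name `parse_count` is hypothetical.

```python
import re


def parse_count(label):
    """Return the first run of digits in a label such as '共19張' as an int."""
    match = re.search(r"\d+", label)
    if match is None:
        raise ValueError("no digits found in %r" % label)
    return int(match.group())


print(parse_count("共19張"))
```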

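3. Next, call getScript(elem_url, path, nums); its parameters are the gallery URL, the save path and the number of pictures. How do we get the URL of the next picture?

Step 2 yields a gallery URL such as: http://photo.hupu.com/nba/p29556.html

(1). If the pictures are loaded dynamically through Ajax or JavaScript and the URLs follow no pattern, Selenium has to simulate a mouse click on "next picture" to obtain the URL of each original image.

(2). Many sites do follow a pattern, though. On Hupu the link to the ninth picture is shown below, so simple string handling is enough: "p29556-" + number + ".html".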

http://photo.hupu.com/nba/p29556-9.html
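The URL pattern in (2) can be illustrated with a small standalone sketch. The function name `picture_page_urls` is hypothetical; the logic mirrors what getScript does with `elem_url[:t] + '-' + str(count) + '.html'`.

```python
def picture_page_urls(gallery_url, count):
    """Build the per-picture page URLs for a gallery holding `count` pictures."""
    suffix = ".html"
    if not gallery_url.endswith(suffix):
        raise ValueError("expected a URL ending in .html: %r" % gallery_url)
    stem = gallery_url[:-len(suffix)]
    # Pages are numbered from 1, e.g. p29556-1.html ... p29556-9.html
    return ["{}-{}{}".format(stem, n, suffix) for n in range(1, count + 1)]


urls = picture_page_urls("http://photo.hupu.com/nba/p29556.html", 9)
print(urls[0])
print(urls[-1])
```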

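In this function I initially also used Selenium to parse the HTML and find the original image URL, but every picture then required a round trip through the PhantomJS headless browser, which was far too slow. I therefore switched to a regular expression that pulls the original image URL straight out of the HTML. The reason this works is shown below:

[Figure: page source containing the link to the original image]

Hupu took a shortcut here: it places the link to the original image lower down in the page source, so it can be read directly.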
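A rough sketch of that extraction, run against an invented HTML fragment, looks like this. The two marker strings and the "third div" rule come from getScript; the sample markup, the example.com link and the function name `extract_original_url` are assumptions made for illustration, since the article does not reproduce the real page source.

```python
import re

# Invented sample page; only the two marker <div> tags come from the article's code.
SAMPLE_HTML = """
<div class="flTab">tabs</div>
<div class="pic_bg">picture</div>
<div class="links"><a href="http://example.com/original/p29556-1.jpg">View original</a></div>
<div class="comMark" style>comments</div>
"""


def extract_original_url(html):
    """Return the first href inside the third <div> between the two markers."""
    start = html.find('<div class="flTab">')
    end = html.find('<div class="comMark" style>')
    if start == -1 or end == -1:
        raise ValueError("expected marker <div> tags not found")
    section = html[start:end]
    # re.S lets '.' span newlines so multi-line <div> bodies are captured.
    divs = re.findall(r"<div.*?>(.*?)</div>", section, re.S)
    if len(divs) < 3:
        raise ValueError("fewer than three <div> blocks in the section")
    links = re.findall(r"""href=["'](.+?)["']""", divs[2])
    if not links:
        raise ValueError("no link found in the third <div>")
    return links[0]


print(extract_original_url(SAMPLE_HTML))
```

The explicit checks replace the bare indexing in the script, so a layout change on the site surfaces as a clear error instead of an IndexError.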

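4. The final step downloads the picture with urllib.urlretrieve(pic_url, pic_path + pic_name).

You may run into the error "Error: [Errno 22] invalid mode ('wb') or filename". It appears when the file name contains a character such as '*', which is why loadPicture strips it; see Stack Overflow for details.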
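The script only removes '*', but Windows rejects several other characters in file names as well. A slightly broader sketch, assuming Python 3 and a hypothetical helper name `safe_filename`, takes the name from the URL path and drops all of them:

```python
import posixpath
import re
from urllib.parse import urlsplit

# Characters Windows does not allow in file names: \ / : * ? " < > |
INVALID_CHARS = r'[\\/:*?"<>|]'


def safe_filename(pic_url):
    """Derive a file name from the URL path and drop characters Windows rejects."""
    # urlsplit().path excludes the query string, so "?size=big" is not kept.
    name = posixpath.basename(urlsplit(pic_url).path)
    return re.sub(INVALID_CHARS, "", name)


print(safe_filename("http://example.com/img/spurs*01.jpg?size=big"))
```

The result can then be joined to the gallery folder and passed as the second argument of the download call (`urllib.request.urlretrieve` in Python 3).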

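Summary:

This article shows how to crawl Hupu galleries with Selenium and Python, and the material is fairly basic as crawlers go. The pictures downloaded for the "陳露" tag match what the site reports: 34 galleries and 902 pictures. With the regular-expression approach the run takes roughly 3 minutes, which is quick. Hupu has many other tags, and the football section should work in a similar way; changing the URL is all it takes to download other galleries.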
