快捷導(dǎo)航

python爬取w3shcool的JQuery課程并且保存到本地

更新時(shí)間：2017年04月06日 10:26:31 作者：北漂的雷子

本文主要介紹python爬取w3shcool的JQuery的課程并且保存到本地的方法解析。具有很好的參考價(jià)值。下面跟著小編一起來(lái)看下吧

最近在忙于找工作，閑暇之余，也找點(diǎn)爬蟲(chóng)項(xiàng)目練練手，寫(xiě)寫(xiě)代碼，知道自己是個(gè)菜鳥(niǎo)，但是要多加練習(xí)，書(shū)山有路勤為徑。各位有測(cè)試坑可以給我介紹個(gè)啊，自動(dòng)化，功能，接口都可以做。

首先呢，我們明確需求，很多同學(xué)呢，有事沒(méi)事就想看看一些技術(shù)，比如我想看看JQuery的語(yǔ)法呢，可是我現(xiàn)在沒(méi)有網(wǎng)絡(luò)，手機(jī)上也沒(méi)有電子書(shū)，真的讓我們很難受，那么別著急啊，你這需求我在這里滿足你，首先呢，你的需求是獲取JQuery的語(yǔ)法的，那么我在看到這個(gè)需求，我有響應(yīng)的網(wǎng)站那么我們接下來(lái)去分析這個(gè)網(wǎng)站。http://www.w3school.com.cn/jquery/jquery_syntax.asp 這是語(yǔ)法url， http://www.w3school.com.cn/jquery/jquery_intro.asp 這是簡(jiǎn)介的url，那么我們拿到很多的url分析到，我們的http://www.w3school.com.cn/jquery是相同的，那么我們?cè)趤?lái)分析在界面怎么可以獲取得到這些，我們可以看到右面有相應(yīng)的目標(biāo)欄，那么我們?nèi)シ治鱿?/p>

我們來(lái)看下這些鏈接，。我們可以吧這些鏈接和http://www.w3school.com.cn拼接到一起。然后組成我們新的url，

上代碼

import urllib.request
from bs4 import BeautifulSoup 
import time
def head():
 headers={
 'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'
 }
 return headers
def parse_url(url):
 hea=head()
 resposne=urllib.request.Request(url,headers=hea)
 html=urllib.request.urlopen(resposne).read().decode('gb2312')
 return html
def url_s():
 url='http://www.w3school.com.cn/jquery/index.asp'
 html=parse_url(url)
 soup=BeautifulSoup(html)
 me=soup.find_all(id='course')
 m_url_text=[]
 m_url=[]
 for link in me:
  m_url_text.append(link.text)
  m=link.find_all('a')
  for i in m:
   m_url.append(i.get('href'))
 for i in m_url_text:
  h=i.encode('utf-8').decode('utf-8')
  m_url_text=h.split('\n')
 return m_url,m_url_text

這樣我們使用url_s這個(gè)函數(shù)就可以獲取我們所有的鏈接。

['/jquery/index.asp', '/jquery/jquery_intro.asp', '/jquery/jquery_install.asp', '/jquery/jquery_syntax.asp', '/jquery/jquery_selectors.asp', '/jquery/jquery_events.asp', '/jquery/jquery_hide_show.asp', '/jquery/jquery_fade.asp', '/jquery/jquery_slide.asp', '/jquery/jquery_animate.asp', '/jquery/jquery_stop.asp', '/jquery/jquery_callback.asp', '/jquery/jquery_chaining.asp', '/jquery/jquery_dom_get.asp', '/jquery/jquery_dom_set.asp', '/jquery/jquery_dom_add.asp', '/jquery/jquery_dom_remove.asp', '/jquery/jquery_css_classes.asp', '/jquery/jquery_css.asp', '/jquery/jquery_dimensions.asp', '/jquery/jquery_traversing.asp', '/jquery/jquery_traversing_ancestors.asp', '/jquery/jquery_traversing_descendants.asp', '/jquery/jquery_traversing_siblings.asp', '/jquery/jquery_traversing_filtering.asp', '/jquery/jquery_ajax_intro.asp', '/jquery/jquery_ajax_load.asp', '/jquery/jquery_ajax_get_post.asp', '/jquery/jquery_noconflict.asp', '/jquery/jquery_examples.asp', '/jquery/jquery_quiz.asp', '/jquery/jquery_reference.asp', '/jquery/jquery_ref_selectors.asp', '/jquery/jquery_ref_events.asp', '/jquery/jquery_ref_effects.asp', '/jquery/jquery_ref_manipulation.asp', '/jquery/jquery_ref_attributes.asp', '/jquery/jquery_ref_css.asp', '/jquery/jquery_ref_ajax.asp', '/jquery/jquery_ref_traversing.asp', '/jquery/jquery_ref_data.asp', '/jquery/jquery_ref_dom_element_methods.asp', '/jquery/jquery_ref_core.asp', '/jquery/jquery_ref_prop.asp'], ['jQuery 教程', '', 'jQuery 教程', 'jQuery 簡(jiǎn)介', 'jQuery 安裝', 'jQuery 語(yǔ)法', 'jQuery 選擇器', 'jQuery 事件', '', 'jQuery 效果', '', 'jQuery 隱藏/顯示', 'jQuery 淡入淡出', 'jQuery 滑動(dòng)', 'jQuery 動(dòng)畫(huà)', 'jQuery stop()', 'jQuery Callback', 'jQuery Chaining', '', 'jQuery HTML', '', 'jQuery 獲取', 'jQuery 設(shè)置', 'jQuery 添加', 'jQuery 刪除', 'jQuery CSS 類', 'jQuery css()', 'jQuery 尺寸', '', 'jQuery 遍歷', '', 'jQuery 遍歷', 'jQuery 祖先', 'jQuery 后代', 'jQuery 同胞', 'jQuery 過(guò)濾', '', 'jQuery AJAX', '', 'jQuery AJAX 簡(jiǎn)介', 'jQuery 加載', 'jQuery Get/Post', '', 'jQuery 雜項(xiàng)', '', 'jQuery noConflict()', '', 'jQuery 實(shí)例', '', 'jQuery 實(shí)例', 'jQuery 測(cè)驗(yàn)', '', 'jQuery 參考手冊(cè)', '', 'jQuery 參考手冊(cè)', 'jQuery 選擇器', 'jQuery 事件', 'jQuery 效果', 'jQuery 文檔操作', 'jQuery 屬性操作', 'jQuery CSS 操作', 'jQuery Ajax', 'jQuery 遍歷', 'jQuery 數(shù)據(jù)', 'jQuery DOM 元素', 'jQuery 核心', 'jQuery 屬性', '', ''])

這是所有鏈接還有對(duì)應(yīng)鏈接的所對(duì)應(yīng)的語(yǔ)法模塊的名字。那么我們接下來(lái)就是去拼接urls，使用的是str的拼接

 ['http://www.w3school.com.cn//jquery/index.asp', 'http://www.w3school.com.cn//jquery/jquery_intro.asp', 'http://www.w3school.com.cn//jquery/jquery_install.asp', 'http://www.w3school.com.cn//jquery/jquery_syntax.asp', 'http://www.w3school.com.cn//jquery/jquery_selectors.asp', 'http://www.w3school.com.cn//jquery/jquery_events.asp', 'http://www.w3school.com.cn//jquery/jquery_hide_show.asp', 'http://www.w3school.com.cn//jquery/jquery_fade.asp', 'http://www.w3school.com.cn//jquery/jquery_slide.asp', 'http://www.w3school.com.cn//jquery/jquery_animate.asp', 'http://www.w3school.com.cn//jquery/jquery_stop.asp', 'http://www.w3school.com.cn//jquery/jquery_callback.asp', 'http://www.w3school.com.cn//jquery/jquery_chaining.asp', 'http://www.w3school.com.cn//jquery/jquery_dom_get.asp', 'http://www.w3school.com.cn//jquery/jquery_dom_set.asp', 'http://www.w3school.com.cn//jquery/jquery_dom_add.asp', 'http://www.w3school.com.cn//jquery/jquery_dom_remove.asp', 'http://www.w3school.com.cn//jquery/jquery_css_classes.asp', 'http://www.w3school.com.cn//jquery/jquery_css.asp', 'http://www.w3school.com.cn//jquery/jquery_dimensions.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing_ancestors.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing_descendants.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing_siblings.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing_filtering.asp', 'http://www.w3school.com.cn//jquery/jquery_ajax_intro.asp', 'http://www.w3school.com.cn//jquery/jquery_ajax_load.asp', 'http://www.w3school.com.cn//jquery/jquery_ajax_get_post.asp', 'http://www.w3school.com.cn//jquery/jquery_noconflict.asp', 'http://www.w3school.com.cn//jquery/jquery_examples.asp', 'http://www.w3school.com.cn//jquery/jquery_quiz.asp', 'http://www.w3school.com.cn//jquery/jquery_reference.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_selectors.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_events.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_effects.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_manipulation.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_attributes.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_css.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_ajax.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_traversing.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_data.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_dom_element_methods.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_core.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_prop.asp']

那么我們有這個(gè)所有的urls，那么我們來(lái)分析下，文章正文。

分析可以得到我們的所有的正文都是在一個(gè)id=maincontent中，那么我們直接解析每個(gè)界面中的id=maincontent的標(biāo)簽，獲取響應(yīng)的text文檔，并且保存就好。

所以我們所有的代碼如下：

import urllib.request
from bs4 import BeautifulSoup 
import time
def head():
 headers={
 'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'
 }
 return headers
def parse_url(url):
 hea=head()
 resposne=urllib.request.Request(url,headers=hea)
 html=urllib.request.urlopen(resposne).read().decode('gb2312')
 return html
def url_s():
 url='http://www.w3school.com.cn/jquery/index.asp'
 html=parse_url(url)
 soup=BeautifulSoup(html)
 me=soup.find_all(id='course')
 m_url_text=[]
 m_url=[]
 for link in me:
  m_url_text.append(link.text)
  m=link.find_all('a')
  for i in m:
   m_url.append(i.get('href'))
 for i in m_url_text:
  h=i.encode('utf-8').decode('utf-8')
  m_url_text=h.split('\n')
 return m_url,m_url_text
def xml():
 url,url_text=url_s()
 url_jque=[]
 for link in url:
  url_jque.append('http://www.w3school.com.cn/'+link)
 return url_jque
def xiazai():
 urls=xml()
 i=0
 for url in urls:
  html=parse_url(url)
  soup=BeautifulSoup(html)
  me=soup.find_all(id='maincontent')
  with open(r'%s.txt'%i,'wb') as f:
   for h in me:
    f.write(h.text.encode('utf-8'))
    print(i)
  i+=1
if __name__ == '__main__':
 xiazai()

import urllib.request
from bs4 import BeautifulSoup 
import time
def head():
 headers={
 'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'
 }
 return headers
def parse_url(url):
 hea=head()
 resposne=urllib.request.Request(url,headers=hea)
 html=urllib.request.urlopen(resposne).read().decode('gb2312')
 return html
def url_s():
 url='http://www.w3school.com.cn/jquery/index.asp'
 html=parse_url(url)
 soup=BeautifulSoup(html)
 me=soup.find_all(id='course')
 m_url_text=[]
 m_url=[]
 for link in me:
  m_url_text.append(link.text)
  m=link.find_all('a')
  for i in m:
   m_url.append(i.get('href'))
 for i in m_url_text:
  h=i.encode('utf-8').decode('utf-8')
  m_url_text=h.split('\n')
 return m_url,m_url_text

def xml():
 url,url_text=url_s()
 url_jque=[]
 for link in url:
  url_jque.append('http://www.w3school.com.cn/'+link)
 return url_jque
def xiazai():
 urls=xml()
 i=0
 for url in urls:
  html=parse_url(url)
  soup=BeautifulSoup(html)
  me=soup.find_all(id='maincontent')
  with open(r'%s.txt'%i,'wb') as f:
   for h in me:
    f.write(h.text.encode('utf-8'))
    print(i)
  i+=1
if __name__ == '__main__':
 xiazai()

結(jié)果