Example: scraping a car website with Python, parsing the HTML, and saving the data to Excel

Last updated: 2013-12-04 14:34:40

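This example uses Python to scrape the dealer-information page of a car website and parse its HTML. The listing targets Python 2 (it uses urllib2 and print statements). You can adapt it to analyze the data on your own site.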

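1. The address of the car website

2. Inspecting the site in Firefox shows that it does not serve its data as JSON; the information is simply part of a plain HTML page

3. The HTML is parsed with PyQuery from the pyquery library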

Page layout:

The code is as follows:

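# Python 2 code. Module-level imports this method relies on:
import urllib2
from pyquery import PyQuery as pq
# Constant (the dictionary key names) and self.saver are the author's own helpers and are not shown.

def get_dealer_info(self):
        """Fetch dealer information."""
        css_select = 'html body div.box div.news_wrapper div.main div.news_list div.service_main div table tr '
        # css_select is the CSS path to the target rows, copied with Firefox's "copy CSS path" feature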
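        page = urllib2.urlopen(self.entry_url).read()
        # page now holds the raw HTML of the entry page
        page = page.replace('<br />','&')
        page = page.replace('<br/>','&')
        # Phone numbers on the page are separated by <br/>, which causes trouble when scraping:
        # if a cell contains <br/>, reading its text returns only the part before the <br/> and the rest is lost.
        # This happens because lxml (which pyquery builds on) keeps only the text before the first child element in .text;
        # the text after <br/> ends up in that element's .tail. Replacing <br/> with '&' first keeps the whole value in .text.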
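        d = pq(page)
        # Parse the page with PyQuery; pq is PyQuery because of "from pyquery import PyQuery as pq"
        dealer_list = []
        # dealer_list collects the dealers that are handed to the storage method
        strp = u''
        # strp remembers the province of the most recent 6-cell row; initialised so a 5-cell row cannot hit an undefined name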
        for dealer_div in d(css_select):
            # Each dealer_div is a tr element; the actual data sits in its td children
            p = dealer_div.findall('td')
            # p is the list of all td elements inside this tr
            dealer = {}
            # dealer holds the fields of one dealer and is appended to dealer_list
            if len(p)==1:
                # The if/elif branches clean up the data: rows whose layout does not fit the final format are dropped. Adapt this block to your own page.
                print '@'
            elif len(p)==6 :
                # A 6-cell row starts with the province cell; remember it for the 5-cell rows that follow
                strp = p[0].text.strip()
                dealer[Constant.PROVINCE] = p[0].text.strip()
                dealer[Constant.CITY] = p[1].text.strip()
                dealer[Constant.NAME] = p[2].text.strip()
                dealer[Constant.ADDRESSTYPE] = p[3].text.strip()
                dealer[Constant.ADDRESS] = p[4].text.strip()
                dealer[Constant.TELPHONE] = p[5].text.strip()
                dealer_list.append(dealer)  
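            elif len(p)==5:
                # A 5-cell row has no province cell, so strp is reused; rows whose first cell is u'省份' ("Province") are header rows and are skipped
                if p[0].text.strip() != u'省份':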
                    dealer[Constant.PROVINCE] = strp
                    dealer[Constant.CITY] = p[0].text.strip()
                    dealer[Constant.NAME] = p[1].text.strip()
                    dealer[Constant.ADDRESSTYPE] = p[2].text.strip()
                    dealer[Constant.ADDRESS] = p[3].text.strip()
                    dealer[Constant.TELPHONE] = p[4].text.strip()
                    dealer_list.append(dealer)
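            elif len(p)==3:
                # 3-cell rows carry no dealer data; only a debug marker is printed
                print '@@'
        print '@@@'
        # Hand the collected dealers to the author's saver object (not shown), presumably the Excel writer mentioned in the title
        self.saver.add(dealer_list)
        self.saver.commit()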
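
The two tricky parts of the method above, cells that contain <br/> and rows that lack a province cell, can be sketched in isolation. What follows is a minimal Python 3 sketch that uses only the standard library: xml.etree.ElementTree stands in for lxml (both expose .text, .tail and itertext()), and a CSV writer stands in for the saver, which the article does not show. The sample table, the field names, and the helper functions are all hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: the first row of a province has a rowspan cell,
# so the following row of the same province has one cell fewer.
SAMPLE = """
<table>
  <tr><th>Province</th><th>City</th><th>Name</th><th>Phone</th></tr>
  <tr><td rowspan="2">Hebei</td><td>Baoding</td><td>Dealer A</td><td>0312-1111<br/>0312-2222</td></tr>
  <tr><td>Tangshan</td><td>Dealer B</td><td>0315-3333</td></tr>
</table>
"""

FIELDS = ["province", "city", "name", "phone"]


def cell_text(td):
    """Join every text piece in a cell with '&', including text after <br/>."""
    pieces = (piece.strip() for piece in td.itertext())
    return "&".join(piece for piece in pieces if piece)


def parse_rows(table):
    """Turn table rows into dicts, carrying the province forward."""
    dealers = []
    province = None  # province of the most recent full row
    for tr in table.iter("tr"):
        cells = [cell_text(td) for td in tr.findall("td")]
        if len(cells) == 4:
            province, rest = cells[0], cells[1:]
        elif len(cells) == 3 and province is not None:
            rest = cells  # no province cell: reuse the remembered one
        else:
            continue  # header row (th cells only) or unexpected layout
        dealers.append(dict(zip(FIELDS, [province] + rest)))
    return dealers


def to_csv(dealers):
    """Serialise the dealers as CSV text, a stand-in for the unseen saver."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(dealers)
    return buffer.getvalue()


if __name__ == "__main__":
    table = ET.fromstring(SAMPLE.strip())
    print(to_csv(parse_rows(table)))
```

Collecting the pieces with itertext() avoids rewriting the raw HTML before parsing. Whether that is preferable to the replace() approach in the listing depends on the page being scraped.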