Example Python Code for Crawling All Flight Tickets on Ctrip
Open the Ctrip website and search for flights, for example from Guangzhou to Chengdu.
The URL is then: http://flights.ctrip.com/booking/CAN-CTU-day-1.html?DDate1=2018-06-15
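Here CAN stands for Guangzhou and CTU for Chengdu, and the date "2018-06-15" is self-explanatory. An ordinary crawler only needs to substitute these three values to walk through every route. On closer inspection, however, the page also issues a request that returns all of the current page's data as JSON. It is the SearchFirstRouteFlights URL that appears in the full script at the end of this article.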
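The substitution of city codes and date can be sketched as follows. This is a minimal illustration that assumes the booking URL keeps the pattern shown above; the function name `build_booking_url` is made up for this example and is not part of the original script.

```python
from urllib.parse import urlencode

# URL pattern taken from the example above; {dep} and {arr} are city codes.
BOOKING_URL = "http://flights.ctrip.com/booking/{dep}-{arr}-day-1.html"


def build_booking_url(dep_code, arr_code, date_str):
    """Return the booking-page URL for one route and one departure date."""
    base = BOOKING_URL.format(dep=dep_code.upper(), arr=arr_code.upper())
    # urlencode leaves the digits and hyphens of an ISO date unchanged.
    return base + "?" + urlencode({"DDate1": date_str})


print(build_booking_url("CAN", "CTU", "2018-06-15"))
```

Swapping in another pair of codes or another date yields the URL for that route and day.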
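The JSON request carries the same city codes and date. It returns a JSON document holding the data for the current page, in which "fis" is the flight information.
Each request only needs a different pair of city codes and a different date. I compiled the city codes by hand: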
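To give an idea of what reading "fis" looks like, here is a small sketch. The payload is a hypothetical, heavily trimmed sample: only the key names ("als", "fis", "alc", "fn", "dt", "at", "lp") come from the script in this article, while the values and the timestamp format are assumptions.

```python
import json
from datetime import datetime

# Hypothetical sample; a real response contains many more fields.
SAMPLE = """
{"als": {"CZ": "China Southern"},
 "fis": [{"alc": "CZ", "fn": "CZ3401",
          "dt": "2018-06-15 08:00:00", "at": "2018-06-15 10:35:00",
          "lp": 800}]}
"""

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format of "dt" and "at"


def summarize_flights(payload):
    """Return one small dict per flight found under the "fis" key."""
    data = json.loads(payload)
    airlines = data.get("als", {})
    rows = []
    for flight in data.get("fis") or []:
        dep = datetime.strptime(flight["dt"], TIME_FORMAT)
        arr = datetime.strptime(flight["at"], TIME_FORMAT)
        rows.append({
            "flight": flight["fn"],
            "airline": airlines.get(flight["alc"].strip()),
            "price": flight["lp"],
            "minutes": int((arr - dep).total_seconds() // 60),
        })
    return rows


print(summarize_flights(SAMPLE))
```

For the sample above the single flight comes out with a duration of 155 minutes. An empty "fis" list simply produces no rows.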
# Note: a few codes (for example DAX, SWA, LUM, JHG and YTY) appear twice below.
# In a Python dict literal the last value for a repeated key wins.
city={"YIE":"阿爾山","AKU":"阿克蘇","RHT":"阿拉善右旗","AXF":"阿拉善左旗","AAT":"阿勒泰","NGQ":"阿里","MFM":"澳門"
,"AQG":"安慶","AVA":"安順","AOG":"鞍山","RLK":"巴彥淖爾","AEB":"百色","BAV":"包頭","BSD":"保山","BHY":"北海","BJS":"北京"
,"DBC":"白城","NBS":"白山","BFJ":"畢節","BPL":"博樂","CKG":"重慶","BPX":"昌都","CGD":"常德","CZX":"常州"
,"CHG":"朝陽","CTU":"成都","JUH":"池州","CIF":"赤峰","SWA":"潮州","CGQ":"長春","CSX":"長沙","CIH":"長治","CDE":"承德"
,"CWJ":"滄源","DAX":"達州","DLU":"大理","DLC":"大連","DQA":"大慶","DAT":"大同","DDG":"丹東","DCY":"稻城","DOY":"東營"
,"DNH":"敦煌","DAX":"達縣","LUM":"德宏","EJN":"額濟納旗","DSN":"鄂爾多斯","ENH":"恩施","ERL":"二連浩特","FUO":"佛山"
,"FOC":"福州","FYJ":"撫遠","FUG":"阜陽","KOW":"贛州","GOQ":"格爾木","GYU":"固原","GYS":"廣元","CAN":"廣州","KWE":"貴陽"
,"KWL":"桂林","HRB":"哈爾濱","HMI":"哈密","HAK":"海口","HLD":"海拉爾","HDG":"邯鄲","HZG":"漢中","HGH":"杭州","HFE":"合肥"
,"HTN":"和田","HEK":"黑河","HET":"呼和浩特","HIA":"淮安","HJJ":"懷化","TXN":"黃山","HUZ":"惠州","JXA":"雞西","TNA":"濟南"
,"JNG":"濟寧","JGD":"加格達奇","JMU":"佳木斯","JGN":"嘉峪關","SWA":"揭陽","JIC":"金昌","KNH":"金門","JNZ":"錦州"
,"CYI":"嘉義","JHG":"景洪","JSJ":"建三江","JJN":"晉江","JGS":"井岡山","JDZ":"景德鎮","JIU":"九江","JZH":"九寨溝","KHG":"喀什"
,"KJH":"凱里","KGT":"康定","KRY":"克拉瑪依","KCA":"庫車","KRL":"庫爾勒","KMG":"昆明","LXA":"拉薩","LHW":"蘭州","HZH":"黎平"
,"LJG":"麗江","LLB":"荔波","LYG":"連云港","LPF":"六盤水","LFQ":"臨汾","LZY":"林芝","LNJ":"臨滄","LYI":"臨沂","LZH":"柳州"
,"LZO":"瀘州","LYA":"洛陽","LLV":"呂梁","JMJ":"瀾滄","LCX":"龍巖","NZH":"滿洲里","LUM":"芒市","MXZ":"梅州","MIG":"綿陽"
,"OHE":"漠河","MDG":"牡丹江","MFK":"馬祖","KHN":"南昌","NAO":"南充","NKG":"南京","NNG":"南寧","NTG":"南通","NNY":"南陽"
,"NGB":"寧波","NLH":"寧蒗","PZI":"攀枝花","SYM":"普洱","NDG":"齊齊哈爾","JIQ":"黔江","IQM":"且末","BPE":"秦皇島","TAO":"青島"
,"IQN":"慶陽","JUZ":"衢州","RKZ":"日喀則","RIZ":"日照","SYX":"三亞","XMN":"廈門","SHA":"上海","SZX":"深圳","HPG":"神農架"
,"SHE":"沈陽","SJW":"石家莊","TCG":"塔城","HYN":"臺州","TYN":"太原","YTY":"泰州","TVS":"唐山","TCZ":"騰沖","TSN":"天津"
,"THQ":"天水","TGO":"通遼","TEN":"銅仁","TLQ":"吐魯番","WXN":"萬州","WEH":"威海","WEF":"濰坊","WNZ":"溫州","WNH":"文山"
,"WUA":"烏海","HLH":"烏蘭浩特","URC":"烏魯木齊","WUX":"無錫","WUZ":"梧州","WUH":"武漢","WUS":"武夷山","SIA":"西安","XIC":"西昌"
,"XNN":"西寧","JHG":"西雙版納","XIL":"錫林浩特","DIG":"香格里拉(迪慶)","XFN":"襄陽","ACX":"興義","XUZ":"徐州","HKG":"香港"
,"YNT":"煙臺","ENY":"延安","YNJ":"延吉","YNZ":"鹽城","YTY":"揚州","LDS":"伊春","YIN":"伊寧","YBP":"宜賓","YIH":"宜昌"
,"YIC":"宜春","YIW":"義烏","INC":"銀川","LLF":"永州","UYN":"榆林","YUS":"玉樹","YCU":"運城","ZHA":"湛江","DYG":"張家界"
,"ZQZ":"張家口","YZY":"張掖","ZAT":"昭通","CGO":"鄭州","ZHY":"中衛","HSN":"舟山","ZUH":"珠海","WMT":"遵義(茅臺)","ZYI":"遵義(新舟)"}
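With a code table like this, the set of routes to request is every ordered pair of distinct codes. The sketch below shows one way to enumerate them with `itertools.permutations`; the helper name `route_pairs` and the three-city sample are illustrative only.

```python
from itertools import permutations


def route_pairs(city_codes):
    """Return every ordered (departure, arrival) pair of distinct codes."""
    return permutations(sorted(city_codes), 2)


# Tiny illustrative subset of the code table.
sample = {"CAN": "廣州", "CTU": "成都", "BJS": "北京"}
pairs = list(route_pairs(sample))
print(len(pairs))   # 6, i.e. 3 * 2 ordered pairs
print(pairs[0])     # ('BJS', 'CAN')
```

The number of pairs grows as n * (n - 1), so a table of roughly 200 codes means on the order of 40,000 requests for a single day of data. That is why request pacing matters for this crawler.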
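To avoid HTTP 429 responses caused by frequent requests, the script keeps a list of User-Agent strings and picks one at random for each request. Requests that are still too frequent sometimes trigger a CAPTCHA, so the crawler also pauses for 10 seconds after finishing each departure city.
First create a table to store the data. SQL Server is used here: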
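The two ideas, a random User-Agent per request and a pause per departure city, can be sketched in isolation like this. The `fetch` callback and the `sleep` parameter are hypothetical names introduced so the pacing logic can be exercised without touching the network; they do not appear in the original script.

```python
import random
import time

# Two of the User-Agent strings used by the script; any list would do.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
]


def build_headers(rng=random):
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(USER_AGENTS), "Connection": "keep-alive"}


def crawl_departures(departures, fetch, pause_seconds=10, sleep=time.sleep):
    """Call fetch(code, headers) per departure city, pausing after each one."""
    for code in departures:
        fetch(code, build_headers())
        sleep(pause_seconds)


# Dry run: print instead of fetching, and skip the real pause.
crawl_departures(["CAN", "CTU"],
                 fetch=lambda code, headers: print(code, headers["User-Agent"][:30]),
                 sleep=lambda seconds: None)
```

Passing `sleep` as a parameter is only a convenience for trying the loop out quickly; in a real run the default `time.sleep` applies the 10-second pause.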
CREATE TABLE KKFlight(
    ID int IDENTITY(1,1),                   -- auto-increment ID
    ItinerarDate date,                      -- itinerary date
    Airline varchar(100),                   -- airline
    AirlineCode varchar(100),               -- airline code
    FlightNumber varchar(20),               -- flight number
    FlightNumberS varchar(20),              -- code-share flight number (operating flight)
    Aircraft varchar(50),                   -- aircraft model
    AircraftSize char(2),                   -- aircraft size (L large; M medium; S small)
    AirportTax decimal(10,2),               -- airport construction fee
    FuelOilTax decimal(10,2),               -- fuel surcharge
    FromCity varchar(50),                   -- departure city
    FromCityCode varchar(10),               -- departure city code
    FromAirport varchar(50),                -- departure airport
    FromTerminal varchar(20),               -- departure terminal
    FromDateTime datetime,                  -- departure time
    ToCity varchar(50),                     -- arrival city
    ToCityCode varchar(10),                 -- arrival city code
    ToAirport varchar(50),                  -- arrival airport
    ToTerminal varchar(20),                 -- arrival terminal
    ToDateTime datetime,                    -- arrival time
    DurationHour int,                       -- duration (hours)
    DurationMinute int,                     -- duration (minutes)
    Duration varchar(20),                   -- duration (string)
    Currency varchar(10),                   -- currency
    TicketPrices decimal(10,2),             -- ticket price
    Discount decimal(4,2),                  -- discount applied
    PunctualityRate decimal(4,2),           -- on-time rate
    AircraftCabin char(1),                  -- cabin class (F first; C business; Y economy)
    InsertDate datetime default(getdate())  -- time the row was added
)
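For readers without a SQL Server instance at hand, the storage step can be tried out against SQLite from the Python standard library. This is a stand-in, not what the article uses: the table below keeps only a handful of the columns, the SQLite types differ from the SQL Server ones, and the inserted row is made-up sample data.

```python
import sqlite3


def demo_store():
    """Create a reduced KKFlight table in memory and insert one sample row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE KKFlight (
            ID INTEGER PRIMARY KEY AUTOINCREMENT,        -- auto-increment ID
            ItinerarDate TEXT,                           -- itinerary date
            FlightNumber TEXT,                           -- flight number
            FromCityCode TEXT,                           -- departure city code
            ToCityCode TEXT,                             -- arrival city code
            TicketPrices REAL,                           -- ticket price
            InsertDate TEXT DEFAULT CURRENT_TIMESTAMP    -- time the row was added
        )
    """)
    # Hypothetical sample row; parameters are bound with "?" placeholders.
    conn.execute(
        "INSERT INTO KKFlight (ItinerarDate, FlightNumber, FromCityCode, "
        "ToCityCode, TicketPrices) VALUES (?, ?, ?, ?, ?)",
        ("2018-06-16", "CZ3401", "CAN", "CTU", 800.0),
    )
    row = conn.execute(
        "SELECT ID, FlightNumber, TicketPrices FROM KKFlight").fetchone()
    conn.close()
    return row


print(demo_store())
```

As in the SQL Server table, the ID and the insert time are filled in by the database, so the crawler only supplies the flight fields.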
Because all cities are crawled, the cities are not restricted; only the dates are, that is, the first and last day to crawl. The full script follows:
# -*- coding: utf-8 -*-
# Python 3.5.0
import json
import time
import random
import datetime
import sqlalchemy
import urllib.request
import pandas as pd
from operator import itemgetter
from dateutil.parser import parse


class FLIGHT(object):

    def __init__(self):
        self.Airline = {}  # airline code -> airline name
        self.engine = sqlalchemy.create_engine("mssql+pymssql://kk:kk@HZC/Myspider")
        self.url = ''
        self.headers = {}
        self.city = {
            "AAT":"阿勒泰","ACX":"興義","AEB":"百色","AKU":"阿克蘇","AOG":"鞍山","AQG":"安慶","AVA":"安順","AXF":"阿拉善左旗","BAV":"包頭","BFJ":"畢節","BHY":"北海",
            "BJS":"北京","BPE":"秦皇島","BPL":"博樂","BPX":"昌都","BSD":"保山","CAN":"廣州","CDE":"承德","CGD":"常德","CGO":"鄭州","CGQ":"長春","CHG":"朝陽","CIF":"赤峰",
            "CIH":"長治","CKG":"重慶","CSX":"長沙","CTU":"成都","CWJ":"滄源","CYI":"嘉義","CZX":"常州","DAT":"大同","DAX":"達縣","DBC":"白城","DCY":"稻城","DDG":"丹東",
            "DIG":"香格里拉(迪慶)","DLC":"大連","DLU":"大理","DNH":"敦煌","DOY":"東營","DQA":"大慶","DSN":"鄂爾多斯","DYG":"張家界","EJN":"額濟納旗","ENH":"恩施",
            "ENY":"延安","ERL":"二連浩特","FOC":"福州","FUG":"阜陽","FUO":"佛山","FYJ":"撫遠","GOQ":"格爾木","GYS":"廣元","GYU":"固原","HAK":"海口","HDG":"邯鄲",
            "HEK":"黑河","HET":"呼和浩特","HFE":"合肥","HGH":"杭州","HIA":"淮安","HJJ":"懷化","HKG":"香港","HLD":"海拉爾","HLH":"烏蘭浩特","HMI":"哈密","HPG":"神農架",
            "HRB":"哈爾濱","HSN":"舟山","HTN":"和田","HUZ":"惠州","HYN":"臺州","HZG":"漢中","HZH":"黎平","INC":"銀川","IQM":"且末","IQN":"慶陽","JDZ":"景德鎮",
            "JGD":"加格達奇","JGN":"嘉峪關","JGS":"井岡山","JHG":"西雙版納","JIC":"金昌","JIQ":"黔江","JIU":"九江","JJN":"晉江","JMJ":"瀾滄","JMU":"佳木斯","JNG":"濟寧",
            "JNZ":"錦州","JSJ":"建三江","JUH":"池州","JUZ":"衢州","JXA":"雞西","JZH":"九寨溝","KCA":"庫車","KGT":"康定","KHG":"喀什","KHN":"南昌","KJH":"凱里","KMG":"昆明",
            "KNH":"金門","KOW":"贛州","KRL":"庫爾勒","KRY":"克拉瑪依","KWE":"貴陽","KWL":"桂林","LCX":"龍巖","LDS":"伊春","LFQ":"臨汾","LHW":"蘭州","LJG":"麗江","LLB":"荔波",
            "LLF":"永州","LLV":"呂梁","LNJ":"臨滄","LPF":"六盤水","LUM":"芒市","LXA":"拉薩","LYA":"洛陽","LYG":"連云港","LYI":"臨沂","LZH":"柳州","LZO":"瀘州",
            "LZY":"林芝","MDG":"牡丹江","MFK":"馬祖","MFM":"澳門","MIG":"綿陽","MXZ":"梅州","NAO":"南充","NBS":"白山","NDG":"齊齊哈爾","NGB":"寧波","NGQ":"阿里",
            "NKG":"南京","NLH":"寧蒗","NNG":"南寧","NNY":"南陽","NTG":"南通","NZH":"滿洲里","OHE":"漠河","PZI":"攀枝花","RHT":"阿拉善右旗","RIZ":"日照","RKZ":"日喀則",
            "RLK":"巴彥淖爾","SHA":"上海","SHE":"沈陽","SIA":"西安","SJW":"石家莊","SWA":"揭陽","SYM":"普洱","SYX":"三亞","SZX":"深圳","TAO":"青島","TCG":"塔城","TCZ":"騰沖",
            "TEN":"銅仁","TGO":"通遼","THQ":"天水","TLQ":"吐魯番","TNA":"濟南","TSN":"天津","TVS":"唐山","TXN":"黃山","TYN":"太原","URC":"烏魯木齊","UYN":"榆林","WEF":"濰坊",
            "WEH":"威海","WMT":"遵義(茅臺)","WNH":"文山","WNZ":"溫州","WUA":"烏海","WUH":"武漢","WUS":"武夷山","WUX":"無錫","WUZ":"梧州","WXN":"萬州","XFN":"襄陽","XIC":"西昌",
            "XIL":"錫林浩特","XMN":"廈門","XNN":"西寧","XUZ":"徐州","YBP":"宜賓","YCU":"運城","YIC":"宜春","YIE":"阿爾山","YIH":"宜昌","YIN":"伊寧","YIW":"義烏","YNJ":"延吉",
            "YNT":"煙臺","YNZ":"鹽城","YTY":"揚州","YUS":"玉樹","YZY":"張掖","ZAT":"昭通","ZHA":"湛江","ZHY":"中衛","ZQZ":"張家口","ZUH":"珠海","ZYI":"遵義(新舟)",
        }
        # Left out of the table: {"KJI":"布爾津"}
        self.UserAgent = [
            "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
            "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0) Gecko/16.0 Firefox/16.0",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
            "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36",
            # The original listing had no comma after the next entry, which silently
            # merged it with the one after it into a single string.
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17",
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
            "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        ]

    # Walk through every date between the two dates and every ordered city pair
    def set_url_headers(self, startdate, enddate):
        startDate = datetime.datetime.strptime(startdate, '%Y-%m-%d')
        endDate = datetime.datetime.strptime(enddate, '%Y-%m-%d')
        while startDate <= endDate:
            today = startDate.strftime('%Y-%m-%d')
            for fromcode, fromcity in sorted(self.city.items(), key=itemgetter(0)):
                for tocode, tocity in sorted(self.city.items(), key=itemgetter(0)):
                    if fromcode != tocode:
                        self.url = 'http://flights.ctrip.com/domesticsearch/search/SearchFirstRouteFlights?DCity1=%s&ACity1=%s&SearchType=S&DDate1=%s&IsNearAirportRecommond=0&LogToken=027e478a47494975ad74857b18283e12&rk=4.381066884522498182534&CK=9FC7881E8F373585C0E5F89152BC143D&r=0.24149333708195565406316' % (fromcode, tocode, today)
                        self.headers = {
                            "Host": "flights.ctrip.com",
                            "User-Agent": random.choice(self.UserAgent),
                            "Referer": "https://flights.ctrip.com/booking/%s-%s-day-1.html?DDate1=%s" % (fromcode, tocode, today),
                            "Connection": "keep-alive",
                        }
                        print("%s : %s(%s) ==> %s(%s) " % (today, fromcity, fromcode, tocity, tocode))
                        self.get_parse_json_data(today)
                # Pause for 10 seconds after each departure city
                time.sleep(10)
            startDate += datetime.timedelta(days=1)

    # Fetch the JSON data behind one page
    def get_one_page_json_data(self):
        req = urllib.request.Request(self.url, headers=self.headers)
        body = urllib.request.urlopen(req, timeout=30).read().decode('gbk')
        jsonData = json.loads(body.strip("'<>() ").replace('\'', '\"'))
        return jsonData

    # Fetch the data behind one page, parse it and save it to the database
    def get_parse_json_data(self, today):
        jsonData = self.get_one_page_json_data()
        columns = ['ItinerarDate','Airline','AirlineCode','FlightNumber','FlightNumberS','Aircraft','AircraftSize',
                   'AirportTax','FuelOilTax','FromCity','FromCityCode','FromAirport','FromTerminal','FromDateTime',
                   'ToCity','ToCityCode','ToAirport','ToTerminal','ToDateTime','DurationHour','DurationMinute',
                   'Duration','Currency','TicketPrices','Discount','PunctualityRate','AircraftCabin']
        if jsonData["fis"]:
            # Remember airline codes and names seen so far
            for code, name in jsonData["als"].items():
                self.Airline.setdefault(code, name)
            rows = []
            for data in jsonData["fis"]:
                alc = data["alc"].strip()
                seconds = (parse(data["at"]) - parse(data["dt"])).seconds
                hours, minutes = seconds // 3600, seconds % 3600 // 60
                # Note: the "a"/"d" prefixes look like arrival/departure (compare
                # "at"/"dt"), so the From*/To* city, airport and terminal fields
                # may be swapped here. Check against a live response.
                rows.append({
                    'ItinerarDate': today,                 # itinerary date
                    'Airline': self.Airline.get(alc),      # airline name, None if unknown
                    'AirlineCode': alc,                    # airline code
                    'FlightNumber': data["fn"],            # flight number
                    'FlightNumberS': data["sdft"],         # code-share flight number (operating flight)
                    'Aircraft': data["cf"]["c"],           # aircraft model
                    'AircraftSize': data["cf"]["s"],       # aircraft size (L large; M medium; S small)
                    'AirportTax': data["tax"],             # airport construction fee
                    'FuelOilTax': data["of"],              # fuel surcharge
                    'FromCity': data["acn"],               # departure city
                    'FromCityCode': data["acc"],           # departure city code
                    'FromAirport': data["apbn"],           # departure airport
                    'FromTerminal': data["asmsn"],         # departure terminal
                    'FromDateTime': data["dt"],            # departure time
                    'ToCity': data["dcn"],                 # arrival city
                    'ToCityCode': data["dcc"],             # arrival city code
                    'ToAirport': data["dpbn"],             # arrival airport
                    'ToTerminal': data["dsmsn"],           # arrival terminal
                    'ToDateTime': data["at"],              # arrival time
                    'DurationHour': hours,                 # duration (hours)
                    'DurationMinute': minutes,             # duration (minutes)
                    'Duration': '%dh%dm' % (hours, minutes),  # duration (string)
                    'Currency': None,                      # currency
                    'TicketPrices': data["lp"],            # ticket price
                    'Discount': None,                      # discount applied
                    'PunctualityRate': None,               # on-time rate
                    'AircraftCabin': None,                 # cabin class (F first; C business; Y economy)
                })
            # columns= fixes the column order, which plain dicts do not guarantee on Python 3.5
            df = pd.DataFrame(rows, columns=columns)
            df.to_sql("KKFlight", self.engine, index=False, if_exists='append')
            print("done!~")


if __name__ == "__main__":
    fly = FLIGHT()
    fly.set_url_headers('2018-06-16', '2018-06-16')
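Two small pieces of the script, iterating over a date range and splitting a flight duration into hours and minutes, can be tried on their own. The helper names below are illustrative, and the duration helper uses `total_seconds()` with `divmod`, which is a slight variation on the script's arithmetic rather than a copy of it.

```python
from datetime import datetime, timedelta


def date_range(start, end):
    """Yield every date from start to end (inclusive) as 'YYYY-MM-DD'."""
    day = datetime.strptime(start, "%Y-%m-%d").date()
    last = datetime.strptime(end, "%Y-%m-%d").date()
    while day <= last:
        yield day.isoformat()
        day += timedelta(days=1)


def split_duration(departure, arrival):
    """Return (hours, minutes) between two datetime values."""
    total_minutes = int((arrival - departure).total_seconds() // 60)
    return divmod(total_minutes, 60)


print(list(date_range("2018-06-16", "2018-06-18")))
# An overnight flight: departs 23:30, lands 01:45 the next day.
print(split_duration(datetime(2018, 6, 16, 23, 30), datetime(2018, 6, 17, 1, 45)))
```

The second call returns (2, 15), so a flight that crosses midnight is handled as long as both timestamps carry their dates.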