How to Implement Simple HTML Table Parsing in Python
This article walks through a working example of simple HTML table parsing in Python, shared for your reference. The analysis is as follows:
The code depends on libxml2dom, so make sure it is installed first. Import it into your script and call the parse_tables() function, which takes three arguments:
1. source = a string containing the HTML source code; you can pass in just the table or the entire page.
2. headers = a list of ints OR a list of strings.
If the entries are ints, the table is assumed to have no header row; list the 0-based indexes of the columns from which you want to extract data.
If the entries are strings, the table is assumed to have header cells (<th> tags), and data will be pulled from the columns whose header text matches.
3. table_index = the 0-based index of the table in the source code. If there are multiple tables and the one you want to parse is the third table in the code, pass in 2 here.
The function returns a list of lists; each inner list contains the parsed data from one row, as the usage sketch below illustrates.
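For instance, assuming a page whose first table looks like the sample below (the sample HTML and the expected results are my illustration, not from the original article), the two calling conventions would look like this:

html = """
<table>
  <tr><th>Name</th><th>Age</th><th>City</th></tr>
  <tr><td>Alice</td><td>30</td><td>Boston</td></tr>
  <tr><td>Bob</td><td>25</td><td>Denver</td></tr>
</table>
"""
#string headers: match <th> text and skip the header row
print parse_tables(html, ['Name', 'City'], 0)
#should print [['Alice', 'Boston'], ['Bob', 'Denver']]

#int headers: take columns 0 and 2 from every <tr>; the <th>-only
#header row contributes an empty inner list
print parse_tables(html, [0, 2], 0)
#should print [[], ['Alice', 'Boston'], ['Bob', 'Denver']]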
The complete code is as follows:
#The goal of the table parser is to get specific information from
#specific columns in a table.
#Input: source code from a typical website
#Arguments: a list of headers the user wants returned
#Output: a list of lists of the data in each row
import libxml2dom

def parse_tables(source, headers, table_index):
    """parse_tables(string source, list headers, int table_index)

    headers may be a list of strings if the table has headers defined,
    or a list of ints (0-based column indexes) if it does not.
    Returns a list of lists.
    """
    #Determine whether the headers list holds ints or strings and
    #route to the correct function
    print 'Printing headers: ', headers
    if isinstance(headers[0], int):
        #int headers: the table has no header row
        return no_header(source, headers, table_index)
    elif isinstance(headers[0], str):
        #string headers: match against the table's <th> cells
        return header_given(source, headers, table_index)
    else:
        #return None if the headers aren't of a supported type
        return None

#This function takes in the source code of the whole page, a string list
#of headers, and the index number of the table on the page. It returns
#a list of lists with the scraped information.
def header_given(source, headers, table_index):
    #initiate a list to hold the return list
    return_list = []
    #initiate a list to hold the index numbers of the data in the rows
    header_index = []
    #get a document object out of the source code
    doc = libxml2dom.parseString(source, html=1)
    #get the tables from the document
    tables = doc.getElementsByTagName('table')
    try:
        #try to get focus on the desired table
        main_table = tables[table_index]
    except IndexError:
        #if the table doesn't exist, return an error
        return ['The table index was not found']
    #get a list of headers in the table
    table_headers = main_table.getElementsByTagName('th')
    #need a sentry value for the header loop
    loop_sentry = 0
    #loop through each header looking for matches
    for header in table_headers:
        #if the header is in the desired headers list
        if header.textContent in headers:
            #record this column's index
            header_index.append(loop_sentry)
        loop_sentry += 1
    #get the rows from the table
    rows = main_table.getElementsByTagName('tr')
    #sentry value detecting whether the first (header) row is being viewed
    row_sentry = 0
    #loop through the rows in the table, skipping the first row
    for row in rows:
        if row_sentry == 0:
            #make the row_sentry not 0 so later rows are processed
            row_sentry = 1337
            continue
        #get all cells from the current row
        cells = row.getElementsByTagName('td')
        #initiate a list to append into the return_list
        cell_list = []
        #iterate through all of the matched column indexes
        for i in header_index:
            #append the cell's text content to the cell_list
            cell_list.append(cells[i].textContent)
        return_list.append(cell_list)
    return return_list

#This function takes in the source code of the whole page, an int list
#of 0-based column indexes, and the index number of the table on the
#page. It returns a list of lists with the scraped info.
def no_header(source, headers, table_index):
    #initiate a list to hold the return list
    return_list = []
    #get a document object out of the source code
    doc = libxml2dom.parseString(source, html=1)
    #get the tables from the document
    tables = doc.getElementsByTagName('table')
    try:
        #try to get focus on the desired table
        main_table = tables[table_index]
    except IndexError:
        #if the table doesn't exist, return an error
        return ['The table index was not found']
    #get all of the rows out of the main_table
    rows = main_table.getElementsByTagName('tr')
    #loop through each row
    for row in rows:
        #get all cells from the current row
        cells = row.getElementsByTagName('td')
        #initiate a list to append into the return_list
        cell_list = []
        #loop through the list of desired column indexes
        for i in headers:
            try:
                #try to add text from the cell into the cell_list
                cell_list.append(cells[i].textContent)
            except IndexError:
                #if the row has no cell at this index, skip it
                continue
        return_list.append(cell_list)
    return return_list
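Note that the code above targets Python 2: the print statement is Python 2 syntax, and libxml2dom itself is a Python 2-era library. Also note the error convention: if table_index is out of range, both functions return a one-element list containing an error message rather than raising an exception, so callers should check for that sentinel before treating the result as row data.

If libxml2dom is not available, roughly equivalent behaviour can be had from the standard library alone. The sketch below is my own illustration using Python 3's html.parser (the names SimpleTableParser and parse_table_stdlib are hypothetical, not from the original article); it mimics no_header() by selecting columns by 0-based index:

from html.parser import HTMLParser  #stdlib; no third-party dependency

class SimpleTableParser(HTMLParser):
    """Collects <td> text into self.tables[table][row][cell]."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.tables = []        #one entry per <table> in the document
        self.current_row = None
        self.in_td = False
        self.current_cell = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.tables.append([])
        elif tag == 'tr' and self.tables:
            self.current_row = []
        elif tag == 'td' and self.current_row is not None:
            self.in_td = True
            self.current_cell = ''

    def handle_endtag(self, tag):
        if tag == 'td' and self.in_td:
            self.current_row.append(self.current_cell.strip())
            self.in_td = False
        elif tag == 'tr' and self.current_row is not None:
            self.tables[-1].append(self.current_row)
            self.current_row = None

    def handle_data(self, data):
        if self.in_td:
            self.current_cell += data

def parse_table_stdlib(source, column_indexes, table_index):
    parser = SimpleTableParser()
    parser.feed(source)
    try:
        table = parser.tables[table_index]
    except IndexError:
        return ['The table index was not found']
    #keep only the requested columns, skipping rows that are too short
    return [[row[i] for i in column_indexes if i < len(row)]
            for row in table]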
Hopefully this article is of some help to readers working on Python programming.