A Detailed Example of Web Search and Data Extraction in Python

Updated: 2025-08-28 08:50:41  Author: 步子哥

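In today's information age, finding information keeps getting easier, and with a little programming we can quickly implement web search and data extraction ourselves. This article uses Python code to interact with Google, Wikipedia, and other websites to help users retrieve the information they need.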

1. Google search

We first need a search function that can interact with Google. The following code implements it:

def google_search(query: str) -> str:
    """
    Search Google through the Serper API and return the result as a JSON string.
    """
    import os
    import json
    import requests
    # Read the API key from the environment so it never appears in source code.
    SERPER_API_KEY = os.environ.get('SERPER_API_KEY', None)
    if SERPER_API_KEY is None:
        raise Exception('Please set SERPER_API_KEY in environment variable first.')
    url = "https://google.serper.dev/search"
    payload = json.dumps({"q": query})
    headers = {
        'X-API-KEY': SERPER_API_KEY,
        'Content-Type': 'application/json'
    }
    response = requests.post(url, headers=headers, data=payload, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors instead of parsing an error body
    json_data = response.json()
    # ensure_ascii=False keeps non-ASCII text (such as Chinese) readable in the output.
    return json.dumps(json_data, ensure_ascii=False, indent=4)

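In this function we use the requests library to send an HTTP request. We first read the API key from an environment variable so that we can access the Serper API, then build the request URL and payload, and finally return the search results.

Example

Suppose we want to search for "成都人口" (the population of Chengdu). We can call the function above: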

result = google_search('成都 人口')
print(result)
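
The function returns the whole response as a JSON string, so a caller usually has to parse it again to reach the individual hits. The sketch below assumes the response contains an "organic" list whose items carry "title", "link", and "snippet" fields; that layout is an assumption and should be checked against a real response. The helper name extract_organic_results and the sample data are hypothetical.

```python
import json


def extract_organic_results(result_json: str, limit: int = 3) -> list:
    """Parse the JSON string returned by google_search and keep a few fields per hit.

    Assumption: hits live under an "organic" key with "title", "link" and
    "snippet" fields. Missing keys are tolerated and replaced by empty strings.
    """
    data = json.loads(result_json)
    hits = []
    for item in data.get("organic", [])[:limit]:
        hits.append({
            "title": item.get("title", ""),
            "link": item.get("link", ""),
            "snippet": item.get("snippet", ""),
        })
    return hits


# Made-up sample standing in for a real response, so the sketch runs offline.
sample = json.dumps({
    "organic": [
        {"title": "Result A", "link": "https://example.com/a", "snippet": "First hit."},
        {"title": "Result B", "link": "https://example.com/b"},
    ]
})

for hit in extract_organic_results(sample):
    print(hit["title"], hit["link"])
```

With a real key configured, the same helper could be applied to the string returned by google_search('成都 人口').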

2. Wikipedia search

Besides Google search, we can also implement a search function that interacts with Wikipedia. The code is as follows:

def wikipedia_search(query: str) -> str:
    """
    wikipedia search with query, return a result in string
    """
    import requests
    from bs4 import BeautifulSoup

    def get_page_obs(page):
        paragraphs = page.split("\n")
        paragraphs = [p.strip() for p in paragraphs if p.strip()]

        sentences = []
        for p in paragraphs:
            sentences += p.split('. ')
        # rstrip('.') keeps a sentence that already ends with '.' from getting a second one.
        sentences = [s.strip().rstrip('.') + '.' for s in sentences if s.strip()]
        return ' '.join(sentences[:5])

    def clean_str(s):
        return s.replace("\xa0", " ").replace("\n", " ")

    # Let requests URL-encode the query (spaces, non-ASCII text) instead of editing the string by hand.
    search_url = "https://en.wikipedia.org/w/index.php"
    response_text = requests.get(search_url, params={"search": query}, timeout=30).text
    soup = BeautifulSoup(response_text, features="html.parser")
    result_divs = soup.find_all("div", {"class": "mw-search-result-heading"})
    if result_divs:
        result_titles = [clean_str(div.get_text().strip()) for div in result_divs]
        obs = f"Could not find {query}. Similar: {result_titles[:5]}."
    else:
        page = [p.get_text().strip() for p in soup.find_all("p") + soup.find_all("ul")]
        if any("may refer to:" in p for p in page):
            obs = wikipedia_search("[" + query + "]")
        else:
            page_content = ""
            for p in page:
                if len(p.split(" ")) > 2:
                    page_content += ' ' + clean_str(p)
                    if not p.endswith("\n"):
                        page_content += "\n"
            obs = get_page_obs(page_content)
            if not obs:
                obs = None
    return obs

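In this function we use the BeautifulSoup library to parse the Wikipedia search page. If the page is a list of search results, the function reports the five closest titles; otherwise it treats the page as an article and returns its first five sentences.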
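
The sentence-trimming step can be tried without any network access. The sketch below is a standalone variant of the get_page_obs idea; the name first_sentences and the sample text are made up for illustration. It uses the same naive split on ". ", so abbreviations such as "e.g. " would also be treated as sentence boundaries.

```python
def first_sentences(text: str, limit: int = 5) -> str:
    """Return the first `limit` sentences of `text`, in the spirit of get_page_obs."""
    # Drop empty lines and surrounding whitespace.
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    sentences = []
    for p in paragraphs:
        # Naive split on ". "; good enough for a short preview.
        sentences += p.split(". ")
    # rstrip(".") keeps a sentence that already ends with "." from getting a second one.
    sentences = [s.strip().rstrip(".") + "." for s in sentences if s.strip()]
    return " ".join(sentences[:limit])


sample = "Python is a language. It is widely used.\n\nIt was created by Guido van Rossum."
print(first_sentences(sample, limit=2))
```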

Example

To look up information about "Python 語言" (the Python language), we can use the following code:

result = wikipedia_search('Python 語言')
print(result)

3. Fetching HTML content with Selenium

Sometimes the content of a web page is loaded dynamically by JavaScript, and a plain HTTP request cannot retrieve it. In that case we can drive a real browser with Selenium. The following code implements this:

def _web_driver_open(url: str, wait_time=10, scroll_to_bottom=False):
    """
    Open a web page in a browser, wait for it to load, and return the Selenium 4 driver.
    """
    import os
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager
    import time

    CHROME_GRID_URL = os.environ.get('CHROME_GRID_URL', None)
    if CHROME_GRID_URL is not None:
        chrome_options = Options()
        driver = webdriver.Remote(command_executor=CHROME_GRID_URL, options=chrome_options)
    else:
        chrome_options = Options()
        chrome_options.add_argument("--headless")  # Ensure GUI is off
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36")
        webdriver_service = Service(ChromeDriverManager().install())
        driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)

    driver.get(url)
    driver.implicitly_wait(wait_time)
    if scroll_to_bottom:
        last_height = driver.execute_script("return document.body.scrollHeight")
        for _ in range(2):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(3)
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break
            last_height = new_height
    return driver

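In this function we initialize the Selenium WebDriver and open the given URL. If CHROME_GRID_URL is set, a remote browser is used; otherwise a local headless Chrome is started. Optionally, the page is scrolled to the bottom up to two times so that lazily loaded content has a chance to appear.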
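
The scrolling logic can be separated from the browser so that it is easy to test. The following is a minimal sketch, not part of the original code: scroll_until_stable takes two callables (one that reads the page height, one that scrolls) and FakePage is a hypothetical stand-in for a real page.

```python
import time
from typing import Callable


def scroll_until_stable(get_height: Callable[[], int],
                        scroll_down: Callable[[], None],
                        max_rounds: int = 2,
                        pause: float = 3.0) -> int:
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = get_height()
    rounds = 0
    for _ in range(max_rounds):
        scroll_down()
        rounds += 1
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = get_height()
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return rounds


class FakePage:
    """Hypothetical stand-in for a browser page that grows twice, then stops."""

    def __init__(self):
        self.heights = [1000, 1800, 2400, 2400]
        self.index = 0

    def height(self) -> int:
        return self.heights[self.index]

    def scroll(self) -> None:
        self.index = min(self.index + 1, len(self.heights) - 1)


page = FakePage()
print(scroll_until_stable(page.height, page.scroll, max_rounds=5, pause=0))
```

With a real driver, the two callables would wrap the same execute_script calls used in _web_driver_open: one returning document.body.scrollHeight and one calling window.scrollTo.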

Getting the HTML content

The following function returns the cleaned-up HTML of the page:

def _web_driver_get_html(driver) -> str:
    """
    Return the cleaned HTML content (without scripts, styles, and comments) of a Selenium 4 driver; the driver should already have loaded the page.
    """
    from bs4 import BeautifulSoup, Comment
    from urllib.parse import urljoin
    url = driver.current_url
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    
    for script_or_style in soup(['script', 'style']):
        script_or_style.decompose()
    for comment in soup(string=lambda text: isinstance(text, Comment)):
        comment.extract()
    
    for tag in soup(['head', 'meta', 'link', 'title', 'noscript', 'iframe', 'svg', 'canvas', 'audio', 'video', 'embed', 'object', 'param', 'source', 'track', 'map', 'area', 'base', 'basefont', 'bdi', 'bdo', 'br', 'col', 'colgroup', 'datalist', 'details', 'dialog', 'hr', 'img', 'input', 'keygen', 'label', 'legend', 'meter', 'optgroup', 'option', 'output', 'progress', 'select', 'textarea']):
        tag.decompose()
    
    for tag in soup(['div', 'span']):
        tag.attrs = {}
    
    for a in soup.find_all('a', href=True):
        a['href'] = urljoin(url, a['href'])
    
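    # 'img' is in the removal list above, so no <img> tags remain at this point;
    # this loop only takes effect if 'img' is dropped from that list.
    for img in soup.find_all('img', src=True):
        img['src'] = urljoin(url, img['src'])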
    
    html = str(soup)
    return html

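Example

We can fetch the cleaned HTML of a page by opening it with _web_driver_open and passing the driver to _web_driver_get_html:

driver = _web_driver_open('https://example.com')
html_content = _web_driver_get_html(driver)
driver.quit()
print(html_content)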
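
The link-rewriting step in _web_driver_get_html relies on urljoin from the standard library. A short sketch of how it resolves different kinds of href values against the page URL; the base URL and the sample links are made up for illustration:

```python
from urllib.parse import urljoin

# Hypothetical page URL; in _web_driver_get_html this is driver.current_url.
base = "https://example.com/docs/page.html"

examples = ["/about", "next.html", "../img/a.png", "#top", "https://other.org/x"]
for href in examples:
    # Relative links are resolved against the page URL; absolute ones pass through.
    print(href, "->", urljoin(base, href))
```

Making links absolute means the cleaned HTML can still be followed after it has been detached from the page it came from.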

4. Getting the text content of a page

If only the text of the page is needed, the following function can be used:

def web_get_text(url:str, wait_time=10, scroll_to_bottom=True):
    """
    Get the text content of a web page.
    """
    import logging
    driver = None
    try:
        driver = _web_driver_open(url, wait_time, scroll_to_bottom)
        text = driver.execute_script("return document.body.innerText")
        return text
    except Exception as e:
        logging.exception(e)
        return 'Some Error Occurs:\n' + str(e)
    finally:
        if driver is not None:
            driver.quit()

Example

Call the function to get the text content of a page:

text_content = web_get_text('https://example.com')
print(text_content)
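
Text read from document.body.innerText often contains stray whitespace and long runs of blank lines. If that matters downstream, a small post-processing step can help. The sketch below is an optional addition, not part of the original code; the name tidy_text and the max_chars cap are assumptions chosen for illustration.

```python
import re


def tidy_text(text: str, max_chars: int = 2000) -> str:
    """Strip whitespace around each line, collapse blank-line runs, cap the length."""
    lines = [line.strip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    # Replace three or more consecutive newlines with a single blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned).strip()
    return cleaned[:max_chars]


raw = "  Title  \n\n\n\nFirst paragraph.   \n   \n\nSecond paragraph.\n"
print(tidy_text(raw))
```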

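Conclusion

The code examples above show how to implement web search and data extraction in Python: querying Google through the Serper API, searching Wikipedia, and using Selenium to read the HTML or text of dynamically rendered pages. These techniques apply broadly to information gathering, data analysis, and similar tasks.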
