A Complete Guide to Extracting Web Data with BeautifulSoup in Python
Introduction: Why Is BeautifulSoup the "Swiss Army Knife" of Web Data Extraction?
Imagine you are holding a thick phone book and need to find the numbers of everyone surnamed "Zhang". Searching page by page by hand would take forever; a smart assistant that could instantly locate and pull out every matching entry would be far more efficient.
BeautifulSoup is exactly that kind of "smart assistant": it extracts precisely the data you need from complex HTML pages. Like a Swiss Army knife, it is powerful yet simple to use, and it is a tool every Python developer should master.
Part 1: Core Concepts of BeautifulSoup
1.1 What Is BeautifulSoup?
BeautifulSoup is a Python library for extracting data from HTML and XML documents. It turns even a complicated HTML document into a tree structure in which every node is a Python object.
from bs4 import BeautifulSoup
import requests

# Fetch the page
url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Now you can work with the HTML as if it were ordinary Python objects
title = soup.title.text
print(f"Page title: {title}")
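To make "every node is a Python object" concrete, here is a minimal sketch of navigating the parse tree; the tiny HTML snippet is invented purely for illustration:

from bs4 import BeautifulSoup

html = "<html><body><div><p>First</p><p>Second</p></div></body></html>"
soup = BeautifulSoup(html, 'html.parser')

p = soup.p                              # the first <p> tag, a Tag object
print(p.name)                           # 'p'
print(p.parent.name)                    # 'div' -- move up the tree
print(p.find_next_sibling('p').text)    # 'Second' -- move sideways
for child in soup.body.div.children:    # iterate over direct children
    print(child.name, child.text)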
1.2 Key Advantages of BeautifulSoup
1. Strong fault tolerance
BeautifulSoup copes with all kinds of malformed HTML. Like an experienced doctor, it can still make an accurate diagnosis when a page's "symptoms" are messy (a short demonstration follows these three points).
2. Intuitive API design
The API is designed to be human-friendly; reading the code feels almost like reading English.
3. Flexible parser support
It supports several parsers, so you can pick the one that best fits your needs.
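As a quick illustration of the fault tolerance mentioned in point 1, the sketch below feeds deliberately broken HTML (an unquoted attribute and unclosed tags) to BeautifulSoup; the exact repaired tree can differ between parsers, but parsing does not fail and the data is still reachable:

from bs4 import BeautifulSoup

# Deliberately malformed HTML: unquoted attribute, unclosed <b>, missing </p> and </body>
broken_html = "<html><body><p class=intro>Hello <b>world"

soup = BeautifulSoup(broken_html, 'html.parser')
p = soup.find('p', class_='intro')
print(p.text)           # "Hello world" -- the text is still recoverable
print(soup.find('b'))   # <b>world</b> -- the unclosed tag was repaired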
Part 2: Choosing the Right Parser
2.1 Comparing the Parsers
BeautifulSoup supports several parsers, each with its own characteristics:
from bs4 import BeautifulSoup

html_doc = """
<html>
  <head><title>Test page</title></head>
  <body>
    <p class="story">This is a paragraph</p>
  </body>
</html>
"""

# Python's built-in parser (recommended when getting started)
soup1 = BeautifulSoup(html_doc, 'html.parser')

# lxml parser (recommended for production)
soup2 = BeautifulSoup(html_doc, 'lxml')

# html5lib parser (most accurate, but slowest)
soup3 = BeautifulSoup(html_doc, 'html5lib')
2.2 Recommendations for Choosing a Parser
- Learning and development: use html.parser; it ships with Python, so there is nothing extra to install
- Production: use lxml; it is fast and full-featured
- Strict HTML5 conformance: use html5lib; it is the most accurate
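Note that lxml and html5lib are third-party packages; assuming a standard pip setup, installing them is a one-liner:

pip install lxml html5lib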
Part 3: The Art of Locating Elements
3.1 Basic Locating Methods
BeautifulSoup offers several ways to locate elements, each as precise as a GPS fix:
from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <div class="container">
      <h1 id="main-title">News headline</h1>
      <p class="content">First paragraph of the story</p>
      <p class="content">Second paragraph of the story</p>
      <a href="https://example.com/related" class="link">Related link</a>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# 1. Locate by tag name
title = soup.h1
print(f"Title: {title.text}")

# 2. Locate by ID
main_title = soup.find('h1', id='main-title')
print(f"Main title: {main_title.text}")

# 3. Locate by class name
content_list = soup.find_all('p', class_='content')
for content in content_list:
    print(f"Content: {content.text}")

# 4. Locate by attribute (any attribute can be passed as a keyword argument)
link = soup.find('a', href="https://example.com/related")
print(f"Link text: {link.text}")
print(f"Link URL: {link['href']}")
3.2 Advanced Locating Techniques
CSS selectors: precision targeting
CSS selectors are like GPS coordinates: they can pinpoint any element exactly:
# CSS selector examples
soup = BeautifulSoup(html, 'html.parser')

# Class selector
contents = soup.select('.content')

# ID selector
title = soup.select('#main-title')[0]

# Descendant selector
container_p = soup.select('div.container p')

# Attribute selector
external_links = soup.select('a[href^="http"]')

# Pseudo-class selector
first_p = soup.select('p:first-child')
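A small design note: when only the first match matters, select_one is a cleaner alternative to indexing the list that select returns, because it gives back None instead of raising an IndexError when nothing matches:

title = soup.select_one('#main-title')   # first match, or None
if title is not None:
    print(title.text)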
Regular expressions: fuzzy matching
Sometimes we need fuzzy matching, and regular expressions are the right tool for the job:
import re

# Match attribute values and text with regular expressions
email_links = soup.find_all('a', href=re.compile(r'mailto:'))
phone_numbers = soup.find_all(string=re.compile(r'\d{3}-\d{4}-\d{4}'))
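Beyond regular expressions, find_all also accepts a plain function as a filter, which helps when the matching logic is awkward to express as a pattern. A minimal sketch, with an invented rule just for illustration:

def looks_like_short_external_link(tag):
    # True for <a> tags whose href starts with http and whose visible text is short
    return (
        tag.name == 'a'
        and tag.has_attr('href')
        and tag['href'].startswith('http')
        and len(tag.get_text(strip=True)) < 10
    )

short_links = soup.find_all(looks_like_short_external_link)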
Part 4: Practical Data Extraction Techniques
4.1 The Art of Text Extraction
from bs4 import BeautifulSoup
import requests

def extract_news_data(url):
    """
    Example: extracting data from a news article
    """
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract the headline
    title = soup.find('h1', class_='article-title')
    title_text = title.text.strip() if title else "No title"

    # Extract the publication time
    time_elem = soup.find('time')
    publish_time = time_elem.get('datetime') if time_elem else "Unknown time"

    # Extract the body text
    content_divs = soup.find_all('div', class_='article-content')
    content = '\n'.join([div.text.strip() for div in content_divs])

    # Extract image URLs
    images = []
    for img in soup.find_all('img'):
        src = img.get('src')
        if src:
            # Resolve protocol-relative and root-relative links
            if src.startswith('//'):
                src = 'https:' + src
            elif src.startswith('/'):
                src = 'https://example.com' + src
            images.append(src)

    return {
        'title': title_text,
        'publish_time': publish_time,
        'content': content,
        'images': images
    }
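A hedged usage sketch of the function above; the URL is hypothetical, and the class names it looks for (article-title, article-content) are assumptions about the target site's markup rather than anything universal:

# Hypothetical article URL; replace it with a page whose markup matches the selectors above
news = extract_news_data('https://example.com/news/12345')
print(news['title'])
print(news['publish_time'])
print(f"{len(news['images'])} images found")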
4.2 Handling Complex HTML Structures
Real-world pages are often structurally messy and call for more careful handling:
import re
from bs4 import BeautifulSoup

def extract_product_info(html):
    """
    Example: extracting product information from an e-commerce page
    """
    soup = BeautifulSoup(html, 'html.parser')
    product_info = {}

    # Extract the product name
    name_elem = soup.find('h1', class_='product-name')
    product_info['name'] = name_elem.text.strip() if name_elem else ""

    # Extract the price (tolerating several price formats)
    price_elem = soup.find('span', class_='price')
    if price_elem:
        price_text = price_elem.text
        # Pull the number out with a regular expression
        price_match = re.search(r'[\d,]+\.?\d*', price_text)
        product_info['price'] = float(price_match.group().replace(',', '')) if price_match else 0

    # Extract the specification table
    specs = {}
    spec_table = soup.find('table', class_='specifications')
    if spec_table:
        for row in spec_table.find_all('tr'):
            cells = row.find_all(['td', 'th'])
            if len(cells) >= 2:
                key = cells[0].text.strip()
                value = cells[1].text.strip()
                specs[key] = value
    product_info['specifications'] = specs

    # Extract review data
    reviews = []
    review_elements = soup.find_all('div', class_='review-item')
    for review in review_elements:
        rating_elem = review.find('span', class_='rating')
        content_elem = review.find('p', class_='review-content')
        if rating_elem and content_elem:
            reviews.append({
                'rating': len(rating_elem.find_all('span', class_='star-filled')),
                'content': content_elem.text.strip()
            })
    product_info['reviews'] = reviews

    return product_info
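A small self-contained check of extract_product_info; the HTML snippet is invented and simply mirrors the class names the function expects:

sample_html = """
<h1 class="product-name">Wireless Mouse</h1>
<span class="price">$ 1,299.00</span>
<table class="specifications">
  <tr><th>Weight</th><td>85 g</td></tr>
  <tr><th>Battery</th><td>AA x 1</td></tr>
</table>
"""

info = extract_product_info(sample_html)
print(info['name'])            # Wireless Mouse
print(info['price'])           # 1299.0
print(info['specifications'])  # {'Weight': '85 g', 'Battery': 'AA x 1'}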
Part 5: Efficient Data Processing Techniques
5.1 Batch Processing and Performance Optimization
When you need to process data at scale, performance optimization becomes critical:
import concurrent.futures
import time
from typing import List, Dict

import requests
from bs4 import BeautifulSoup

class WebScraper:
    def __init__(self, max_workers: int = 5):
        self.max_workers = max_workers
        self.session = requests.Session()
        # Set a common User-Agent header
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })

    def fetch_single_page(self, url: str) -> Dict:
        """
        Fetch and parse a single page
        """
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'lxml')
            # Extract the data
            return self.extract_page_data(soup, url)
        except Exception as e:
            print(f"Error while processing {url}: {e}")
            return {'url': url, 'error': str(e)}

    def extract_page_data(self, soup: BeautifulSoup, url: str) -> Dict:
        """
        Extract data from a soup object
        """
        title = soup.find('title')
        title_text = title.text.strip() if title else ""

        # Collect all links
        links = []
        for link in soup.find_all('a', href=True):
            href = link['href']
            text = link.text.strip()
            if href and text:
                links.append({'url': href, 'text': text})

        return {
            'url': url,
            'title': title_text,
            'links': links,
            'link_count': len(links)
        }

    def batch_scrape(self, urls: List[str]) -> List[Dict]:
        """
        Scrape a batch of URLs concurrently
        """
        results = []
        with concurrent.futures.ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            # Submit all tasks
            future_to_url = {executor.submit(self.fetch_single_page, url): url for url in urls}
            # Collect results as they complete
            for future in concurrent.futures.as_completed(future_to_url):
                result = future.result()
                results.append(result)
                print(f"Finished: {result.get('url', 'Unknown')}")
        return results

# Usage example
scraper = WebScraper(max_workers=3)
urls = [
    'https://example1.com',
    'https://example2.com',
    'https://example3.com'
]
results = scraper.batch_scrape(urls)
5.2 Data Cleaning and Formatting
Extracted data usually needs further cleaning:
import re
from datetime import datetime
from typing import Dict, List

class DataCleaner:
    @staticmethod
    def clean_text(text: str) -> str:
        """
        Clean up extracted text
        """
        if not text:
            return ""
        # Collapse runs of whitespace
        text = re.sub(r'\s+', ' ', text)
        # Replace a few common leftover HTML entities (html.unescape covers the general case)
        text = text.replace('&nbsp;', ' ')
        text = text.replace('&lt;', '<')
        text = text.replace('&gt;', '>')
        text = text.replace('&amp;', '&')
        return text.strip()

    @staticmethod
    def extract_numbers(text: str) -> List[float]:
        """
        Pull all numbers out of a piece of text
        """
        numbers = re.findall(r'\d+\.?\d*', text)
        return [float(num) for num in numbers]

    @staticmethod
    def parse_date(date_string: str) -> datetime:
        """
        Parse a date in any of several common formats
        """
        date_patterns = [
            '%Y-%m-%d',
            '%Y/%m/%d',
            '%d-%m-%Y',
            '%d/%m/%Y',
            '%Y-%m-%d %H:%M:%S'
        ]
        for pattern in date_patterns:
            try:
                return datetime.strptime(date_string.strip(), pattern)
            except ValueError:
                continue
        raise ValueError(f"Unable to parse date: {date_string}")

# Usage example
cleaner = DataCleaner()

def process_scraped_data(raw_data: Dict) -> Dict:
    """
    Post-process the raw scraped data
    """
    processed = {}

    # Clean the title
    processed['title'] = cleaner.clean_text(raw_data.get('title', ''))

    # Extract and clean the price
    price_text = raw_data.get('price_text', '')
    prices = cleaner.extract_numbers(price_text)
    processed['price'] = prices[0] if prices else 0.0

    # Parse the date
    date_text = raw_data.get('date', '')
    try:
        processed['date'] = cleaner.parse_date(date_text)
    except ValueError:
        processed['date'] = None

    return processed
Part 6: A Hands-On Project
6.1 A News Aggregator
Let's build a complete news aggregator:
import json
import sqlite3
import time
from dataclasses import dataclass
from typing import Dict, List

import requests
from bs4 import BeautifulSoup

@dataclass
class NewsArticle:
    title: str
    content: str
    url: str
    publish_time: str
    source: str
    tags: List[str]

class NewsAggregator:
    def __init__(self, db_path: str = 'news.db'):
        self.db_path = db_path
        self.init_database()

    def init_database(self):
        """
        Initialize the database
        """
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS articles (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                title TEXT NOT NULL,
                content TEXT,
                url TEXT UNIQUE,
                publish_time TEXT,
                source TEXT,
                tags TEXT,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        ''')
        conn.commit()
        conn.close()

    def scrape_news_site(self, base_url: str, site_config: Dict) -> List[NewsArticle]:
        """
        Scrape a news site according to its configuration
        """
        articles = []
        try:
            response = requests.get(base_url)
            soup = BeautifulSoup(response.content, 'lxml')

            # Extract article links using the configured selector
            article_links = soup.select(site_config['article_selector'])

            for link in article_links[:10]:  # limit how many articles we fetch
                article_url = link.get('href')
                if not article_url.startswith('http'):
                    article_url = base_url + article_url

                # Fetch the individual article
                article = self.scrape_article(article_url, site_config)
                if article:
                    articles.append(article)

                # Be polite: avoid hammering the site
                time.sleep(1)
        except Exception as e:
            print(f"Failed to scrape {base_url}: {e}")

        return articles

    def scrape_article(self, url: str, config: Dict) -> NewsArticle:
        """
        Scrape a single article
        """
        try:
            response = requests.get(url)
            soup = BeautifulSoup(response.content, 'lxml')

            # Title
            title_elem = soup.select_one(config['title_selector'])
            title = title_elem.text.strip() if title_elem else ""

            # Body
            content_elems = soup.select(config['content_selector'])
            content = '\n'.join([elem.text.strip() for elem in content_elems])

            # Publication time
            time_elem = soup.select_one(config.get('time_selector', ''))
            publish_time = time_elem.text.strip() if time_elem else ""

            # Tags
            tag_elems = soup.select(config.get('tag_selector', ''))
            tags = [tag.text.strip() for tag in tag_elems]

            return NewsArticle(
                title=title,
                content=content,
                url=url,
                publish_time=publish_time,
                source=config['source_name'],
                tags=tags
            )
        except Exception as e:
            print(f"Failed to scrape article {url}: {e}")
            return None

    def save_articles(self, articles: List[NewsArticle]):
        """
        Save articles to the database
        """
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()

        for article in articles:
            try:
                cursor.execute('''
                    INSERT OR IGNORE INTO articles
                    (title, content, url, publish_time, source, tags)
                    VALUES (?, ?, ?, ?, ?, ?)
                ''', (
                    article.title,
                    article.content,
                    article.url,
                    article.publish_time,
                    article.source,
                    json.dumps(article.tags)
                ))
            except Exception as e:
                print(f"Failed to save article: {e}")

        conn.commit()
        conn.close()

# Usage example
aggregator = NewsAggregator()

# Configuration for different news sites
sites_config = {
    'tech_news': {
        'url': 'https://technews.example.com',
        'source_name': 'Tech News',
        'article_selector': 'a.article-link',
        'title_selector': 'h1.article-title',
        'content_selector': 'div.article-content p',
        'time_selector': 'time.publish-time',
        'tag_selector': 'span.tag'
    }
}

# Scrape and store the news
for site_name, config in sites_config.items():
    print(f"Scraping {site_name}...")
    articles = aggregator.scrape_news_site(config['url'], config)
    aggregator.save_articles(articles)
    print(f"Done with {site_name}: {len(articles)} articles fetched")
6.2 Error Handling and Retry Mechanisms
In real projects network requests fail all the time, so we need a solid error-handling and retry strategy:
import random
import time
from functools import wraps

import requests
from bs4 import BeautifulSoup

def retry_on_failure(max_retries: int = 3, delay: float = 1.0):
    """
    Decorator that retries a function on failure with exponential backoff
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    if attempt < max_retries:
                        wait_time = delay * (2 ** attempt) + random.uniform(0, 1)
                        print(f"Attempt {attempt + 1} failed, retrying in {wait_time:.2f}s...")
                        time.sleep(wait_time)
                    else:
                        print(f"All retries failed; last error: {e}")
                        raise last_exception
        return wrapper
    return decorator

class RobustScraper:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })

    @retry_on_failure(max_retries=3, delay=1.0)
    def fetch_page(self, url: str) -> BeautifulSoup:
        """
        Fetch a page, with the retry mechanism applied
        """
        response = self.session.get(url, timeout=10)
        response.raise_for_status()
        return BeautifulSoup(response.content, 'lxml')

    def safe_extract_text(self, soup: BeautifulSoup, selector: str, default: str = "") -> str:
        """
        Extract text safely, tolerating missing elements
        """
        try:
            element = soup.select_one(selector)
            return element.text.strip() if element else default
        except Exception as e:
            print(f"Failed to extract text ({selector}): {e}")
            return default

    def safe_extract_attr(self, soup: BeautifulSoup, selector: str, attr: str, default: str = "") -> str:
        """
        Extract an attribute value safely
        """
        try:
            element = soup.select_one(selector)
            return element.get(attr, default) if element else default
        except Exception as e:
            print(f"Failed to extract attribute ({selector}, {attr}): {e}")
            return default
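A brief, hedged usage sketch of RobustScraper; the URL and selectors below are placeholders:

scraper = RobustScraper()
try:
    soup = scraper.fetch_page('https://example.com/article/1')   # retried up to 3 times on failure
    headline = scraper.safe_extract_text(soup, 'h1.article-title', default='(no title)')
    canonical = scraper.safe_extract_attr(soup, 'link[rel="canonical"]', 'href')
    print(headline, canonical)
except Exception as e:
    print(f"Giving up on the page: {e}")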
Part 7: Performance Optimization and Best Practices
7.1 Memory Optimization Techniques
When processing large volumes of data, memory management becomes critical:
import gc
from contextlib import contextmanager

from bs4 import BeautifulSoup

@contextmanager
def memory_efficient_parsing(html_content: str, parser: str = 'lxml'):
    """
    Context manager for memory-conscious HTML parsing
    """
    soup = None
    try:
        soup = BeautifulSoup(html_content, parser)
        yield soup
    finally:
        if soup:
            soup.decompose()  # free the parse tree
            del soup
        gc.collect()  # force a garbage-collection pass

def process_large_html_file(file_path: str):
    """
    Example: processing a large HTML file
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        html_content = f.read()

    with memory_efficient_parsing(html_content) as soup:
        # Extract only the data we need
        results = []
        for element in soup.find_all('div', class_='data-item'):
            data = {
                'id': element.get('id'),
                'text': element.text.strip()
            }
            results.append(data)

            # Periodically release elements that have already been processed
            if len(results) % 1000 == 0:
                element.decompose()

    return results
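A hedged usage sketch; the file path and the data-item class name are placeholders for whatever your saved pages actually contain:

# Hypothetical local dump of a large page
items = process_large_html_file('saved_pages/catalog.html')
print(f"Extracted {len(items)} items")
print(items[:3])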
7.2 Optimizing Concurrent Processing
import asyncio
from typing import Dict, List

import aiohttp
from aiohttp import ClientSession
from bs4 import BeautifulSoup

class AsyncScraper:
    def __init__(self, max_concurrent: int = 10):
        self.max_concurrent = max_concurrent
        self.semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_page(self, session: ClientSession, url: str) -> Dict:
        """
        Fetch a page asynchronously
        """
        async with self.semaphore:
            try:
                async with session.get(url) as response:
                    if response.status == 200:
                        html = await response.text()
                        return await self.parse_page(html, url)
                    else:
                        return {'url': url, 'error': f'HTTP {response.status}'}
            except Exception as e:
                return {'url': url, 'error': str(e)}

    async def parse_page(self, html: str, url: str) -> Dict:
        """
        Parse the page without blocking the event loop (runs in a thread pool)
        """
        loop = asyncio.get_event_loop()
        return await loop.run_in_executor(None, self._parse_html, html, url)

    def _parse_html(self, html: str, url: str) -> Dict:
        """
        Synchronous HTML parsing helper
        """
        soup = BeautifulSoup(html, 'lxml')
        title = soup.find('title')
        title_text = title.text.strip() if title else ""
        return {
            'url': url,
            'title': title_text,
            'success': True
        }

    async def scrape_urls(self, urls: List[str]) -> List[Dict]:
        """
        Scrape a batch of URLs concurrently
        """
        async with aiohttp.ClientSession() as session:
            tasks = [self.fetch_page(session, url) for url in urls]
            results = await asyncio.gather(*tasks, return_exceptions=True)

            # Normalize any raised exceptions into error records
            processed_results = []
            for result in results:
                if isinstance(result, Exception):
                    processed_results.append({'error': str(result)})
                else:
                    processed_results.append(result)
            return processed_results

# Usage example
async def main():
    scraper = AsyncScraper(max_concurrent=5)
    urls = [f'https://example.com/page/{i}' for i in range(1, 21)]
    results = await scraper.scrape_urls(urls)

    successful = [r for r in results if r.get('success')]
    failed = [r for r in results if 'error' in r]
    print(f"Succeeded: {len(successful)}, failed: {len(failed)}")

# Run the async code
# asyncio.run(main())
Part 8: Common Problems and Solutions
8.1 Handling Encoding Issues
import chardet
import requests
from bs4 import BeautifulSoup

def smart_decode(content: bytes) -> str:
    """
    Decode HTML content with encoding detection
    """
    # Try to detect the encoding first
    detected = chardet.detect(content)
    encoding = detected.get('encoding') or 'utf-8'

    try:
        return content.decode(encoding)
    except UnicodeDecodeError:
        # If detection fails, fall back to a list of common encodings
        encodings = ['utf-8', 'gbk', 'gb2312', 'big5', 'latin1']
        for enc in encodings:
            try:
                return content.decode(enc)
            except UnicodeDecodeError:
                continue
        # As a last resort, ignore undecodable bytes
        return content.decode('utf-8', errors='ignore')

# Usage example
response = requests.get('https://example.com')
html_content = smart_decode(response.content)
soup = BeautifulSoup(html_content, 'lxml')
8.2 Handling Dynamic Content
Some sites load their content with JavaScript, which BeautifulSoup cannot execute on its own:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class DynamicContentScraper:
    def __init__(self, headless: bool = True):
        options = webdriver.ChromeOptions()
        if headless:
            options.add_argument('--headless')
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')

        self.driver = webdriver.Chrome(options=options)
        self.wait = WebDriverWait(self.driver, 10)

    def scrape_dynamic_page(self, url: str) -> BeautifulSoup:
        """
        Scrape a page whose content is loaded dynamically
        """
        self.driver.get(url)

        # Wait until the target element has been rendered
        self.wait.until(
            EC.presence_of_element_located((By.CLASS_NAME, "dynamic-content"))
        )

        # Hand the fully rendered HTML over to BeautifulSoup
        html = self.driver.page_source
        return BeautifulSoup(html, 'lxml')

    def close(self):
        """
        Shut down the browser
        """
        self.driver.quit()

# Usage example
scraper = DynamicContentScraper()
try:
    soup = scraper.scrape_dynamic_page('https://dynamic-example.com')
    # Now BeautifulSoup can work with the dynamically loaded content
    data = soup.find_all('div', class_='dynamic-content')
finally:
    scraper.close()
Conclusion: Mastering the Art of BeautifulSoup
Having worked through this guide, you now have the core BeautifulSoup skills:
- Understanding what HTML parsing really is: from the document tree to element location
- Mastering data extraction: from basic lookups to advanced CSS selectors
- Optimizing performance: from single-threaded scripts to asynchronous, concurrent processing
- Building good habits: from error handling to memory management
BeautifulSoup is more than a tool; it is a way of thinking. It teaches you to analyze and process structured data systematically, a skill that pays off in data science, web scraping, automated testing, and many other fields.
Remember that mastery comes from practice. Pick a site you care about, apply the techniques covered here, and build your own data extraction project. When you hit a problem, come back to the relevant section; it will read differently the second time.
Finally, as web technology evolves, page structures keep changing. Stay curious and keep up with new developments, and you will go much further on the road of data extraction.