A Detailed Guide to Scraping Web News on a Schedule with Python, Storing It in a Database, and Sending It by Email

Updated: 2020-11-27 10:47:04  Author: Andyren0126

This article walks through a simple Python program that scrapes news from a web page on a schedule, stores it in a database, and sends it by email. The example code is explained step by step and may be a useful reference for study or work.

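I am a beginner, and this is a brief write-up of a school assignment. The code is very simple; the main point is understanding what each library does. I hope it gives other beginners some ideas.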

I. Project requirements

1. The program can scrape news content from the Beijing University of Technology home page: http://www.bjut.edu.cn

2. The program can write the scraped data to a local MySQL database.

3. The program can send the scraped data to an email address.

4. The program can run on a schedule.

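II. Project analysis

1. The scraper uses the requests library to fetch the HTML text, then uses BeautifulSoup from the bs4 package to parse it and extract the required content.

2. The pymysql library connects to the MySQL database and handles table creation and row insertion.

3. The smtplib library opens the connection to the mail server, and the email package turns the text into an email message for sending.

4. The schedule library runs the program on a schedule.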

III. Code walkthrough

1. Import the required libraries:

# Scraping and database modules
import requests
from bs4 import BeautifulSoup
import pymysql

# Email modules
import smtplib
from email.mime.text import MIMEText
from email.header import Header
import time

# Scheduling module
import schedule

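2. Fetch the HTML:

# Fetch the page and return its HTML text
def getHTMLtext(url):
  try:
    headers = {
      "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
    } # browser-like request headers
    r = requests.get(url, headers=headers, timeout=30) # send the GET request
    r.raise_for_status() # raise an exception on a 4xx/5xx status
    r.encoding = r.apparent_encoding # use the encoding detected from the content
    return r.text
  except requests.RequestException: # network errors and bad status codes
    return "" # the caller receives an empty string on failure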

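The headers argument is required here; without it, the GET request returned an error page.
raise_for_status() checks the response status code: if the request succeeded, execution continues, and if the status is a 4xx or 5xx error it raises an exception, which the try-except block catches.
apparent_encoding is a property (not a method) that guesses the encoding from the response body.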
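
Both behaviours can be observed without touching the network. The following is a minimal sketch, not part of the original program: the helper names status_outcome and decode_with are hypothetical, and the response objects are built by hand (including setting the private _content attribute, which is done here only for demonstration).

```python
import requests


def status_outcome(status_code):
    """Return "ok" or "error" depending on what raise_for_status() does."""
    r = requests.Response()       # hand-built response, no network involved
    r.status_code = status_code
    try:
        r.raise_for_status()      # raises requests.HTTPError for 4xx/5xx
        return "ok"
    except requests.HTTPError:
        return "error"


def decode_with(raw_bytes, encoding):
    """Decode a response body the way r.text does for a given r.encoding."""
    r = requests.Response()
    r.status_code = 200
    r._content = raw_bytes        # private attribute, set here only for the demo
    r.encoding = encoding
    return r.text


if __name__ == "__main__":
    print(status_outcome(200), status_outcome(404))
    body = "北京工业大学".encode("gbk")
    print(decode_with(body, "gbk"))         # decoded with the right encoding
    print(decode_with(body, "iso-8859-1"))  # garbled: wrong encoding
```

In getHTMLtext, assigning r.apparent_encoding to r.encoding plays the role of choosing the right encoding argument here, except that the value is guessed from the bytes rather than known in advance.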

3. Parse the HTML and extract the data:

First inspect the page source to find where the news tags are located.

# Parse the HTML and extract the data
def parseHTML(news, html):
  soup = BeautifulSoup(html, "html.parser") # build the soup
  for i in soup.find(attrs = {'class' : 'list'}).find_all('li'): # each li tag holds one news item
    date = i.p.string + '-' + i.h2.string # date: year-month from the first p, day from h2
    href = i.a['href'] # link
    title = i.find('h1').string # title
    content = i.find_all('p')[1].string # summary
    news.append([date, href, title, content]) # append to the list

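All of the news content sits inside the div tag whose class is "list", and each news item sits inside an li tag, so the find and find_all methods are used to iterate over every li tag.

Within each li tag, the href attribute of the a tag holds the news link, the h1 tag holds the title, the h2 tag holds the day, the first p tag holds the year and month, and the second p tag holds the summary. The text of each tag is extracted in turn, the year, month, and day are joined into one date string, and the four values are appended to the news list.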
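
The layout described above can be tried out on a small hand-written fragment. The sketch below is a variation on parseHTML rather than the author's code: the sample HTML, the function name parse_news, and the use of get_text(strip=True) in place of .string are assumptions made for illustration. It also returns an empty list when the container is missing, which is the situation that arises when getHTMLtext returns an empty string.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics the layout described above (an assumption,
# not a copy of the real page).
SAMPLE_HTML = """
<div class="list">
  <ul>
    <li>
      <a href="/info/1001.htm">
        <p>2020-11</p>
        <h2>25</h2>
        <h1>Sample headline</h1>
        <p>Sample summary text.</p>
      </a>
    </li>
  </ul>
</div>
"""


def parse_news(html):
    """Return a list of [date, href, title, summary] rows."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(class_="list")
    if container is None:          # empty or unexpected page
        return []
    rows = []
    for li in container.find_all("li"):
        paragraphs = li.find_all("p")
        if li.a is None or li.h1 is None or li.h2 is None or len(paragraphs) < 2:
            continue               # skip items that do not match the layout
        date = paragraphs[0].get_text(strip=True) + "-" + li.h2.get_text(strip=True)
        rows.append([
            date,
            li.a["href"],
            li.h1.get_text(strip=True),
            paragraphs[1].get_text(strip=True),
        ])
    return rows


if __name__ == "__main__":
    print(parse_news(SAMPLE_HTML))
    print(parse_news(""))
```

Returning a new list instead of appending to an argument is a design choice of this sketch; the original function fills the list passed in by the caller.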

4. Store in the database

# Store the news in MySQL
def toMysql(news):
  conn = pymysql.connect(host='localhost', port=3306, user='root', password='your_db_password', database='your_db_name', charset='gbk', connect_timeout=1000)
  cursor = conn.cursor()

  # Column names: 日期 = date, 链接 = link, 标题 = title, 梗概 = summary
  sql = '''
  create table if not exists tb_news(
    日期 date,
    链接 varchar(400),
    标题 varchar(400),
    梗概 varchar(400))
  '''

  cursor.execute(sql) # create the table if it does not exist

  sql = 'insert into tb_news(日期, 链接, 标题, 梗概) values(%s, %s, %s, %s)'
  for new in news: # insert one row per news item
    date, href, title, content = new
    cursor.execute(sql, (date, href, title, content))

  conn.commit()
  conn.close()

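Because the news text is long, storing or reading it may produce garbled characters or fail with a "data too long" error. This is related to the database character set; one fix is to set the default character set to gbk in MySQL's my.ini configuration file.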
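
The length part of the problem can also be handled on the Python side by clipping each text field to the column width before inserting. The helper below is a hypothetical addition (prepare_row and MAX_LEN are not part of the original program); it assumes the varchar(400) columns of tb_news and replaces missing values with empty strings.

```python
MAX_LEN = 400  # assumed to match the varchar(400) columns in tb_news


def prepare_row(row, max_len=MAX_LEN):
    """Return (date, href, title, summary) ready for cursor.execute().

    The date is left untouched; the text fields are clipped to max_len
    characters, and None becomes an empty string.
    """
    date, href, title, summary = row
    clipped = tuple((value or "")[:max_len] for value in (href, title, summary))
    return (date,) + clipped


if __name__ == "__main__":
    long_summary = "x" * 500
    row = prepare_row(["2020-11-25", "/info/1001.htm", "Title", long_summary])
    print(len(row[3]))
```

In toMysql, the insert call would then become cursor.execute(sql, prepare_row(new)). Clipping silently discards text, so widening the column is the alternative when the full summary matters.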

5. Send the email

# Send the email
def sendMail(news):
  from_addr = 'sender@example.com' # sender address (placeholder)
  password = 'your-16-char-auth-code' # SMTP authorization code (placeholder)

  to_addr = 'receiver@example.com' # recipient address (placeholder)

  mailhost = 'smtp.qq.com' # QQ Mail SMTP server
  qqmail = smtplib.SMTP() # create the SMTP object
  qqmail.connect(mailhost, 25) # 25 is the standard SMTP port
  qqmail.login(from_addr, password) # log in to the mailbox

  content = ''
  for new in news: # build the message body
    content += 'Date: ' + new[0] + '\n' + 'Link: ' + new[1] + '\n' + 'Title: ' + new[2] + '\n' + 'Summary: ' + new[3] + '\n'
    content += '======================================================================\n'

  # Build the subject line (no trailing newline, which does not belong in a header)
  subject = time.strftime('%Y-%m-%d %X', time.localtime(time.time())) + ' main news scraped from the BJUT home page'

  # Wrap the text in an email message
  msg = MIMEText(content, 'plain', 'utf-8')
  msg['subject'] = Header(subject, 'utf-8')

  try:
    qqmail.sendmail(from_addr, to_addr, msg.as_string())
    print('Sent successfully')
  except smtplib.SMTPException:
    print('Sending failed')
  qqmail.quit()

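Note that the password here is not the mailbox login password but the SMTP authorization code. In QQ Mail, you can enable the SMTP service in the settings and obtain the authorization code there.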
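
The message-building step can be separated from the SMTP connection so that it is easy to inspect without sending anything. The sketch below uses only the standard library; the function name build_message and the example addresses are hypothetical. It also sets the From and To headers, which the function above leaves out (RFC 5322 requires a From header).

```python
from email.header import Header
from email.mime.text import MIMEText


def build_message(news, from_addr, to_addr, subject):
    """Build a plain-text UTF-8 message from [date, href, title, summary] rows."""
    lines = []
    for date, href, title, summary in news:
        lines.append(f"Date: {date}")
        lines.append(f"Link: {href}")
        lines.append(f"Title: {title}")
        lines.append(f"Summary: {summary}")
        lines.append("=" * 70)
    msg = MIMEText("\n".join(lines), "plain", "utf-8")
    msg["Subject"] = Header(subject, "utf-8")  # encodes non-ASCII subjects
    msg["From"] = from_addr
    msg["To"] = to_addr
    return msg


if __name__ == "__main__":
    sample = [["2020-11-25", "/info/1001.htm", "Sample headline", "Sample summary"]]
    msg = build_message(sample, "sender@example.com", "receiver@example.com",
                        "BJUT news digest")
    print(msg.as_string())
```

The returned object can be passed to qqmail.sendmail(from_addr, to_addr, msg.as_string()) in the same way as before.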

6. Main function

# Main function
def main():
  news = []
  url = "http://www.bjut.edu.cn/"
  html = getHTMLtext(url)
  parseHTML(news, html)
  toMysql(news)
  print(news)
  sendMail(news)

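Set the URL of the Beijing University of Technology site and create a list, news, to hold the results; then call the functions in turn to scrape the news, store it in the database, and send it by email. To check that the program does its job, call main() once and look at the output of print(news):

main() # for testing only; removed later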

The list of scraped news items is printed to the console, which shows that the program runs correctly.

7. Run on a schedule

# Run the whole task on a schedule
schedule.every().monday.at("08:00").do(main) # run main every Monday at 08:00
while True:
  schedule.run_pending()
  time.sleep(1)

The infinite loop keeps schedule running. The job is set to run every Monday at 8:00 a.m.
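
Because the loop is meant to run indefinitely, it is worth knowing that schedule does not catch exceptions raised inside a job, so a single failed run (for example, a parsing error on an unexpected page) would end the loop. A minimal sketch of one way to guard against that, using only the standard library; the decorator name keep_running and the two demo jobs are hypothetical:

```python
import functools
import logging


def keep_running(job):
    """Wrap a job so that an exception is logged instead of propagating."""
    @functools.wraps(job)
    def wrapper(*args, **kwargs):
        try:
            return job(*args, **kwargs)
        except Exception:
            logging.exception("job %s failed", job.__name__)
            return None
    return wrapper


@keep_running
def flaky_job():
    raise RuntimeError("simulated failure")


@keep_running
def healthy_job():
    return "done"


if __name__ == "__main__":
    print(flaky_job())    # the error is logged and None is returned
    print(healthy_job())
```

With such a wrapper, the registration line would read schedule.every().monday.at("08:00").do(keep_running(main)).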

To make the effect easy to check, first change the schedule to run every 5 seconds:

schedule.every(5).seconds.do(main)

An email now arrives every 5 seconds, which shows that the scheduling requirement is met. That completes the program.

IV. Complete code

# Scraping and database modules
import requests
from bs4 import BeautifulSoup
import pymysql

# Email modules
import smtplib
from email.mime.text import MIMEText
from email.header import Header
import time

# Scheduling module
import schedule

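# Fetch the page and return its HTML text
def getHTMLtext(url):
  try:
    headers = {
      "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
    } # browser-like request headers
    r = requests.get(url, headers=headers, timeout=30) # send the GET request
    r.raise_for_status() # raise an exception on a 4xx/5xx status
    r.encoding = r.apparent_encoding # use the encoding detected from the content
    return r.text
  except requests.RequestException: # network errors and bad status codes
    return "" # the caller receives an empty string on failure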


# Parse the HTML and extract the data
def parseHTML(news, html):
  soup = BeautifulSoup(html, "html.parser") # build the soup
  for i in soup.find(attrs = {'class' : 'list'}).find_all('li'): # each li tag holds one news item
    date = i.p.string + '-' + i.h2.string # date: year-month from the first p, day from h2
    href = i.a['href'] # link
    title = i.find('h1').string # title
    content = i.find_all('p')[1].string # summary
    news.append([date, href, title, content]) # append to the list

# Store the news in MySQL
def toMysql(news):
  conn = pymysql.connect(host='localhost', port=3306, user='root', password='your_db_password', database='your_db_name', charset='gbk', connect_timeout=1000)
  cursor = conn.cursor()

  # Column names: 日期 = date, 链接 = link, 标题 = title, 梗概 = summary
  sql = '''
  create table if not exists tb_news(
    日期 date,
    链接 varchar(400),
    标题 varchar(400),
    梗概 varchar(400))
  '''

  cursor.execute(sql) # create the table if it does not exist

  sql = 'insert into tb_news(日期, 链接, 标题, 梗概) values(%s, %s, %s, %s)'
  for new in news: # insert one row per news item
    date, href, title, content = new
    cursor.execute(sql, (date, href, title, content))

  conn.commit()
  conn.close()

# Send the email
def sendMail(news):
  from_addr = 'sender@example.com' # sender address (placeholder)
  password = 'your-16-char-auth-code' # SMTP authorization code (placeholder)

  to_addr = 'receiver@example.com' # recipient address (placeholder)

  mailhost = 'smtp.qq.com' # QQ Mail SMTP server
  qqmail = smtplib.SMTP() # create the SMTP object
  qqmail.connect(mailhost, 25) # 25 is the standard SMTP port
  qqmail.login(from_addr, password) # log in to the mailbox

  content = ''
  for new in news: # build the message body
    content += 'Date: ' + new[0] + '\n' + 'Link: ' + new[1] + '\n' + 'Title: ' + new[2] + '\n' + 'Summary: ' + new[3] + '\n'
    content += '======================================================================\n'

  # Build the subject line (no trailing newline, which does not belong in a header)
  subject = time.strftime('%Y-%m-%d %X', time.localtime(time.time())) + ' main news scraped from the BJUT home page'

  # Wrap the text in an email message
  msg = MIMEText(content, 'plain', 'utf-8')
  msg['subject'] = Header(subject, 'utf-8')

  try:
    qqmail.sendmail(from_addr, to_addr, msg.as_string())
    print('Sent successfully')
  except smtplib.SMTPException:
    print('Sending failed')
  qqmail.quit()



# Main function
def main():
  news = []
  url = "http://www.bjut.edu.cn/"
  html = getHTMLtext(url)
  parseHTML(news, html)
  toMysql(news) # store in the database, as in the walkthrough
  print(news)
  sendMail(news)
  
# Run the whole task on a schedule
schedule.every().monday.at("08:00").do(main) # run main every Monday at 08:00
while True:
  schedule.run_pending()
  time.sleep(1)
