Using Spring Boot + WebMagic + MyBatis as a crawler framework
WebMagic is an open-source Java crawler framework. How to use WebMagic itself is not the focus of this article; for details, see the official documentation: http://webmagic.io/docs/.
This article integrates Spring Boot, WebMagic, and MyBatis: WebMagic crawls the data, and MyBatis persists the crawled data to a MySQL database. The source code provided here can serve as scaffolding for a Java crawler project.

1. Add the Maven dependencies
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>hyzx</groupId>
    <artifactId>qbasic-crawler</artifactId>
    <version>1.0.0</version>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>1.5.21.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.test.skip>true</maven.test.skip>
        <java.version>1.8</java.version>
        <maven.compiler.plugin.version>3.8.1</maven.compiler.plugin.version>
        <maven.resources.plugin.version>3.1.0</maven.resources.plugin.version>
        <mysql.connector.version>5.1.47</mysql.connector.version>
        <druid.spring.boot.starter.version>1.1.17</druid.spring.boot.starter.version>
        <mybatis.spring.boot.starter.version>1.3.4</mybatis.spring.boot.starter.version>
        <fastjson.version>1.2.58</fastjson.version>
        <commons.lang3.version>3.9</commons.lang3.version>
        <joda.time.version>2.10.2</joda.time.version>
        <webmagic.core.version>0.7.3</webmagic.core.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-devtools</artifactId>
            <scope>runtime</scope>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-configuration-processor</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysql.connector.version}</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid-spring-boot-starter</artifactId>
            <version>${druid.spring.boot.starter.version}</version>
        </dependency>
        <dependency>
            <groupId>org.mybatis.spring.boot</groupId>
            <artifactId>mybatis-spring-boot-starter</artifactId>
            <version>${mybatis.spring.boot.starter.version}</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${fastjson.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>${commons.lang3.version}</version>
        </dependency>
        <dependency>
            <groupId>joda-time</groupId>
            <artifactId>joda-time</artifactId>
            <version>${joda.time.version}</version>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>${webmagic.core.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>${maven.compiler.plugin.version}</version>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-resources-plugin</artifactId>
                <version>${maven.resources.plugin.version}</version>
                <configuration>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <configuration>
                    <fork>true</fork>
                    <addResources>true</addResources>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>repackage</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
    <repositories>
        <repository>
            <id>public</id>
            <name>aliyun nexus</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
        </repository>
    </repositories>
    <pluginRepositories>
        <pluginRepository>
            <id>public</id>
            <name>aliyun nexus</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </pluginRepository>
    </pluginRepositories>
</project>
2. Project configuration file application.properties
Configure the MySQL data source, the Druid connection pool, and the location of the MyBatis mapper files.
# MySQL data source configuration
spring.datasource.name=mysql
spring.datasource.type=com.alibaba.druid.pool.DruidDataSource
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://192.168.0.63:3306/gjhzjl?useUnicode=true&characterEncoding=utf8&useSSL=false&allowMultiQueries=true
spring.datasource.username=root
spring.datasource.password=root

# Druid connection pool configuration
spring.datasource.druid.initial-size=5
spring.datasource.druid.min-idle=5
spring.datasource.druid.max-active=10
spring.datasource.druid.max-wait=60000
spring.datasource.druid.validation-query=SELECT 1 FROM DUAL
spring.datasource.druid.test-on-borrow=false
spring.datasource.druid.test-on-return=false
spring.datasource.druid.test-while-idle=true
spring.datasource.druid.time-between-eviction-runs-millis=60000
spring.datasource.druid.min-evictable-idle-time-millis=300000
spring.datasource.druid.max-evictable-idle-time-millis=600000

# MyBatis configuration
mybatis.mapperLocations=classpath:mapper/**/*.xml
3. Database table structure
CREATE TABLE `cms_content` (
  `contentId` varchar(40) NOT NULL COMMENT 'Content ID',
  `title` varchar(150) NOT NULL COMMENT 'Title',
  `content` longtext COMMENT 'Article content',
  `releaseDate` datetime NOT NULL COMMENT 'Release date',
  PRIMARY KEY (`contentId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='CMS content table';
4. Entity class
import java.util.Date;

public class CmsContentPO {
    private String contentId;
    private String title;
    private String content;
    private Date releaseDate;

    public String getContentId() {
        return contentId;
    }

    public void setContentId(String contentId) {
        this.contentId = contentId;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public Date getReleaseDate() {
        return releaseDate;
    }

    public void setReleaseDate(Date releaseDate) {
        this.releaseDate = releaseDate;
    }
}
5. Mapper interface
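import com.hyzx.qbasic.model.CmsContentPO;

public interface CrawlerMapper {
    // Inserts one crawled article and returns the number of affected rows
    int addCmsContent(CmsContentPO record);
}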
6. The CrawlerMapper.xml file
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="com.hyzx.qbasic.dao.CrawlerMapper">
    <insert id="addCmsContent" parameterType="com.hyzx.qbasic.model.CmsContentPO">
        insert into cms_content (contentId,
                                 title,
                                 releaseDate,
                                 content)
        values (#{contentId,jdbcType=VARCHAR},
                #{title,jdbcType=VARCHAR},
                #{releaseDate,jdbcType=TIMESTAMP},
                #{content,jdbcType=LONGVARCHAR})
    </insert>
</mapper>
7. Zhihu page processor class ZhihuPageProcessor
It parses the Zhihu HTML pages that the crawler fetches.
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

@Component
public class ZhihuPageProcessor implements PageProcessor {
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Queue every link on the page that points to an answer page
        page.addTargetRequests(page.getHtml().links().regex("https://www\\.zhihu\\.com/question/\\d+/answer/\\d+.*").all());
        // Extract the question title and the answer text
        page.putField("title", page.getHtml().xpath("//h1[@class='QuestionHeader-title']/text()").toString());
        page.putField("answer", page.getHtml().xpath("//div[@class='QuestionAnswer-content']/tidyText()").toString());
        if (page.getResultItems().get("title") == null) {
            // List page: skip it so the pipeline does no further processing
            page.setSkip(true);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}
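The regular expression passed to `regex(...)` in `process` decides which discovered links are queued for crawling. The following is a minimal sketch for checking that pattern against sample URLs without starting a crawl. It uses only `java.util.regex`; the class name `AnswerLinkFilter` and the sample URLs are hypothetical and not part of the original project, and WebMagic's own selector may differ in matching details, so treat it as an approximation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original project.
final class AnswerLinkFilter {
    // Same pattern string as the one used in ZhihuPageProcessor.process
    static final Pattern ANSWER_URL =
            Pattern.compile("https://www\\.zhihu\\.com/question/\\d+/answer/\\d+.*");

    // Keeps only the links that contain an answer-page URL.
    static List<String> keepAnswerLinks(List<String> links) {
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            if (ANSWER_URL.matcher(link).find()) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample URLs for illustration
        List<String> links = Arrays.asList(
                "https://www.zhihu.com/question/123/answer/456",
                "https://www.zhihu.com/question/123",
                "https://www.zhihu.com/explore",
                "https://www.zhihu.com/question/789/answer/1011?utm=x");
        System.out.println(keepAnswerLinks(links));
    }
}
```

With these samples, only the two URLs that contain `/answer/` followed by digits are kept; the question page and the explore page are dropped.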
8. Zhihu data pipeline class ZhihuPipeline
It stores the data parsed from the Zhihu HTML pages in the MySQL database.
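import java.util.Date;
import java.util.UUID;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import com.hyzx.qbasic.dao.CrawlerMapper;
import com.hyzx.qbasic.model.CmsContentPO;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

@Component
public class ZhihuPipeline implements Pipeline {
    private static final Logger LOGGER = LoggerFactory.getLogger(ZhihuPipeline.class);
    @Autowired
    private CrawlerMapper crawlerMapper;

    @Override
    public void process(ResultItems resultItems, Task task) {
        String title = resultItems.get("title");
        String answer = resultItems.get("answer");
        // Map the extracted fields onto the entity
        CmsContentPO contentPO = new CmsContentPO();
        contentPO.setContentId(UUID.randomUUID().toString());
        contentPO.setTitle(title);
        contentPO.setReleaseDate(new Date());
        contentPO.setContent(answer);
        try {
            // Report success only if a row was actually inserted
            boolean success = crawlerMapper.addCmsContent(contentPO) > 0;
            if (success) {
                LOGGER.info("Saved Zhihu article: {}", title);
            } else {
                LOGGER.warn("Zhihu article was not saved: {}", title);
            }
        } catch (Exception ex) {
            LOGGER.error("Failed to save Zhihu article", ex);
        }
    }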
}
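Because the pipeline assigns a random UUID to each record, a page that is crawled again in a later run is stored as a new row. One possible variation, which is not part of the original article, is to derive `contentId` from the page URL so that the same page always maps to the same ID. The sketch below uses only the JDK; the class name `ContentIds` and the sample URL are hypothetical. It assumes the URL is read in the pipeline through `resultItems.getRequest().getUrl()`.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper, not part of the original project.
final class ContentIds {
    // Name-based UUID: the same URL always yields the same 36-character ID,
    // which fits the varchar(40) contentId column.
    static String fromUrl(String url) {
        return UUID.nameUUIDFromBytes(url.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        // Made-up sample URL for illustration
        String url = "https://www.zhihu.com/question/123/answer/456";
        System.out.println(fromUrl(url));
        // Calling it again with the same URL gives the same ID
        System.out.println(fromUrl(url).equals(fromUrl(url)));
    }
}
```

With such an ID, a repeated insert for the same page would be rejected by the primary key on `contentId` rather than adding a second row, so the insert statement or the error handling would need to account for that.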
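9. Zhihu crawler task class ZhihuTask
It starts a crawl every ten minutes. Because `start()` launches the spider asynchronously, the ten-minute delay is counted from the moment a crawl is launched, not from the moment it finishes.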
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Spider;

@Component
public class ZhihuTask {
    private static final Logger LOGGER = LoggerFactory.getLogger(ZhihuTask.class);
    @Autowired
    private ZhihuPipeline zhihuPipeline;
    @Autowired
    private ZhihuPageProcessor zhihuPageProcessor;
    private ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    public void crawl() {
        // Scheduled task: crawl once every 10 minutes
        timer.scheduleWithFixedDelay(() -> {
            Thread.currentThread().setName("zhihuCrawlerThread");
            try {
                Spider.create(zhihuPageProcessor)
                        // Start crawling from https://www.zhihu.com/explore
                        .addUrl("https://www.zhihu.com/explore")
                        // Store the crawled data in the database
                        .addPipeline(zhihuPipeline)
                        // Crawl with 2 threads
                        .thread(2)
                        // Start the spider asynchronously
                        .start();
            } catch (Exception ex) {
                LOGGER.error("Scheduled Zhihu crawl thread failed", ex);
            }
        }, 0, 10, TimeUnit.MINUTES);
    }
}
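The try/catch inside the scheduled lambda matters: with `scheduleWithFixedDelay`, an exception that escapes one run suppresses all later runs. The following is a minimal sketch of that behavior using only the JDK, with millisecond delays in place of the ten-minute interval. The class name `FixedDelayDemo` and the `failingCrawl` stand-in are hypothetical and not part of the original project.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of the original project.
final class FixedDelayDemo {

    // Schedules a task that always fails and returns how many times it ran
    // during the observation window.
    static int countRuns(boolean catchInsideTask, long delayMillis, long observeMillis) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        timer.scheduleWithFixedDelay(() -> {
            runs.incrementAndGet();
            if (catchInsideTask) {
                try {
                    failingCrawl();
                } catch (Exception ex) {
                    // Swallowed here (the article logs it), so the next run is still scheduled
                }
            } else {
                failingCrawl(); // escapes the task, so later runs are suppressed
            }
        }, 0, delayMillis, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(observeMillis);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        } finally {
            timer.shutdownNow();
        }
        return runs.get();
    }

    // Stand-in for a crawl that throws.
    static void failingCrawl() {
        throw new IllegalStateException("simulated crawl failure");
    }

    public static void main(String[] args) {
        System.out.println("without catch: " + countRuns(false, 20, 300));
        System.out.println("with catch:    " + countRuns(true, 20, 300));
    }
}
```

Without the catch the task runs once and is never rescheduled; with the catch it keeps running for as long as the timer is alive.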
10. Spring Boot application startup class
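import org.mybatis.spring.annotation.MapperScan;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
@MapperScan(basePackages = "com.hyzx.qbasic.dao")
public class Application implements CommandLineRunner {
    @Autowired
    private ZhihuTask zhihuTask;

    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }

    @Override
    public void run(String... strings) throws Exception {
        // Start crawling Zhihu once the application context is ready
        zhihuTask.crawl();
    }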
}