Example code: building a web crawler with Spring Boot and WebMagic

Updated: 2023-10-13 10:46:50   Author: 終碼一生
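WebMagic is an open-source Java crawler framework.

How to use WebMagic itself is not the focus of this article; for details, see the official documentation: http://webmagic.io/docs/

This article integrates Spring Boot, WebMagic, and MyBatis: WebMagic crawls the data, and MyBatis persists the crawled data to a MySQL database.

The source code provided here can serve as scaffolding for a Java crawler project.

1. Add the Maven dependencies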

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>hyzx</groupId>
    <artifactId>qbasic-crawler</artifactId>
    <version>1.0.0</version>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>1.5.21.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.test.skip>true</maven.test.skip>
        <java.version>1.8</java.version>
        <maven.compiler.plugin.version>3.8.1</maven.compiler.plugin.version>
        <maven.resources.plugin.version>3.1.0</maven.resources.plugin.version>
        <mysql.connector.version>5.1.47</mysql.connector.version>
        <druid.spring.boot.starter.version>1.1.17</druid.spring.boot.starter.version>
        <mybatis.spring.boot.starter.version>1.3.4</mybatis.spring.boot.starter.version>
        <fastjson.version>1.2.58</fastjson.version>
        <commons.lang3.version>3.9</commons.lang3.version>
        <joda.time.version>2.10.2</joda.time.version>
        <webmagic.core.version>0.7.3</webmagic.core.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-devtools</artifactId>
            <scope>runtime</scope>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-configuration-processor</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysql.connector.version}</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid-spring-boot-starter</artifactId>
            <version>${druid.spring.boot.starter.version}</version>
        </dependency>
        <dependency>
            <groupId>org.mybatis.spring.boot</groupId>
            <artifactId>mybatis-spring-boot-starter</artifactId>
            <version>${mybatis.spring.boot.starter.version}</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${fastjson.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>${commons.lang3.version}</version>
        </dependency>
        <dependency>
            <groupId>joda-time</groupId>
            <artifactId>joda-time</artifactId>
            <version>${joda.time.version}</version>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>${webmagic.core.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>${maven.compiler.plugin.version}</version>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-resources-plugin</artifactId>
                <version>${maven.resources.plugin.version}</version>
                <configuration>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <configuration>
                    <fork>true</fork>
                    <addResources>true</addResources>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>repackage</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
    <repositories>
        <repository>
            <id>public</id>
            <name>aliyun nexus</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
        </repository>
    </repositories>
    <pluginRepositories>
        <pluginRepository>
            <id>public</id>
            <name>aliyun nexus</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </pluginRepository>
    </pluginRepositories>
</project>

2. Project configuration file application.properties

Configure the MySQL data source, the Druid connection pool, and the location of the MyBatis mapper files.

# MySQL data source configuration
spring.datasource.name=mysql
spring.datasource.type=com.alibaba.druid.pool.DruidDataSource
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://192.168.0.63:3306/gjhzjl?useUnicode=true&characterEncoding=utf8&useSSL=false&allowMultiQueries=true
spring.datasource.username=root
spring.datasource.password=root
# Druid connection pool configuration
spring.datasource.druid.initial-size=5
spring.datasource.druid.min-idle=5
spring.datasource.druid.max-active=10
spring.datasource.druid.max-wait=60000
spring.datasource.druid.validation-query=SELECT 1 FROM DUAL
spring.datasource.druid.test-on-borrow=false
spring.datasource.druid.test-on-return=false
spring.datasource.druid.test-while-idle=true
spring.datasource.druid.time-between-eviction-runs-millis=60000
spring.datasource.druid.min-evictable-idle-time-millis=300000
spring.datasource.druid.max-evictable-idle-time-millis=600000
# MyBatis configuration
mybatis.mapperLocations=classpath:mapper/**/*.xml

3. Database table structure

CREATE TABLE `cms_content` (
  `contentId` varchar(40) NOT NULL COMMENT 'Content ID',
  `title` varchar(150) NOT NULL COMMENT 'Title',
  `content` longtext COMMENT 'Article content',
  `releaseDate` datetime NOT NULL COMMENT 'Release date',
  PRIMARY KEY (`contentId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='CMS content table';

4. Entity class

import java.util.Date;
public class CmsContentPO {
    private String contentId;
    private String title;
    private String content;
    private Date releaseDate;
    public String getContentId() {
        return contentId;
    }
    public void setContentId(String contentId) {
        this.contentId = contentId;
    }
    public String getTitle() {
        return title;
    }
    public void setTitle(String title) {
        this.title = title;
    }
    public String getContent() {
        return content;
    }
    public void setContent(String content) {
        this.content = content;
    }
    public Date getReleaseDate() {
        return releaseDate;
    }
    public void setReleaseDate(Date releaseDate) {
        this.releaseDate = releaseDate;
    }
}

5. Mapper interface

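package com.hyzx.qbasic.dao;
import com.hyzx.qbasic.model.CmsContentPO;
public interface CrawlerMapper {
    // Inserts one crawled article and returns the number of rows affected
    int addCmsContent(CmsContentPO record);
}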

6. The CrawlerMapper.xml file

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="com.hyzx.qbasic.dao.CrawlerMapper">
    <insert id="addCmsContent" parameterType="com.hyzx.qbasic.model.CmsContentPO">
        insert into cms_content (contentId,
                                 title,
                                 releaseDate,
                                 content)
        values (#{contentId,jdbcType=VARCHAR},
                #{title,jdbcType=VARCHAR},
                #{releaseDate,jdbcType=TIMESTAMP},
                #{content,jdbcType=LONGVARCHAR})
    </insert>
</mapper>

7. Page content processor for site XXX: XXXPageProcessor

This class parses the HTML pages crawled from XXX.

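import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

@Component
public class XXXPageProcessor implements PageProcessor {
    // Retry up to 3 times and pause 1000 ms between requests
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);
    @Override
    public void process(Page page) {
        // Queue every link on the page that looks like an answer URL
        page.addTargetRequests(page.getHtml().links().regex("https://www\\.xxx\\.com/question/\\d+/answer/\\d+.*").all());
        // Extract the question title and the answer body
        page.putField("title", page.getHtml().xpath("//h1[@class='QuestionHeader-title']/text()").toString());
        page.putField("answer", page.getHtml().xpath("//div[@class='QuestionAnswer-content']/tidyText()").toString());
        if (page.getResultItems().get("title") == null) {
            // A list page has no title: skip it so the pipeline does no further processing
            page.setSkip(true);
        }
    }
    @Override
    public Site getSite() {
        return site;
    }
}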
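
To see which links the filter in process() would keep, the following sketch applies the same regular expression to a few URLs using only java.util.regex. It is an approximation under stated assumptions: the class LinkFilterSketch and the method keepAnswerLinks are hypothetical, and WebMagic's own regex selector may differ in details such as partial matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of WebMagic: mimics the answer-link filter
// with plain java.util.regex.
class LinkFilterSketch {
    // Same expression as the one passed to regex(...) in XXXPageProcessor.
    static final Pattern ANSWER_URL =
            Pattern.compile("https://www\\.xxx\\.com/question/\\d+/answer/\\d+.*");

    // Returns the URLs that match the pattern in full, preserving input order.
    static List<String> keepAnswerLinks(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (ANSWER_URL.matcher(url).matches()) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

Under this check, a start page such as https://www.xxx.com/explore or a bare question URL is dropped, while any URL containing both a question ID and an answer ID is kept, with or without a trailing query string.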

8. Data pipeline for site XXX: XXXPipeline

This class stores the data parsed from the XXX HTML pages in the MySQL database.

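import java.util.Date;
import java.util.UUID;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;
import com.hyzx.qbasic.dao.CrawlerMapper;
import com.hyzx.qbasic.model.CmsContentPO;

@Component
public class XXXPipeline implements Pipeline {
    private static final Logger LOGGER = LoggerFactory.getLogger(XXXPipeline.class);
    @Autowired
    private CrawlerMapper crawlerMapper;
    @Override
    public void process(ResultItems resultItems, Task task) {
        String title = resultItems.get("title");
        String answer = resultItems.get("answer");
        CmsContentPO contentPO = new CmsContentPO();
        contentPO.setContentId(UUID.randomUUID().toString());
        contentPO.setTitle(title);
        contentPO.setReleaseDate(new Date());
        contentPO.setContent(answer);
        try {
            boolean success = crawlerMapper.addCmsContent(contentPO) > 0;
            if (success) {
                LOGGER.info("Article saved: {}", title);
            } else {
                LOGGER.warn("Article not saved (no rows affected): {}", title);
            }
        } catch (Exception ex) {
            LOGGER.error("Failed to save article", ex);
        }
    }
}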
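
The pipeline fills contentId with UUID.randomUUID().toString(). As a quick sanity check, the sketch below shows that this canonical text form is 36 characters long, so it fits the varchar(40) primary-key column of the cms_content table. The class ContentIdSketch and its methods are hypothetical names used only for illustration.

```java
import java.util.UUID;

// Hypothetical helper: generates an ID the same way XXXPipeline does.
class ContentIdSketch {
    // Canonical UUID text: 32 hex digits plus 4 hyphens, 36 characters in total.
    static String newContentId() {
        return UUID.randomUUID().toString();
    }

    // True when the ID fits a column of the given width, e.g. 40 for contentId.
    static boolean fits(String id, int columnWidth) {
        return id.length() <= columnWidth;
    }
}
```

Because each page gets a fresh random ID, crawling the same answer twice would insert two rows; deduplication, if needed, would have to key on something derived from the page itself.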

9. Crawler task class XXXTask

This class starts the crawler every ten minutes.

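import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Spider;

@Component
public class XXXTask {
    private static final Logger LOGGER = LoggerFactory.getLogger(XXXTask.class);
    @Autowired
    private XXXPipeline xxxPipeline;
    @Autowired
    private XXXPageProcessor xxxPageProcessor;
    private ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    public void crawl() {
        // Scheduled task: start a crawl immediately, then again 10 minutes after each run returns
        timer.scheduleWithFixedDelay(() -> {
            Thread.currentThread().setName("xxxCrawlerThread");
            try {
                Spider.create(xxxPageProcessor)
                        // Start crawling from https://www.xxx.com/explore
                        .addUrl("https://www.xxx.com/explore")
                        // Store the crawled data in the database
                        .addPipeline(xxxPipeline)
                        // Crawl with 2 threads
                        .thread(2)
                        // Start the spider asynchronously
                        .start();
            } catch (Exception ex) {
                LOGGER.error("Scheduled crawl task failed", ex);
            }
        }, 0, 10, TimeUnit.MINUTES);
    }
}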
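
The try/catch inside the scheduled lambda matters: with a ScheduledExecutorService, an exception that escapes a periodic task suppresses all later runs. The sketch below, using only java.util.concurrent, contrasts an unguarded task with a guarded one. The class SchedulerSketch, its method names, and the 10 ms delay are assumptions chosen for the demonstration; the task in this article uses 10 minutes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of the article's project.
class SchedulerSketch {
    // Unguarded task: the first exception suppresses every later run.
    static int runsWithoutCatch() {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        ScheduledFuture<?> future = timer.scheduleWithFixedDelay(() -> {
            runs.incrementAndGet();
            throw new IllegalStateException("simulated crawl failure");
        }, 0, 10, TimeUnit.MILLISECONDS);
        try {
            future.get(); // throws ExecutionException once the task has thrown
        } catch (ExecutionException expected) {
            // the executor does not reschedule a task that threw
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        timer.shutdownNow();
        return runs.get();
    }

    // Guarded task: the exception is caught, so the schedule keeps going.
    static boolean guardedTaskKeepsRunning() {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch threeRuns = new CountDownLatch(3);
        timer.scheduleWithFixedDelay(() -> {
            try {
                threeRuns.countDown();
                throw new IllegalStateException("simulated crawl failure");
            } catch (Exception ex) {
                // the article's code logs the error at this point
            }
        }, 0, 10, TimeUnit.MILLISECONDS);
        boolean reached = false;
        try {
            // Wait until the task has run three times, or give up after 5 seconds
            reached = threeRuns.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        timer.shutdownNow();
        return reached;
    }
}
```

Note also that scheduleWithFixedDelay measures the delay from the end of one run to the start of the next. Since the spider is started asynchronously, each run returns almost immediately, so the effective interval stays close to the configured delay regardless of how long the crawl itself takes.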

10. Spring Boot application startup class

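import org.mybatis.spring.annotation.MapperScan;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
@MapperScan(basePackages = "com.hyzx.qbasic.dao")
public class Application implements CommandLineRunner {
    @Autowired
    private XXXTask xxxTask;
    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
    @Override
    public void run(String... strings) throws Exception {
        // Start the crawl schedule once the application context is ready
        xxxTask.crawl();
    }
}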
