
An Example of Crawling and Persisting Data with SpringBoot + WebMagic + MyBatis

 Updated: 2021-10-17 11:41:22   Author: 非空子集
WebMagic is an open-source crawler framework. This project uses WebMagic inside a SpringBoot project to scrape data, then uses MyBatis to write the data to a database.


Source code for this project: ArticleCrawler: a crawler with database persistence built on SpringBoot + WebMagic + MyBatis (gitee.com)

Create the database:

In this example the database is named article and the table is named cms_content. The table has three columns: contentId, title, and date.

CREATE TABLE `cms_content` (
  `contentId` varchar(40) NOT NULL COMMENT 'Content ID',
  `title` varchar(150) NOT NULL COMMENT 'Title',
  `date` varchar(150) NOT NULL COMMENT 'Publish date',
  PRIMARY KEY (`contentId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='CMS content table';

Create a new SpringBoot project:

1. Configure the dependencies in pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.5.5</version>
        <relativePath/>
    </parent>
    <groupId>com.example</groupId>
    <artifactId>Article</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>Article</name>
    <description>Article</description>
    <properties>
        <java.version>1.8</java.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.test.skip>true</maven.test.skip>
        <maven.compiler.plugin.version>3.8.1</maven.compiler.plugin.version>
        <maven.resources.plugin.version>3.1.0</maven.resources.plugin.version>

        <mysql.connector.version>5.1.47</mysql.connector.version>
        <druid.spring.boot.starter.version>1.1.17</druid.spring.boot.starter.version>
        <mybatis.spring.boot.starter.version>1.3.4</mybatis.spring.boot.starter.version>
        <fastjson.version>1.2.58</fastjson.version>
        <commons.lang3.version>3.9</commons.lang3.version>
        <joda.time.version>2.10.2</joda.time.version>
        <webmagic.core.version>0.7.5</webmagic.core.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>


        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-configuration-processor</artifactId>
            <optional>true</optional>
        </dependency>

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysql.connector.version}</version>
        </dependency>

        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid-spring-boot-starter</artifactId>
            <version>${druid.spring.boot.starter.version}</version>
        </dependency>

        <dependency>
            <groupId>org.mybatis.spring.boot</groupId>
            <artifactId>mybatis-spring-boot-starter</artifactId>
            <version>${mybatis.spring.boot.starter.version}</version>
        </dependency>

        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${fastjson.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>${commons.lang3.version}</version>
        </dependency>

        <dependency>
            <groupId>joda-time</groupId>
            <artifactId>joda-time</artifactId>
            <version>${joda.time.version}</version>
        </dependency>

        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>${webmagic.core.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>${maven.compiler.plugin.version}</version>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-resources-plugin</artifactId>
                <version>${maven.resources.plugin.version}</version>
                <configuration>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <configuration>
                    <fork>true</fork>
                    <addResources>true</addResources>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>repackage</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <repositories>
        <repository>
            <id>public</id>
            <name>aliyun nexus</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
        </repository>
    </repositories>

    <pluginRepositories>
        <pluginRepository>
            <id>public</id>
            <name>aliyun nexus</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </pluginRepository>
    </pluginRepositories>

</project>

2. Create CmsContentPO.java

The data entity. Its three fields correspond to the three columns of the table.

package site.exciter.article.model;

public class CmsContentPO {
    private String contentId;

    private String title;

    private String date;

    public String getContentId() {
        return contentId;
    }

    public void setContentId(String contentId) {
        this.contentId = contentId;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getDate() {
        return date;
    }

    public void setDate(String date) {
        this.date = date;
    }
}

3. Create CrawlerMapper.java

package site.exciter.article.dao;

import org.apache.ibatis.annotations.Mapper;
import site.exciter.article.model.CmsContentPO;

@Mapper
public interface CrawlerMapper {
    int addCmsContent(CmsContentPO record);
}

4. Configure the mapping file CrawlerMapper.xml

Under resources, create a folder named mapper, then create CrawlerMapper.xml inside it.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="site.exciter.article.dao.CrawlerMapper">

    <insert id="addCmsContent" parameterType="site.exciter.article.model.CmsContentPO">
        insert into cms_content (contentId,
        title,
        date)
        values (#{contentId,jdbcType=VARCHAR},
        #{title,jdbcType=VARCHAR},
        #{date,jdbcType=VARCHAR})
    </insert>
</mapper>

5. Configure application.properties

Configure the database connection and the location of the MyBatis mapping file.

# mysql
spring.datasource.name=mysql
spring.datasource.type=com.alibaba.druid.pool.DruidDataSource
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://10.201.61.184:3306/article?useUnicode=true&characterEncoding=utf8&useSSL=false&allowMultiQueries=true
spring.datasource.username=root
spring.datasource.password=root

# druid
spring.datasource.druid.initial-size=5
spring.datasource.druid.min-idle=5
spring.datasource.druid.max-active=10
spring.datasource.druid.max-wait=60000
spring.datasource.druid.validation-query=SELECT 1 FROM DUAL
spring.datasource.druid.test-on-borrow=false
spring.datasource.druid.test-on-return=false
spring.datasource.druid.test-while-idle=true
spring.datasource.druid.time-between-eviction-runs-millis=60000
spring.datasource.druid.min-evictable-idle-time-millis=300000
spring.datasource.druid.max-evictable-idle-time-millis=600000

# mybatis
mybatis.mapperLocations=classpath:mapper/CrawlerMapper.xml

6. Create ArticlePageProcessor.java

The logic that parses the HTML.

package site.exciter.article;

import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

@Component
public class ArticlePageProcessor implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Selectors for the list page (detail links, next page) and the detail page (title, date)
        String detail_urls_Xpath = "//*[@class='postTitle']/a[@class='postTitle2']/@href";
        String next_page_xpath = "//*[@id='nav_next_page']/a/@href";
        String next_page_css = "#homepage_top_pager > div:nth-child(1) > a:nth-child(7)";
        String title_xpath = "//h1[@class='postTitle']/a/span/text()";
        String date_xpath = "//span[@id='post-date']/text()";
        page.putField("title", page.getHtml().xpath(title_xpath).toString());
        if (page.getResultItems().get("title") == null) {
            page.setSkip(true);
        }
        page.putField("date", page.getHtml().xpath(date_xpath).toString());

        if (page.getHtml().xpath(detail_urls_Xpath).match()) {
            Selectable detailUrls = page.getHtml().xpath(detail_urls_Xpath);
            page.addTargetRequests(detailUrls.all());
        }

        if (page.getHtml().xpath(next_page_xpath).match()) {
            Selectable nextPageUrl = page.getHtml().xpath(next_page_xpath);
            page.addTargetRequests(nextPageUrl.all());

        } else if (page.getHtml().css(next_page_css).match()) {
            Selectable nextPageUrl = page.getHtml().css(next_page_css).links();
            page.addTargetRequests(nextPageUrl.all());
        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}
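
To see what the four XPath expressions above select, the following is a minimal sketch that runs them with the JDK's built-in XPath engine against two tiny hand-written pages. The class name XPathSketch and both sample pages are hypothetical stand-ins, not the real cnblogs markup. WebMagic evaluates the expressions with its own HTML-tolerant selector, so results on real pages that are not well-formed XML may differ.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {

    // Hypothetical, well-formed stand-in for a blog list page.
    static final String LIST_PAGE =
            "<html><body>"
            + "<div class='postTitle'><a class='postTitle2' href='/p/1.html'>First</a></div>"
            + "<div class='postTitle'><a class='postTitle2' href='/p/2.html'>Second</a></div>"
            + "<div id='nav_next_page'><a href='/default.html?page=3'>Next</a></div>"
            + "</body></html>";

    // Hypothetical, well-formed stand-in for an article detail page.
    static final String DETAIL_PAGE =
            "<html><body>"
            + "<h1 class='postTitle'><a href='/p/1.html'><span>First</span></a></h1>"
            + "<span id='post-date'>2021-10-17 11:41</span>"
            + "</body></html>";

    // Evaluates an XPath expression and returns the string value of every matched node.
    static List<String> select(String markup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // For attribute and text nodes, getNodeValue() is the attribute value or the text.
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (Exception ex) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, ex);
        }
    }

    public static void main(String[] args) {
        // List page: links to detail pages and the link to the next list page.
        System.out.println(select(LIST_PAGE, "//*[@class='postTitle']/a[@class='postTitle2']/@href"));
        System.out.println(select(LIST_PAGE, "//*[@id='nav_next_page']/a/@href"));
        // Detail page: title and publish date.
        System.out.println(select(DETAIL_PAGE, "//h1[@class='postTitle']/a/span/text()"));
        System.out.println(select(DETAIL_PAGE, "//span[@id='post-date']/text()"));
    }
}
```

On the list page the title expression matches nothing, which corresponds to the case where the processor calls page.setSkip(true) so that no row is written for list pages.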

7. Create ArticlePipeline.java

Handles persisting the data.

package site.exciter.article;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import site.exciter.article.model.CmsContentPO;
import site.exciter.article.dao.CrawlerMapper;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.util.UUID;

@Component
public class ArticlePipeline implements Pipeline {

    private static final Logger LOGGER = LoggerFactory.getLogger(ArticlePipeline.class);

    @Autowired
    private CrawlerMapper crawlerMapper;

    public void process(ResultItems resultItems, Task task) {
        String title = resultItems.get("title");
        String date = resultItems.get("date");

        CmsContentPO contentPO = new CmsContentPO();
        contentPO.setContentId(UUID.randomUUID().toString());
        contentPO.setTitle(title);
        contentPO.setDate(date);

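        try {
            // addCmsContent returns the number of rows inserted
            boolean success = crawlerMapper.addCmsContent(contentPO) > 0;
            if (success) {
                LOGGER.info("Saved: {}", title);
            } else {
                LOGGER.warn("No row inserted: {}", title);
            }
        } catch (Exception ex) {
            LOGGER.error("Save failed", ex);
        }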
    }
}
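
One thing to be aware of: the pipeline assigns a random UUID to contentId, and the crawl is scheduled to repeat, so an article seen on a later pass would be stored again under a new ID. As a possible variation (not part of the original project), the ID could be derived from the page URL so that the same article always maps to the same key. The sketch below uses only the JDK. The class name ContentIdSketch and the sample URLs are hypothetical, and it assumes the page URL is available in the pipeline (for example from the request attached to the ResultItems).

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

class ContentIdSketch {

    // Derives a stable, name-based UUID string (36 characters) from a page URL.
    // The same URL always yields the same ID, unlike UUID.randomUUID().
    static String idFor(String url) {
        return UUID.nameUUIDFromBytes(url.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        // Hypothetical URLs, used only to illustrate the behaviour.
        String first = idFor("http://example.com/p/1.html");
        String again = idFor("http://example.com/p/1.html");
        String other = idFor("http://example.com/p/2.html");
        System.out.println(first.equals(again)); // same URL, same ID
        System.out.println(first.equals(other)); // different URL, different ID
    }
}
```

With a stable ID, a repeated insert for the same article would violate the primary key on contentId and land in the pipeline's catch block, so the insert statement or the error handling would need to account for that.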

8. Create ArticleTask.java

Runs the crawl task.

package site.exciter.article;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Spider;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

@Component
public class ArticleTask {
    private static final Logger LOGGER = LoggerFactory.getLogger(ArticleTask.class);

    @Autowired
    private ArticlePipeline articlePipeline;

    @Autowired
    private ArticlePageProcessor articlePageProcessor;

    private ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    public void crawl() {
        // Scheduled task: crawl once every 10 minutes
        timer.scheduleWithFixedDelay(() -> {
            Thread.currentThread().setName("ArticleCrawlerThread");

            try {
                Spider.create(articlePageProcessor)
                        .addUrl("http://www.cnblogs.com/dick159/default.html?page=2")
                        // Store the crawled data in the database
                        .addPipeline(articlePipeline)
                        // Crawl with 5 threads
                        .thread(5)
                        // Start the spider asynchronously
                        .start();
            } catch (Exception ex) {
                LOGGER.error("Scheduled crawl thread failed", ex);
            }
        }, 0, 10, TimeUnit.MINUTES);
    }
}
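
The try/catch inside the scheduled lambda matters: with scheduleWithFixedDelay, an exception that escapes one execution suppresses all later executions. The following is a minimal JDK-only sketch of that behaviour, using a stand-in job in place of the Spider and millisecond delays in place of 10 minutes. The class name FixedDelaySketch and the simulated failure are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class FixedDelaySketch {

    // Returns true if the job was attempted `runs` times within five seconds,
    // even though the second attempt fails.
    static boolean keepsRunningAfterFailure(int runs, long delayMillis) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger attempts = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(runs);

        timer.scheduleWithFixedDelay(() -> {
            try {
                // Stand-in for the crawl; the second attempt fails on purpose.
                if (attempts.incrementAndGet() == 2) {
                    throw new IllegalStateException("simulated crawl failure");
                }
            } catch (Exception ex) {
                // Catching here keeps the schedule alive for the next run.
                System.err.println("run failed: " + ex.getMessage());
            } finally {
                done.countDown();
            }
        }, 0, delayMillis, TimeUnit.MILLISECONDS);

        try {
            return done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            // Stop the timer thread so the JVM can exit.
            timer.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(keepsRunningAfterFailure(3, 20));
    }
}
```

The delay is measured from the end of one execution to the start of the next. Because the spider is started asynchronously, each execution in ArticleTask ends as soon as start() returns, so the 10 minutes are counted from that point rather than from the moment the crawl finishes.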

9. Modify the Application class

package site.exciter.article;

import org.mybatis.spring.annotation.MapperScan;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
@MapperScan(basePackages = "site.exciter.article.dao")
public class ArticleApplication implements CommandLineRunner {

    @Autowired
    private ArticleTask articleTask;

    public static void main(String[] args) {
        SpringApplication.run(ArticleApplication.class, args);
    }

    @Override
    public void run(String... args) throws Exception {
        articleTask.crawl();
    }
}

10. Run the application to start crawling data and writing it to the database
