Example: sending CSV data to Kafka with Java
Why send CSV data to Kafka
- When doing stream processing with Flink, a Kafka topic is a common choice of data source, so while learning and developing with Flink it is useful to push the records of a dataset file to Kafka to simulate a continuous stream;
- The overall flow looks like this:

- You might think this is overkill: why not have Flink read the CSV directly? The reasons are as follows:
- First, this mirrors the real setup: during learning and development the dataset is a CSV file, while in production the real-time data source is Kafka;
- Second, the Java application can add extra logic, such as data transformation or summary statistics (useful for cross-checking Flink's results);
- Also, if two records are actually one minute apart, the Java application can wait one minute before sending the second one. The Flink community demo implements exactly this logic; it likewise sends a dataset to Kafka for Flink to consume. Its address is: https://github.com/ververica/sql-training
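The interval-replay idea can be sketched roughly as follows; the class and method names are hypothetical and this is not the sql-training demo's actual implementation:

```java
// A minimal sketch (hypothetical names) of the interval-replay idea:
// wait for the gap between consecutive event timestamps before sending
// the next record, so replay speed matches the original data.
public class ReplayDelay {

    // Given two consecutive event timestamps in epoch seconds, return how
    // many milliseconds to wait before emitting the second record.
    public static long delayMillis(long prevEpochSeconds, long nextEpochSeconds) {
        // clamp at zero in case records arrive out of order
        return Math.max((nextEpochSeconds - prevEpochSeconds) * 1000L, 0L);
    }
}
```

Before producing record n+1, the sender would call `Thread.sleep(delayMillis(tsN, tsN1))`, so a one-minute gap in the data becomes a one-minute pause in the replay.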
How to send the CSV data to Kafka
As the diagram above shows, reading the CSV and producing messages to Kafka is the Java application's job, so today's main task is to develop that Java application and verify it;
Versions
- JDK: 1.8.0_181
- IDE: IntelliJ IDEA 2019.2.1 (Ultimate Edition)
- OS: Win10
- Zookeeper: 3.4.13
- Kafka: 2.4.0 (Scala: 2.12)
About the dataset
- The dataset used here is a CSV file containing 1.04 million records of Taobao user behavior. The data comes from the Alibaba Cloud Tianchi public datasets; I made minor adjustments to it;
- The CSV file can be downloaded from CSDN: https://download.csdn.net/download/boling_cavalry/12381698
- It can also be downloaded from my GitHub: https://raw.githubusercontent.com/zq2599/blog_demos/master/files/UserBehavior.7z
- The CSV file has six columns, described in the table below:
| Column | Description |
|---|---|
| User ID | integer; serialized user ID |
| Item ID | integer; serialized item ID |
| Category ID | integer; serialized ID of the category the item belongs to |
| Behavior type | string enum, one of ('pv', 'buy', 'cart', 'fav') |
| Timestamp | the timestamp (epoch seconds) at which the behavior occurred |
| Time string | a time string generated from the timestamp column |
- For details of this dataset, see the earlier post "Preparing a dataset for Flink learning"
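The column layout above can be illustrated with a small stdlib-only sketch; the class name and the sample row used in the comments are made up for illustration, not taken from the real dataset:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch: how one six-column row of the CSV maps to fields.
public class CsvRowDemo {

    public static String describe(String csvLine) {
        String[] cols = csvLine.split(",");
        long userId = Long.parseLong(cols[0]);      // column 1: user ID
        long itemId = Long.parseLong(cols[1]);      // column 2: item ID
        long categoryId = Long.parseLong(cols[2]);  // column 3: category ID
        String behavior = cols[3];                  // column 4: pv/buy/cart/fav
        long epochSeconds = Long.parseLong(cols[4]); // column 5: epoch seconds
        // column 6 is a time string derived from column 5, like this:
        String timeString = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
                .withZone(ZoneOffset.UTC)
                .format(Instant.ofEpochSecond(epochSeconds));
        return userId + "/" + itemId + "/" + categoryId + "/" + behavior + "/" + timeString;
    }
}
```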
Overview of the Java application
Before coding, here is the list of pieces to build, which we will then implement one by one:
- UserBehaviorCsvFileReader: utility class that reads records from the CSV
- UserBehavior: bean class corresponding to one record
- JsonSerializer: serialization class that turns a Java object into JSON
- KafkaProducer: utility class that sends messages to Kafka
- SendMessageApplication: application class, the program entry point
These five classes make up the whole Java application. Let's start coding;
Download the source directly
If you would rather not write the code yourself, you can download this project's source from GitHub; the addresses and links are in the table below:
| Name | Link | Note |
|---|---|---|
| Project home page | https://github.com/zq2599/blog_demos | the project's home page on GitHub |
| Git repository (https) | https://github.com/zq2599/blog_demos.git | repository address of the project source, https protocol |
| Git repository (ssh) | git@github.com:zq2599/blog_demos.git | repository address of the project source, ssh protocol |
This Git project contains multiple folders; the source for this article is under the flinksql folder, as shown in the red box below:

Coding
Create a Maven project. The pom.xml is below; the important dependencies are jackson and javacsv:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.bolingcavalry</groupId>
    <artifactId>flinksql</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <flink.version>1.10.0</flink.version>
        <kafka.version>2.2.0</kafka.version>
        <java.version>1.8</java.version>
        <scala.binary.version>2.11</scala.binary.version>
        <maven.compiler.source>${java.version}</maven.compiler.source>
        <maven.compiler.target>${java.version}</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>${kafka.version}</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.9.10.1</version>
        </dependency>
        <!-- Logging dependencies -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.7</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>net.sourceforge.javacsv</groupId>
            <artifactId>javacsv</artifactId>
            <version>2.0</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- Java Compiler -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                </configuration>
            </plugin>
            <!-- Shade plugin to include all dependencies -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.0.0</version>
                <executions>
                    <!-- Run shade goal on package phase -->
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <artifactSet>
                                <excludes>
                                </excludes>
                            </artifactSet>
                            <filters>
                                <filter>
                                    <!-- Do not copy the signatures in the META-INF folder.
                                    Otherwise, this might cause SecurityExceptions when using the JAR. -->
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
UserBehaviorCsvFileReader, the utility class that reads records from the CSV. The main program will later use Java 8's Stream API to process the collection, so UserBehaviorCsvFileReader implements the Supplier interface:
import com.csvreader.CsvReader;

import java.io.IOException;
import java.util.Date;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

public class UserBehaviorCsvFileReader implements Supplier<UserBehavior> {

    private final String filePath;
    private CsvReader csvReader;

    public UserBehaviorCsvFileReader(String filePath) throws IOException {
        this.filePath = filePath;
        try {
            csvReader = new CsvReader(filePath);
            // skip the header line
            csvReader.readHeaders();
        } catch (IOException e) {
            throw new IOException("Error reading UserBehavior records from file: " + filePath, e);
        }
    }

    @Override
    public UserBehavior get() {
        UserBehavior userBehavior = null;
        try {
            if (csvReader.readRecord()) {
                userBehavior = new UserBehavior(
                        Long.valueOf(csvReader.get(0)),
                        Long.valueOf(csvReader.get(1)),
                        Long.valueOf(csvReader.get(2)),
                        csvReader.get(3),
                        // the timestamp column is in epoch seconds, Date wants milliseconds
                        new Date(Long.valueOf(csvReader.get(4)) * 1000L));
            }
        } catch (IOException e) {
            throw new NoSuchElementException("IOException from " + filePath);
        }
        if (null == userBehavior) {
            // end of file: throwing here is what terminates the Stream.generate pipeline
            throw new NoSuchElementException("All records read from " + filePath);
        }
        return userBehavior;
    }
}
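The same Supplier pattern can be seen in a javacsv-free sketch that reads lines from an in-memory source; the class name is made up. Like the real reader, it signals end-of-input by throwing NoSuchElementException:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

// Hypothetical sketch of the Supplier pattern used by UserBehaviorCsvFileReader,
// with an in-memory line source standing in for the CSV file.
public class LineSupplier implements Supplier<String> {

    private final BufferedReader reader;

    public LineSupplier(String content) {
        this.reader = new BufferedReader(new StringReader(content));
    }

    @Override
    public String get() {
        try {
            String line = reader.readLine();
            if (line == null) {
                // end of input: stop the surrounding stream
                throw new NoSuchElementException("all lines read");
            }
            return line;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```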
UserBehavior, the bean class corresponding to one record. It simply mirrors the CSV columns; the ts field that represents time carries a JsonFormat annotation, which controls its format during serialization:
import com.fasterxml.jackson.annotation.JsonFormat;

import java.util.Date;

public class UserBehavior {

    @JsonFormat
    private long user_id;

    @JsonFormat
    private long item_id;

    @JsonFormat
    private long category_id;

    @JsonFormat
    private String behavior;

    // serialized as a string such as "2017-11-26T01:00:00Z"
    @JsonFormat(shape = JsonFormat.Shape.STRING, pattern = "yyyy-MM-dd'T'HH:mm:ss'Z'")
    private Date ts;

    public UserBehavior() {
    }

    public UserBehavior(long user_id, long item_id, long category_id, String behavior, Date ts) {
        this.user_id = user_id;
        this.item_id = item_id;
        this.category_id = category_id;
        this.behavior = behavior;
        this.ts = ts;
    }
}
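To see what the pattern in the @JsonFormat annotation produces for the ts field, here is a small stdlib-only sketch (the class name is made up) that applies the same pattern; note that Jackson formats @JsonFormat dates in UTC by default:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical sketch: apply the same pattern that the bean's
// @JsonFormat annotation uses, to preview the serialized form of ts.
public class TsFormatDemo {

    public static String format(long epochSeconds) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        // Jackson also defaults to UTC when rendering @JsonFormat dates
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(epochSeconds * 1000L));
    }
}
```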
JsonSerializer, the serialization class that turns a Java object into JSON:
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonSerializer<T> {

    private final ObjectMapper jsonMapper = new ObjectMapper();

    public String toJSONString(T r) {
        try {
            return jsonMapper.writeValueAsString(r);
        } catch (JsonProcessingException e) {
            throw new IllegalArgumentException("Could not serialize record: " + r, e);
        }
    }

    public byte[] toJSONBytes(T r) {
        try {
            return jsonMapper.writeValueAsBytes(r);
        } catch (JsonProcessingException e) {
            throw new IllegalArgumentException("Could not serialize record: " + r, e);
        }
    }
}
KafkaProducer, the utility class that sends messages to Kafka:
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

import java.util.Properties;
import java.util.function.Consumer;

public class KafkaProducer implements Consumer<UserBehavior> {

    private final String topic;
    private final org.apache.kafka.clients.producer.KafkaProducer<byte[], byte[]> producer;
    private final JsonSerializer<UserBehavior> serializer;

    public KafkaProducer(String kafkaTopic, String kafkaBrokers) {
        this.topic = kafkaTopic;
        this.producer = new org.apache.kafka.clients.producer.KafkaProducer<>(createKafkaProperties(kafkaBrokers));
        this.serializer = new JsonSerializer<>();
    }

    @Override
    public void accept(UserBehavior record) {
        // serialize the object into a byte array
        byte[] data = serializer.toJSONBytes(record);
        // wrap it in a ProducerRecord
        ProducerRecord<byte[], byte[]> kafkaRecord = new ProducerRecord<>(topic, data);
        // send it
        producer.send(kafkaRecord);
        // throttle the message rate with sleep; tune this to your Kafka and Flink setup
        try {
            Thread.sleep(500);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    /**
     * Kafka configuration
     * @param brokers The brokers to connect to.
     * @return A Kafka producer configuration.
     */
    private static Properties createKafkaProperties(String brokers) {
        Properties kafkaProps = new Properties();
        kafkaProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);
        kafkaProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getCanonicalName());
        kafkaProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getCanonicalName());
        return kafkaProps;
    }
}
Finally, the application class SendMessageApplication. The CSV file path, the Kafka topic and the Kafka broker address are all set here; with the help of Java 8's Stream API, only a few lines of code are needed to do all the work:
import java.util.stream.Stream;

public class SendMessageApplication {

    public static void main(String[] args) throws Exception {
        // path of the CSV file
        String filePath = "D:\\temp\\202005\\02\\UserBehavior.csv";
        // kafka topic
        String topic = "user_behavior";
        // kafka broker address
        String broker = "192.168.50.43:9092";

        Stream.generate(new UserBehaviorCsvFileReader(filePath))
                .sequential()
                .forEachOrdered(new KafkaProducer(topic, broker));
    }
}
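The shape of this pipeline can be imitated with a Kafka-free sketch; the names below are placeholders, and a finite limit stands in for the reader's end-of-file NoSuchElementException:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;
import java.util.stream.Stream;

// Hypothetical sketch of the Supplier -> Stream.generate -> Consumer pipeline
// that SendMessageApplication builds with the CSV reader and Kafka producer.
public class PipelineDemo {

    public static List<String> run(int count) {
        List<String> sent = new ArrayList<>();
        Supplier<String> reader = () -> "record"; // stands in for the CSV reader
        Consumer<String> producer = sent::add;    // stands in for the Kafka producer
        Stream.generate(reader)
                .limit(count) // the real reader ends by throwing NoSuchElementException
                .sequential()
                .forEachOrdered(producer);
        return sent;
    }
}
```

Note that Stream.generate produces an infinite stream; in the real application the pipeline terminates when UserBehaviorCsvFileReader throws NoSuchElementException at end of file.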
Verification
- Make sure Kafka is up and a topic named user_behavior has been created;
- Have the CSV file in place;
- Double-check that the file path, Kafka topic and Kafka broker settings in SendMessageApplication.java are correct;
- Run SendMessageApplication.java;
- Start a console consumer for the Kafka messages; a reference command:
./kafka-console-consumer.sh \
--bootstrap-server 127.0.0.1:9092 \
--topic user_behavior \
--consumer-property group.id=old-consumer-test \
--consumer-property consumer.id=old-consumer-cl \
--from-beginning
- Under normal conditions you should see messages immediately, as shown below:

At this point, simulating a user-behavior message stream with a Java application is complete, and the upcoming Flink articles will use it as their data source;
This concludes the detailed example of sending CSV data to Kafka with Java.

