Sample code for reading and writing Parquet format data in Java
This article shows how to read and write Parquet format data in Java. The details are as follows:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.log4j.Logger;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.GroupFactory;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetReader.Builder;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ReadParquet {
    static Logger logger = Logger.getLogger(ReadParquet.class);

    public static void main(String[] args) throws Exception {
        // parquetWriter("test\\parquet-out2", "input.txt");
        parquetReaderV2("test\\parquet-out2");
    }

    static void parquetReaderV2(String inPath) throws Exception {
        GroupReadSupport readSupport = new GroupReadSupport();
        Builder<Group> reader = ParquetReader.builder(readSupport, new Path(inPath));
        ParquetReader<Group> build = reader.build();
        Group line = null;
        while ((line = build.read()) != null) {
            Group time = line.getGroup("time", 0);
            // Fields can be fetched either by index or by field name
            /*System.out.println(line.getString(0, 0) + "\t" +
                line.getString(1, 0) + "\t" +
                time.getInteger(0, 0) + "\t" +
                time.getString(1, 0) + "\t");*/
            System.out.println(line.getString("city", 0) + "\t" +
                line.getString("ip", 0) + "\t" +
                time.getInteger("ttl", 0) + "\t" +
                time.getString("ttl2", 0) + "\t");
            // System.out.println(line.toString());
        }
        System.out.println("done reading");
    }

    // In newer versions all of the ParquetReader constructors appear to be deprecated;
    // use the builder above to construct the reader instead.
    static void parquetReader(String inPath) throws Exception {
        GroupReadSupport readSupport = new GroupReadSupport();
        ParquetReader<Group> reader = new ParquetReader<Group>(new Path(inPath), readSupport);
        Group line = null;
        while ((line = reader.read()) != null) {
            System.out.println(line.toString());
        }
        System.out.println("done reading");
    }

    /**
     * @param outPath path of the Parquet file to write
     * @param inPath  plain text input file
     * @throws IOException
     */
    static void parquetWriter(String outPath, String inPath) throws IOException {
        MessageType schema = MessageTypeParser.parseMessageType("message Pair {\n" +
            "  required binary city (UTF8);\n" +
            "  required binary ip (UTF8);\n" +
            "  repeated group time {\n" +
            "    required int32 ttl;\n" +
            "    required binary ttl2;\n" +
            "  }\n" +
            "}");
        GroupFactory factory = new SimpleGroupFactory(schema);
        Path path = new Path(outPath);
        Configuration configuration = new Configuration();
        GroupWriteSupport writeSupport = new GroupWriteSupport();
        GroupWriteSupport.setSchema(schema, configuration);
        ParquetWriter<Group> writer = new ParquetWriter<Group>(path, configuration, writeSupport);
        // Read the local text file and use it to generate the Parquet file
        BufferedReader br = new BufferedReader(new FileReader(new File(inPath)));
        String line = "";
        Random r = new Random();
        while ((line = br.readLine()) != null) {
            String[] strs = line.split("\\s+");
            if (strs.length == 2) {
                Group group = factory.newGroup()
                    .append("city", strs[0])
                    .append("ip", strs[1]);
                Group tmpG = group.addGroup("time");
                tmpG.append("ttl", r.nextInt(9) + 1);
                tmpG.append("ttl2", r.nextInt(9) + "_a");
                writer.write(group);
            }
        }
        System.out.println("write end");
        writer.close();
    }
}
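For reference, parquetWriter splits each line of the input file on whitespace and only writes lines with exactly two fields (city and ip). A hypothetical input.txt (the values are made up for illustration) could therefore look like:

beijing 192.168.0.1
shanghai 10.0.0.2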
A few words about the schema (writing Parquet data requires a schema; when reading, the schema is "auto-detected" from the file):
/*
 * Each field has three attributes: repetition, data type, and field name.
 * The repetition can be one of the following three:
 *   required (appears exactly once)
 *   repeated (appears zero or more times)
 *   optional (appears zero or one time)
 * The data type of a field falls into two categories:
 *   group (complex type)
 *   primitive (basic type)
 * The primitive types are:
 *   INT64, INT32, BOOLEAN, BINARY, FLOAT, DOUBLE, INT96, FIXED_LEN_BYTE_ARRAY
 */
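To make those attributes concrete, here is a hypothetical schema (the User record and its fields are invented for this illustration) that combines all three repetition levels with both group and primitive types:

// Hypothetical schema for illustration only: required, optional and repeated
// repetition levels, one group type, and several primitive types.
MessageType userSchema = MessageTypeParser.parseMessageType(
    "message User {\n" +
    "  required binary name (UTF8);\n" +   // appears exactly once
    "  optional int64 lastLogin;\n" +      // appears zero or one time
    "  repeated group address {\n" +       // appears zero or more times
    "    required binary city (UTF8);\n" +
    "    optional double latitude;\n" +
    "  }\n" +
    "}");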
The difference between repeated and required is not just the number of occurrences; the data type produced after serialization also differs. For example, when ttl2 sits inside a repeated group it prints as WrappedArray([7,7_a]), whereas with required it prints as [7,7_a]. Besides generating a MessageType with MessageTypeParser.parseMessageType, you can also use the builder approach shown below.
(Note the pitfall here, which shows up in Spark: as(OriginalType.UTF8) on ttl2 has the same effect as required binary city (UTF8). With the UTF8 annotation the field can be converted to StringType when read; without it you get the error [B cannot be cast to java.lang.String.)
/*MessageType schema = MessageTypeParser.parseMessageType("message Pair {\n" +
    "  required binary city (UTF8);\n" +
    "  required binary ip (UTF8);\n" +
    "  repeated group time {\n" +
    "    required int32 ttl;\n" +
    "    required binary ttl2;\n" +
    "  }\n" +
    "}");*/
// import org.apache.parquet.schema.Types;
MessageType schema = Types.buildMessage()
    .required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("city")
    .required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("ip")
    .repeatedGroup()
        .required(PrimitiveTypeName.INT32).named("ttl")
        .required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("ttl2")
        .named("time")
    .named("Pair");
Fixing the [B cannot be cast to java.lang.String exception:
1. Either add the UTF8 annotation when generating the Parquet file, or
2. Provide a matching schema at read time that specifies the type of that field, for example:
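A minimal sketch of option 2 in plain Java, assuming GroupReadSupport honors the parquet.read.schema property (ReadSupport.PARQUET_READ_SCHEMA) and ParquetReader.Builder.withConf; the read-time schema re-declares ttl2 with the UTF8 annotation:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.api.ReadSupport;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class ReadWithSchema {
    static void parquetReaderWithSchema(String inPath) throws Exception {
        // Read-time schema that re-declares ttl2 with the UTF8 annotation.
        String readSchema = "message Pair {\n" +
            "  required binary city (UTF8);\n" +
            "  required binary ip (UTF8);\n" +
            "  repeated group time {\n" +
            "    required int32 ttl;\n" +
            "    required binary ttl2 (UTF8);\n" +
            "  }\n" +
            "}";
        Configuration conf = new Configuration();
        // GroupReadSupport picks up the requested schema from this property.
        conf.set(ReadSupport.PARQUET_READ_SCHEMA, readSchema);
        ParquetReader<Group> reader = ParquetReader
            .builder(new GroupReadSupport(), new Path(inPath))
            .withConf(conf)
            .build();
        Group line;
        while ((line = reader.read()) != null) {
            System.out.println(line.toString());
        }
        reader.close();
    }
}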
Maven dependency (I used 1.7):
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-hadoop</artifactId>
  <version>1.7.0</version>
</dependency>
That is all for this article. I hope it helps with your learning, and I hope you will continue to support 腳本之家.