Fixing Garbled Text When Reading Files with Java Spark
1. The Problem
Environment: JDK 1.8, Spark 3.2.1. Reading a GB18030-encoded file from HDFS produces garbled text (mojibake).
二、心酸歷程
為了解決該問題,嘗試過很多種方法,但都沒有成功
Attempt 1: textFile + Configuration — garbled
String filePath = "hdfs:///user/test.deflate";
String encoding = "GB18030";

// Create the SparkSession and SparkContext instances
SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .appName("Spark Example")
        .getOrCreate();
JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());

Configuration entries = sc.hadoopConfiguration();
entries.set("textinputformat.record.delimiter", "\n");
entries.set("mapreduce.input.fileinputformat.inputdir", filePath);
entries.set("mapreduce.input.fileinputformat.encoding", "GB18030");

JavaRDD<String> rdd = sc.textFile(filePath); // output is garbled
Attempt 2: spark.read().option — garbled
Dataset<Row> load = spark.read().format("text")
        .option("encoding", "GB18030")
        .load(filePath);
load.foreach(row -> {
    System.out.println(row.toString());
    System.out.println(new String(row.toString().getBytes(encoding), "UTF-8"));
    System.out.println(new String(row.toString().getBytes(encoding), "GBK"));
});
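A likely reason the getBytes round-trips above cannot help: once the GB18030 bytes have been decoded with the wrong charset, invalid byte sequences are replaced by U+FFFD and the original bytes are lost, so no amount of re-encoding afterwards can restore them. A minimal pure-Java sketch (no Spark, hypothetical sample text) illustrating that the corruption is irreversible:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class IrreversibleMojibake {
    public static void main(String[] args) {
        Charset gb18030 = Charset.forName("GB18030");
        // The GB18030-encoded bytes as they would sit in the file
        byte[] original = "中文测试".getBytes(gb18030);

        // Decoding them as UTF-8 (what happens upstream in the failed
        // attempts) replaces invalid sequences with U+FFFD:
        String garbled = new String(original, StandardCharsets.UTF_8);
        System.out.println("garbled: " + garbled);

        // Re-encoding the garbled string cannot restore the bytes:
        byte[] roundTrip = garbled.getBytes(gb18030);
        System.out.println("bytes recovered: " + Arrays.equals(original, roundTrip));

        // Decoding the *raw* bytes with the right charset works:
        System.out.println("correct: " + new String(original, gb18030));
    }
}
```

This is why the fix has to happen at the byte level, before any wrong decode takes place.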
Attempt 3: newAPIHadoopFile + Configuration — garbled
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD = sc.newAPIHadoopFile(
        filePath, TextInputFormat.class, LongWritable.class, Text.class, entries);
System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD.count());
longWritableTextJavaPairRDD.foreach(k -> {
    System.out.println(k._2);
});
Attempt 4: newAPIHadoopFile + custom InputFormat — garbled
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD = sc.newAPIHadoopFile(
        filePath, GBKInputFormat.class, LongWritable.class, Text.class, entries);
System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD.count());
longWritableTextJavaPairRDD.foreach(k -> {
    System.out.println(k._2);
});
Here GBKInputFormat.class is a copy of TextInputFormat.class with the internal UTF-8 references changed to GB18030.
Attempt 5: newAPIHadoopRDD + custom InputFormat — garbled
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD1 = sc.newAPIHadoopRDD(
        entries, GBKInputFormat.class, LongWritable.class, Text.class);
System.out.println("longWritableTextJavaPairRDD1 count = " + longWritableTextJavaPairRDD1.count());
longWritableTextJavaPairRDD1.foreach(k -> {
    System.out.println(k._2());
});
3. The Final Solution
In all of the attempts above, the specified charset never seemed to take effect, and I am still not sure exactly why; if you know the cause, please enlighten me, thanks. One plausible explanation is that Hadoop's Text holds raw bytes but Text.toString() always decodes them as UTF-8, so printing the value directly ignores whatever charset was configured upstream.
The working approach is as follows:
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD = sc.newAPIHadoopFile(
        filePath, TextInputFormat.class, LongWritable.class, Text.class, new Configuration());
System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD.count());
longWritableTextJavaPairRDD.foreach(k -> {
    // Copy the raw bytes out of the Text value and decode them explicitly
    System.out.println(new String(k._2.copyBytes(), encoding));
});

JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD1 = sc.newAPIHadoopRDD(
        entries, TextInputFormat.class, LongWritable.class, Text.class);
System.out.println("longWritableTextJavaPairRDD1 count = " + longWritableTextJavaPairRDD1.count());
longWritableTextJavaPairRDD1.foreach(k -> {
    // k._2() and k._2 are equivalent ways to access the Text value
    System.out.println(new String(k._2().copyBytes(), encoding));
});
The key is new String(k._2().copyBytes(), encoding): instead of letting Text.toString() decode the bytes (always as UTF-8), copy the raw bytes out of the Text value and decode them explicitly with the file's real charset.
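The core of the fix can be verified without a Spark cluster. A minimal plain-Java sketch of the same idea, contrasting a toString-style UTF-8 decode with the explicit decode of the raw bytes (the sample text is hypothetical, standing in for one line returned by copyBytes()):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitDecode {
    public static void main(String[] args) {
        String encoding = "GB18030";
        // Raw bytes of one line as stored in HDFS
        byte[] rawLine = "文件读取测试".getBytes(Charset.forName(encoding));

        // What Text.toString() does internally: decode as UTF-8 -> mojibake
        String wrong = new String(rawLine, StandardCharsets.UTF_8);

        // What the final solution does: decode the raw bytes from
        // copyBytes() with the file's actual charset
        String right = new String(rawLine, Charset.forName(encoding));

        System.out.println("toString-style decode: " + wrong);
        System.out.println("explicit decode:       " + right);
    }
}
```

The same pattern applies to any non-UTF-8 charset: keep the bytes untouched until the last moment, then decode once with the correct Charset.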
This concludes the article on fixing garbled text when reading files with Java Spark.