Solving garbled text when reading files with Java Spark
I. The Problem
Environment: JDK 1.8, Spark 3.2.1. Reading a GB18030-encoded file from Hadoop produces garbled text (mojibake).
II. Failed Attempts
I tried many approaches to solve this problem; none of them worked.
1. textFile + Configuration — garbled
String filePath = "hdfs:///user/test.deflate";
String encoding = "GB18030";

// Create the SparkSession and SparkContext instances
SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .appName("Spark Example")
        .getOrCreate();
JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());

Configuration entries = sc.hadoopConfiguration();
entries.set("textinputformat.record.delimiter", "\n");
entries.set("mapreduce.input.fileinputformat.inputdir", filePath);
entries.set("mapreduce.input.fileinputformat.encoding", "GB18030");

JavaRDD<String> rdd = sc.textFile(filePath);
2. spark.read().option — garbled
Dataset<Row> load = spark.read()
        .format("text")
        .option("encoding", "GB18030")
        .load(filePath);
load.foreach(row -> {
    System.out.println(row.toString());
    System.out.println(new String(row.toString().getBytes(encoding), "UTF-8"));
    System.out.println(new String(row.toString().getBytes(encoding), "GBK"));
});
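The re-encoding tricks in the foreach above cannot work once the string is already garbled: when bytes are mis-decoded with the wrong charset, invalid sequences are replaced with U+FFFD and the original byte values are lost for good. A minimal, Spark-free sketch of this (the string literal is just an arbitrary example):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LossyDecodeDemo {
    public static void main(String[] args) {
        Charset gb18030 = Charset.forName("GB18030");

        // Raw GB18030 bytes, as they would sit in the source file
        byte[] raw = "中文测试".getBytes(gb18030);

        // Decoding them as UTF-8 (the default assumption) replaces invalid
        // sequences with the replacement character U+FFFD
        String mangled = new String(raw, StandardCharsets.UTF_8);
        System.out.println(mangled.contains("\uFFFD")); // true

        // Re-encoding the mangled string and decoding again cannot restore
        // the original text -- the information is already gone
        String roundTrip = new String(mangled.getBytes(gb18030), gb18030);
        System.out.println("中文测试".equals(roundTrip)); // false
    }
}
```

This is why any fix has to intercept the raw bytes before the first wrong decode happens.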
3. newAPIHadoopFile + Configuration — garbled
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD = sc.newAPIHadoopFile(
        filePath, TextInputFormat.class, LongWritable.class, Text.class, entries);
System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD.count());
longWritableTextJavaPairRDD.foreach(k -> {
    System.out.println(k._2);
});
4. newAPIHadoopFile + custom InputFormat — garbled
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD = sc.newAPIHadoopFile(
        filePath, GBKInputFormat.class, LongWritable.class, Text.class, entries);
System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD.count());
longWritableTextJavaPairRDD.foreach(k -> {
    System.out.println(k._2);
});
GBKInputFormat.class here is a copy of TextInputFormat.class with the internal UTF-8 references changed to GB18030.
5. newAPIHadoopRDD + custom InputFormat — garbled
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD1 = sc.newAPIHadoopRDD(
        entries, GBKInputFormat.class, LongWritable.class, Text.class);
System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD1.count());
longWritableTextJavaPairRDD1.foreach(k -> {
    System.out.println(k._2());
});
III. Final Solution
None of the encoding settings above seemed to take effect. The likely reason is that Hadoop's Text class stores its contents as bytes but always decodes them as UTF-8 in toString(), so by the time a line surfaces as a String the GB18030 bytes have already been mis-decoded; if anyone knows the full story, I'd appreciate an explanation, thanks.
The working solution is as follows:
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD = sc.newAPIHadoopFile(
        filePath, TextInputFormat.class, LongWritable.class, Text.class, new Configuration());
System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD.count());
longWritableTextJavaPairRDD.foreach(k -> {
    System.out.println(new String(k._2.copyBytes(), encoding));
});

JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD1 = sc.newAPIHadoopRDD(
        entries, TextInputFormat.class, LongWritable.class, Text.class);
System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD1.count());
longWritableTextJavaPairRDD1.foreach(k -> {
    System.out.println(new String(k._2().copyBytes(), encoding));
});
The key is new String(k._2().copyBytes(), encoding): Text preserves the file's raw bytes, copyBytes() returns them untouched, and decoding them explicitly with GB18030 recovers the original text.
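The mechanism behind the fix can be shown without Spark: as long as the original GB18030 bytes are preserved, as Text.copyBytes() does, one explicit decode with the correct charset recovers the text. A minimal, Spark-free sketch (the string literal is just an arbitrary stand-in for a line of the input file):

```java
import java.nio.charset.Charset;

public class CopyBytesDemo {
    public static void main(String[] args) {
        Charset gb18030 = Charset.forName("GB18030");

        // Simulate one line of the input file: raw GB18030 bytes,
        // which is what Text.copyBytes() hands back unchanged
        byte[] rawLine = "编码测试".getBytes(gb18030);

        // Decode explicitly with the file's real charset -- the same move as
        // new String(k._2().copyBytes(), encoding) in the solution above
        String decoded = new String(rawLine, gb18030);
        System.out.println("编码测试".equals(decoded)); // true
    }
}
```

The decode must happen on the untouched bytes; once toString() has run with the wrong charset, no amount of re-encoding can undo the damage.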