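Best Practices for Reading Word and Excel Files with POI

Updated: 2017-11-27 10:43:39 Author: neal

Apache POI is a free, open-source, cross-platform API written in Java that lets Java programs read and write files in Microsoft Office formats. This article covers practical techniques for reading Word and Excel files with POI, for readers who need a reference.

Preface

POI is Apache's well-known library for reading and writing Microsoft Office documents. Many people have probably used it to export reports or to create and read Word documents, and it does make these tasks much easier. The tool I have been building recently reads the Word and Excel files stored on a computer.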

POI structure

Package name: description

HSSF: reads and writes Microsoft Excel XLS files.

XSSF: reads and writes Microsoft Excel OOXML XLSX files.

HWPF: reads and writes Microsoft Word DOC files.

HSLF: reads and writes Microsoft PowerPoint files.

HDGF: reads Microsoft Visio files.

HPBF: reads Microsoft Publisher files.

HSMF: reads Microsoft Outlook files.

The sections below cover the pitfalls I ran into with Word and Excel files:

Word

For Word files, all I need is the body text. So a first attempt is a method that reads a doc or docx file:

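 // Imports needed by this method:
 // import java.io.IOException;
 // import java.io.InputStream;
 // import org.apache.poi.hwpf.extractor.WordExtractor;
 // import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
 // import org.apache.poi.xwpf.usermodel.XWPFDocument;

 // First attempt: trusts the file extension to pick the extractor.
 private static String readDoc(String filePath, InputStream is) {
  String text = "";
  try {
   if (filePath.endsWith(".doc")) {
    // OLE2 binary format (Word 97-2003)
    WordExtractor ex = new WordExtractor(is);
    text = ex.getText();
    ex.close();
   } else if (filePath.endsWith(".docx")) {
    // OOXML format (Word 2007 and later)
    XWPFDocument doc = new XWPFDocument(is);
    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
    text = extractor.getText();
    extractor.close();
   }
  } catch (Exception e) {
   logger.error(filePath, e);
  } finally {
   // Close the stream exactly once; close() itself may throw IOException.
   try {
    if (is != null) {
     is.close();
    }
   } catch (IOException e) {
    logger.error(filePath, e);
   }
  }
  return text;
 }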

In theory, this code should work for most doc and docx files. However, I ran into a strange problem: when reading certain doc files, my code would often throw this exception:

org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents.

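What does this exception mean? Put simply, the file you opened is not a doc file, and you should read it with the docx code path instead. But the file we opened clearly has a doc extension!

In fact, doc and docx are fundamentally different: doc is an OLE2 format, while docx is an OOXML format. If you open a docx file with an archive tool, you will find a set of folders inside it.

A docx file is essentially a zip file that contains a number of xml files. So although some docx files are small on disk, the xml files inside them can be quite large, which is why reading a docx file that does not look very big can still consume a lot of memory.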
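To make the size gap concrete, here is a minimal sketch that uses only `java.util.zip`. It does not read a real docx: `buildFakeDocx` is a hypothetical helper that builds an in-memory ZIP with one very repetitive XML part, and `uncompressedSizes` inflates every entry to measure how large it really is. The class and method names are my own for illustration and are not part of POI.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipSizeDemo {

    /** Returns the uncompressed size of every entry, measured by inflating it. */
    static Map<String, Long> uncompressedSizes(InputStream in) throws IOException {
        Map<String, Long> sizes = new LinkedHashMap<>();
        byte[] buf = new byte[8192];
        try (ZipInputStream zis = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                long total = 0;
                int n;
                while ((n = zis.read(buf)) != -1) {
                    total += n;
                }
                sizes.put(entry.getName(), total);
            }
        }
        return sizes;
    }

    /** Hypothetical stand-in for a docx: a ZIP holding one repetitive XML part. */
    static byte[] buildFakeDocx(int paragraphs) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("word/document.xml"));
            byte[] p = "<w:p><w:r><w:t>hello</w:t></w:r></w:p>"
                    .getBytes(StandardCharsets.UTF_8);
            for (int i = 0; i < paragraphs; i++) {
                zos.write(p);
            }
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] archive = buildFakeDocx(50_000);
        System.out.println("archive size in bytes: " + archive.length);
        uncompressedSizes(new ByteArrayInputStream(archive)).forEach(
                (name, size) -> System.out.println(name + " inflates to " + size + " bytes"));
    }
}
```

Running `main` prints the archive size next to the inflated size of the XML part, and a parser that builds the whole document in memory has to deal with the latter.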

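I then opened the problematic doc file with an archive tool and, sure enough, its contents had exactly the structure described above, so for all practical purposes it is a docx file. It was probably saved in some compatibility mode, which is what caused this nasty problem. The upshot is that the file extension is not a reliable way to tell doc from docx.

Honestly, I do not think this is a rare problem, yet I could not find anything about it on Google. The Stack Overflow question "how to know whether a file is .docx or .doc format from Apache POI" suggests using ZipInputStream to check whether a file is a docx file: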

boolean isZip = new ZipInputStream( fileStream ).getNextEntry() != null;
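One thing worth checking before adopting this one-liner is what it does to the stream. The sketch below uses only the standard library, and its class and method names are illustrative. It shows that the probe consumes bytes even when the data is not a ZIP, and that a mark/reset pair around the probe rewinds the stream. The `readLimit` of 1024 is an assumption that the first entry's header, name, and extra field fit in that many bytes.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipInputStream;

final class ZipProbeDemo {

    /**
     * The one-liner from above. The ZipInputStream is deliberately not closed,
     * because closing it would also close the caller's stream.
     */
    static boolean looksLikeZip(InputStream in) throws IOException {
        return new ZipInputStream(in).getNextEntry() != null;
    }

    /** Same probe, but rewinds afterwards; the stream must support mark/reset. */
    static boolean looksLikeZipAndRewind(InputStream in, int readLimit) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("stream must support mark/reset");
        }
        in.mark(readLimit);
        try {
            return looksLikeZip(in);
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] notZip = new byte[64]; // 64 zero bytes, clearly not a ZIP archive

        InputStream raw = new ByteArrayInputStream(notZip);
        System.out.println("zip? " + looksLikeZip(raw));
        System.out.println("bytes left after plain probe: " + raw.available());

        InputStream buffered = new BufferedInputStream(new ByteArrayInputStream(notZip));
        System.out.println("zip? " + looksLikeZipAndRewind(buffered, 1024));
        System.out.println("bytes left after rewinding probe: " + buffered.available());
    }
}
```

After the plain probe, a legacy doc extractor handed the same stream would start reading somewhere past the file header, which matches the failure described next.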

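However, I do not think this is a good approach, because I have to construct a ZipInputStream just for the check, which is clearly undesirable. This operation also seems to affect the InputStream, so reading a normal doc file afterwards runs into problems. Alternatively, you could use a File object to check whether the file is a zip. That is not a good option for me either, because I also need to read doc and docx files that sit inside archives, so my input has to be an InputStream. I spent a long time going back and forth with people on Stack Overflow, and at times I honestly wondered whether they understood what I was asking, but in the end someone gave me a solution that delighted me: FileMagic. It is a feature newly added in POI 3.17: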

public enum FileMagic {
 /** OLE2 / BIFF8+ stream used for Office 97 and higher documents */
 OLE2(HeaderBlockConstants._signature),
 /** OOXML / ZIP stream */
 OOXML(OOXML_FILE_HEADER),
 /** XML file */
 XML(RAW_XML_FILE_HEADER),
 /** BIFF2 raw stream - for Excel 2 */
 BIFF2(new byte[]{
   0x09, 0x00, // sid=0x0009
   0x04, 0x00, // size=0x0004
   0x00, 0x00, // unused
   0x70, 0x00 // 0x70 = multiple values
 }),
 /** BIFF3 raw stream - for Excel 3 */
 BIFF3(new byte[]{
   0x09, 0x02, // sid=0x0209
   0x06, 0x00, // size=0x0006
   0x00, 0x00, // unused
   0x70, 0x00 // 0x70 = multiple values
 }),
 /** BIFF4 raw stream - for Excel 4 */
 BIFF4(new byte[]{
   0x09, 0x04, // sid=0x0409
   0x06, 0x00, // size=0x0006
   0x00, 0x00, // unused
   0x70, 0x00 // 0x70 = multiple values
 },new byte[]{
   0x09, 0x04, // sid=0x0409
   0x06, 0x00, // size=0x0006
   0x00, 0x00, // unused
   0x00, 0x01
 }),
 /** Old MS Write raw stream */
 MSWRITE(
   new byte[]{0x31, (byte)0xbe, 0x00, 0x00 },
   new byte[]{0x32, (byte)0xbe, 0x00, 0x00 }),
 /** RTF document */
 RTF("{\\rtf"),
 /** PDF document */
 PDF("%PDF"),
 // keep UNKNOWN always as last enum!
 /** UNKNOWN magic */
 UNKNOWN(new byte[0]);

 final byte[][] magic;

 FileMagic(long magic) {
  this.magic = new byte[1][8];
  LittleEndian.putLong(this.magic[0], 0, magic);
 }

 FileMagic(byte[]... magic) {
  this.magic = magic;
 }

 FileMagic(String magic) {
  this(magic.getBytes(LocaleUtil.CHARSET_1252));
 }

 public static FileMagic valueOf(byte[] magic) {
  for (FileMagic fm : values()) {
   int i=0;
   boolean found = true;
   for (byte[] ma : fm.magic) {
    for (byte m : ma) {
     byte d = magic[i++];
     if (!(d == m || (m == 0x70 && (d == 0x10 || d == 0x20 || d == 0x40)))) {
      found = false;
      break;
     }
    }
    if (found) {
     return fm;
    }
   }
  }
  return UNKNOWN;
 }

 /**
  * Get the file magic of the supplied InputStream (which MUST
  * support mark and reset).<p>
  *
  * If unsure if your InputStream does support mark / reset,
  * use {@link #prepareToCheckMagic(InputStream)} to wrap it and make
  * sure to always use that, and not the original!<p>
  *
  * Even if this method returns {@link FileMagic#UNKNOWN} it could potentially mean,
  * that the ZIP stream has leading junk bytes
  *
  * @param inp An InputStream which supports either mark/reset
  */
 public static FileMagic valueOf(InputStream inp) throws IOException {
  if (!inp.markSupported()) {
   throw new IOException("getFileMagic() only operates on streams which support mark(int)");
  }

  // Grab the first 8 bytes
  byte[] data = IOUtils.peekFirst8Bytes(inp);

  return FileMagic.valueOf(data);
 }


 /**
  * Checks if an {@link InputStream} can be reseted (i.e. used for checking the header magic) and wraps it if not
  *
  * @param stream stream to be checked for wrapping
  * @return a mark enabled stream
  */
 public static InputStream prepareToCheckMagic(InputStream stream) {
  if (stream.markSupported()) {
   return stream;
  }
  // we used to process the data via a PushbackInputStream, but user code could provide a too small one
  // so we use a BufferedInputStream instead now
  return new BufferedInputStream(stream);
 }
}

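Only the main code is shown here. It determines the file type from the first 8 bytes of the InputStream, which is without doubt the most elegant solution. In fact, I had wondered early on whether the first few bytes of such files are defined differently for each format, that is, whether there is a magic number. Because FileMagic's dependencies are compatible with POI 3.16, I only needed to add this one class to my project. So the correct way to read a Word file is now: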

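 // Additionally requires: import org.apache.poi.poifs.filesystem.FileMagic;
 private static String readDoc(String filePath, InputStream is) {
  String text = "";
  try {
   // Wrap the stream if necessary so its header can be peeked at and rewound.
   is = FileMagic.prepareToCheckMagic(is);
   // Decide by content, not by extension; the check does not consume the stream.
   FileMagic magic = FileMagic.valueOf(is);
   if (magic == FileMagic.OLE2) {
    WordExtractor ex = new WordExtractor(is);
    text = ex.getText();
    ex.close();
   } else if (magic == FileMagic.OOXML) {
    XWPFDocument doc = new XWPFDocument(is);
    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
    text = extractor.getText();
    extractor.close();
   }
  } catch (Exception e) {
   logger.error("for file " + filePath, e);
  } finally {
   // close() may throw IOException, so it needs its own handler.
   try {
    if (is != null) {
     is.close();
    }
   } catch (IOException e) {
    logger.error("for file " + filePath, e);
   }
  }
  return text;
 }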
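For readers who cannot pull in POI 3.17, the idea behind FileMagic can be approximated with the standard library alone. The following is a simplified sketch and not POI's implementation: `MagicSniffer`, `Kind`, `markable`, and `sniff` are names I made up, and only two signatures are checked, the OLE2 compound-document header and the ZIP local-file header that OOXML files start with.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class MagicSniffer {

    enum Kind { OLE2, ZIP, UNKNOWN }

    // OLE2 compound document signature (legacy doc/xls/ppt files).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature "PK\003\004" (docx/xlsx are ZIP containers).
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    /** Wraps the stream only when it cannot be marked and reset already. */
    static InputStream markable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Peeks at up to 8 bytes and rewinds, so the caller can still read from the start. */
    static Kind sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("sniff() needs a stream that supports mark/reset");
        }
        byte[] head = new byte[8];
        int filled = 0;
        in.mark(head.length);
        try {
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break; // stream is shorter than 8 bytes
                }
                filled += n;
            }
        } finally {
            in.reset();
        }
        if (startsWith(head, filled, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        if (startsWith(head, filled, ZIP_MAGIC)) {
            return Kind.ZIP;
        }
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int len, byte[] prefix) {
        if (len < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] zipLike = { 0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00, 0x08, 0x00 };
        InputStream in = markable(new ByteArrayInputStream(zipLike));
        System.out.println("detected: " + sniff(in));
        System.out.println("first byte is still readable: " + in.read());
    }
}
```

Note that a ZIP signature only says the file is some ZIP archive; telling a docx from an xlsx or a plain zip still requires looking at the entries inside.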

Excel

For Excel, I will not compare my earlier approach with the current one. Here is simply what I now consider my best approach:

 @SuppressWarnings("deprecation" )
 private static String readExcel(String filePath, InputStream inp) throws Exception {
  Workbook wb;
  StringBuilder sb = new StringBuilder();
  try {
   if (filePath.endsWith(".xls")) {
    wb = new HSSFWorkbook(inp);
   } else {
    wb = StreamingReader.builder()
      .rowCacheSize(1000) // number of rows to keep in memory (defaults to 10)
      .bufferSize(4096)  // buffer size to use when reading InputStream to file (defaults to 1024)
      .open(inp);   // InputStream or File for XLSX file (required)
   }
   sb = readSheet(wb, sb, filePath.endsWith(".xls"));
   wb.close();
  } catch (OLE2NotOfficeXmlFileException e) {
   logger.error(filePath, e);
  } finally {
   if (inp != null) {
    inp.close();
   }
  }
  return sb.toString();
 }

 private static String readExcelByFile(String filepath, File file) {
  Workbook wb;
  StringBuilder sb = new StringBuilder();
  try {
   if (filepath.endsWith(".xls")) {
    wb = WorkbookFactory.create(file);
   } else {
    wb = StreamingReader.builder()
      .rowCacheSize(1000) // number of rows to keep in memory (defaults to 10)
      .bufferSize(4096)  // buffer size to use when reading InputStream to file (defaults to 1024)
      .open(file);   // InputStream or File for XLSX file (required)
   }
   sb = readSheet(wb, sb, filepath.endsWith(".xls"));
   wb.close();
  } catch (Exception e) {
   logger.error(filepath, e);
  }
  return sb.toString();
 }

 private static StringBuilder readSheet(Workbook wb, StringBuilder sb, boolean isXls) throws Exception {
  // One formatter is enough for the whole workbook; it renders numbers as Excel displays them.
  DataFormatter formatter = new DataFormatter();
  for (Sheet sheet: wb) {
   for (Row r: sheet) {
    for (Cell cell: r) {
     // Only string and numeric cells are collected; every other cell type is skipped.
     if (cell.getCellType() == Cell.CELL_TYPE_STRING) {
      sb.append(cell.getStringCellValue());
      sb.append(" ");
     } else if (cell.getCellType() == Cell.CELL_TYPE_NUMERIC) {
      if (isXls) {
       sb.append(formatter.formatCellValue(cell));
      } else {
       // Cells from the streaming reader expose their content as a string.
       sb.append(cell.getStringCellValue());
      }
      sb.append(" ");
     }
    }
   }
  }
  return sb;
 }

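For Excel reading, the biggest problem my tool faced was running out of memory: reading certain very large Excel files would often end in an OutOfMemoryError. I eventually found an excellent tool, excel-streaming-reader, which reads xlsx files in a streaming fashion, working through a very large file in small pieces instead of all at once.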
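To illustrate why streaming helps, here is a rough standard-library sketch of the general technique, not of how excel-streaming-reader is implemented. It walks one sheet part of an xlsx with StAX, event by event, and hands each raw `<v>` value to a callback instead of building the whole sheet in memory. The names `SheetStreamingSketch`, `forEachRawValue`, and `buildSampleXlsx` are hypothetical, the sample archive is not a complete workbook, and shared-string indexes are passed through unresolved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class SheetStreamingSketch {

    /** Streams the raw text of every <v> element in the given sheet part to the sink. */
    static void forEachRawValue(InputStream xlsx, String sheetPart, Consumer<String> sink)
            throws IOException, XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The sheet XML needs neither DTDs nor external entities, so switch them off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try (ZipInputStream zis = new ZipInputStream(xlsx)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (!entry.getName().equals(sheetPart)) {
                    continue; // skip parts we are not interested in
                }
                XMLStreamReader xml = factory.createXMLStreamReader(zis);
                while (xml.hasNext()) {
                    if (xml.next() == XMLStreamConstants.START_ELEMENT
                            && "v".equals(xml.getLocalName())) {
                        sink.accept(xml.getElementText());
                    }
                }
                xml.close();
                return;
            }
        }
    }

    /** Hypothetical sample: a ZIP with a single minimal sheet part, not a full workbook. */
    static byte[] buildSampleXlsx() throws IOException {
        String sheet =
            "<worksheet xmlns=\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\">"
          + "<sheetData>"
          + "<row r=\"1\"><c r=\"A1\"><v>1</v></c><c r=\"B1\"><v>2.5</v></c></row>"
          + "<row r=\"2\"><c r=\"A2\"><v>42</v></c></row>"
          + "</sheetData></worksheet>";
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("xl/worksheets/sheet1.xml"));
            zos.write(sheet.getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        forEachRawValue(new ByteArrayInputStream(buildSampleXlsx()),
                "xl/worksheets/sheet1.xml",
                value -> System.out.println("cell value: " + value));
    }
}
```

Memory use here depends on the size of a single cell rather than on the size of the sheet, which is the property that matters for very large files. A real reader also has to handle shared strings, styles, and number formats, which is what a library such as excel-streaming-reader takes care of.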

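Another optimization: wherever a File object is available, I read the file through the File object rather than through an InputStream, because reading from an InputStream requires loading all of it into memory, which uses a great deal of memory.

Finally, one small trick is to use cell.getCellType to cut down the amount of data, because I only need the string content of text and numeric cells.

These are the things I explored and discovered while reading files with POI, and I hope they are helpful to you. The examples above are also used in one of my tools, everywhere, which helps you run full-text content searches on your computer. If you are interested, take a look; stars and PRs are welcome.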
