
Hi, I am trying to read text from .doc and .docx files. For .doc files I am doing this:

package test;

import java.io.File;
import java.io.FileInputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class ReadFile {
    public static void main(String[] args) {
        File file = new File("C:\\Users\\rijo\\Downloads\\r.doc");
        // try-with-resources closes the stream even if parsing fails
        try (FileInputStream fis = new FileInputStream(file)) {
            HWPFDocument document = new HWPFDocument(fis);
            WordExtractor extractor = new WordExtractor(document);
            String fileData = extractor.getText();
            System.out.println(fileData);
        } catch (Exception exep) {
            // print the failure instead of silently swallowing it
            exep.printStackTrace();
        }
    }
}

But this gives me an org/apache/poi/OldFileFormatException exception.

Any idea how to fix this?

I also need to read .docx and PDF files. Is there a good way to read all of these file types?

  • Which version of POI are you using? Commented Oct 14, 2013 at 12:28

3 Answers


I used the following jars (in case version numbers play a role here):

dom4j-1.7-20060614
poi-3.9-20121203
poi-ooxml-3.9-20121203
poi-ooxml-schemas-3.9-20121203
poi-scratchpad-3.9-20121203
xmlbeans-2.4.0

I typed this up:

import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class SO {
    public static void main(String[] args) {
        // Alternate between the two to check what works.
        //String filePath = "D:\\Users\\username\\Desktop\\Doc1.docx";
        String filePath = "D:\\Users\\username\\Desktop\\Bob.doc";

        // try-with-resources closes the stream even if parsing fails
        try (FileInputStream fis = new FileInputStream(filePath)) {
            if (filePath.toLowerCase().endsWith(".docx")) { // is a docx
                XWPFDocument doc = new XWPFDocument(fis);
                XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
                System.out.println(extractor.getText());
            } else { // is not a docx
                HWPFDocument doc = new HWPFDocument(fis);
                WordExtractor extractor = new WordExtractor(doc);
                System.out.println(extractor.getText());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This let me read text from both a .docx and a .doc file. If it doesn't work on your PC, you may well have an issue with the external jars you are using.
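
Since the code above picks the parser from the file name, a renamed or mislabeled file would go down the wrong branch. Here is a minimal sketch, using only the JDK, of checking the first few bytes of the file instead. The class FormatSniffer, its Format enum and its method names are made up for this example and are not part of POI or PDFBox. Keep in mind that the signatures only identify the container: other OLE2 files (.xls, .ppt) and other ZIP files (.xlsx, .jar) would match as well.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class FormatSniffer {
    // Hypothetical labels for this sketch only.
    public enum Format { DOC, DOCX, PDF, UNKNOWN }

    // OLE2 compound file signature, used by binary .doc files.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header, used by .docx and every other ZIP-based format.
    private static final byte[] ZIP = { 'P', 'K', 0x03, 0x04 };
    // PDF files start with the ASCII text "%PDF".
    private static final byte[] PDF = { '%', 'P', 'D', 'F' };

    // Classifies a file from its leading bytes.
    public static Format classify(byte[] head) {
        if (startsWith(head, OLE2)) return Format.DOC;
        if (startsWith(head, ZIP)) return Format.DOCX;
        if (startsWith(head, PDF)) return Format.PDF;
        return Format.UNKNOWN;
    }

    // Reads up to 8 bytes from the stream and classifies them.
    // This consumes those bytes, so reopen the file before parsing it.
    public static Format sniff(InputStream in) throws IOException {
        byte[] head = new byte[8];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }
        return classify(Arrays.copyOf(head, n));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

You would call FormatSniffer.sniff on a fresh FileInputStream, close it, and then open the file again for HWPFDocument or XWPFDocument depending on the result.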

Give it a go though :) Good luck!


1 Comment

@RijoJoseph I have updated my answer based on your earlier comment.

If you look at the Javadoc for OldFileFormatException, you can see the reason for that:

Base class of all the exceptions that POI throws in the event that it's given a file that's older than currently supported.

This means that the r.doc you're using is in a format older than HWPFDocument supports. Maybe it only supports the more recent formats (docx has also been around for quite a long time now; I'm not sure which versions of the doc format Apache POI supports in HWPFDocument).

1 Comment

I tried with a .docx file, but I get the same exception. Do you know any other way to read .doc, .docx and .pdf files?

I do not know why you are using WordExtractor just to get text from .doc. For me it was enough to use one method:

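import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
...
File fin = new File(yourFilePath);
FileInputStream fis = new FileInputStream(fin);
HWPFDocument doc = new HWPFDocument(fis);
// getDocumentText() returns the document's text directly, no extractor needed
String text = doc.getDocumentText();
System.out.println(text);
...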

To work with .pdf files, use another Apache library: PDFBox.
