Parse a PDF file using a MapReduce program in Hadoop

Question

I need to parse a PDF document, and I already have a Java program that does so. The parsing relies on the font information of the paragraphs in the PDF, so I cannot convert the PDF to a text file first: the conversion would lose that font information. Instead, I parse the PDF directly, font information included, using Apache PDFBox. I load the PDF file with the following code:

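// Needs java.io.File and org.apache.pdfbox.pdmodel.PDDocument
String inputFile = "/home/Desktop/CTT/bcreg20130702a.pdf";
File input = new File(inputFile);
PDDocument pd = PDDocument.load(input); // throws IOException if the file cannot be read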

Now I need to write a MapReduce program to parse PDF documents. I cannot use a PDF file directly as the input to the map() function of a MapReduce program. I used WholeFileInputFormat to pass the entire document as a single split, but it gives me the file contents as a BytesWritable (value) and the file name (key).

I also have that PDF in SequenceFile format.

How can I use PDFBox with this SequenceFile or with WholeFileInputFormat? The font information must also be retained; without it I cannot parse my PDF.

Ashish · Accepted Answer · 2013-09-11 10:22:48Z

3

You can create a SequenceFile to contain the PDF files. SequenceFile is a binary file format, so each record in the SequenceFile can hold one PDF. To do this, create a class that implements Writable and contains the PDF along with any metadata you need. You can then use any Java PDF library, such as PDFBox, to manipulate the PDFs.
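
A minimal sketch of such a record class, for illustration only. The name PdfRecord, its fields, and the serialize/deserialize helpers are hypothetical. To stay runnable without Hadoop on the classpath, the class does not declare "implements Writable"; its write(DataOutput) and readFields(DataInput) methods match the signatures that Hadoop's Writable interface requires, so adding that declaration should be the main change needed in a real job.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical record type: one PDF plus its file name as metadata.
// In a Hadoop job this class would also declare
// "implements org.apache.hadoop.io.Writable" (assumption: not done here
// so the sketch compiles with the JDK alone).
class PdfRecord {
    private String fileName = "";
    private byte[] pdfBytes = new byte[0];

    // Hadoop creates Writable instances through a no-arg constructor.
    PdfRecord() { }

    PdfRecord(String fileName, byte[] pdfBytes) {
        this.fileName = fileName;
        this.pdfBytes = pdfBytes.clone();
    }

    String getFileName() { return fileName; }
    byte[] getPdfBytes() { return pdfBytes.clone(); }

    // Serialize: metadata first, then the length-prefixed raw PDF bytes.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(fileName);
        out.writeInt(pdfBytes.length);
        out.write(pdfBytes);
    }

    // Deserialize in exactly the same order as write().
    public void readFields(DataInput in) throws IOException {
        fileName = in.readUTF();
        pdfBytes = new byte[in.readInt()];
        in.readFully(pdfBytes);
    }

    // Helper for trying the round trip without a cluster.
    static byte[] serialize(PdfRecord record) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            record.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper that rebuilds a record from serialized bytes.
    static PdfRecord deserialize(byte[] data) {
        try {
            PdfRecord record = new PdfRecord();
            record.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return record;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder bytes, not a valid PDF; enough to show binary data survives.
        byte[] fakePdf = {'%', 'P', 'D', 'F', 0, (byte) 0xFF};
        PdfRecord copy = deserialize(serialize(new PdfRecord("sample.pdf", fakePdf)));
        System.out.println(copy.getFileName() + " intact: "
                + Arrays.equals(fakePdf, copy.getPdfBytes()));
    }
}
```

Because the record carries the raw PDF bytes, nothing in the file is dropped, fonts included. In a mapper the bytes could then be handed to PDFBox, for example with PDDocument.load(new ByteArrayInputStream(record.getPdfBytes())), assuming a PDFBox release that provides the PDDocument.load(InputStream) overload.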

sudheer · Accepted Answer · 2013-09-13 05:13:37Z

1

You said that you are using your own custom InputFormat (WholeFileInputFormat). In it, use a PDDocument object instead of BytesWritable as the value passed to the map, and load the whole content of the PDF into the PDDocument in nextKeyValue() of WholeFileRecordReader (the custom reader). Also make sure that your isSplitable() returns false so that the whole PDF is loaded.
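
The one-record-per-file behaviour described here can be sketched roughly as follows. This is not the asker's WholeFileRecordReader: the class name WholeFileReaderSketch is hypothetical, it uses plain java.nio instead of Hadoop's FileSystem so it runs without a cluster, and it yields a byte array rather than a PDDocument to stay free of library dependencies. A real reader would extend Hadoop's RecordReader, and the bytes would still have to be passed to PDFBox.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for a whole-file record reader.
// Assumption: a real one extends org.apache.hadoop.mapreduce.RecordReader
// and opens the split through Hadoop's FileSystem API.
class WholeFileReaderSketch {
    private final Path file;
    private boolean processed = false;
    private String currentKey;
    private byte[] currentValue;

    WholeFileReaderSketch(Path file) {
        this.file = file;
    }

    // Returns true exactly once: the first call loads the entire file as a
    // single record, every later call reports that no records remain.
    boolean nextKeyValue() {
        if (processed) {
            return false;
        }
        try {
            currentValue = Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        currentKey = file.getFileName().toString();
        processed = true;
        return true;
    }

    String getCurrentKey() { return currentKey; }
    byte[] getCurrentValue() { return currentValue; }

    // Mirrors the idea behind isSplitable(): returning false keeps one PDF
    // from being cut across several mappers.
    static boolean isSplitable() { return false; }

    public static void main(String[] args) throws IOException {
        // Placeholder content, not a valid PDF.
        Path tmp = Files.createTempFile("sample", ".pdf");
        Files.write(tmp, new byte[] {'%', 'P', 'D', 'F'});
        WholeFileReaderSketch reader = new WholeFileReaderSketch(tmp);
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentKey() + ": "
                    + reader.getCurrentValue().length + " bytes");
        }
        Files.delete(tmp);
    }
}
```

If the value stays a BytesWritable, as in the question, note that only the first getLength() bytes of the array returned by getBytes() are valid data, so the stream given to PDFBox should be limited to that length.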

techvineet · Accepted Answer · 2013-09-11 09:14:13Z

-2

MapReduce needs an input path in HDFS. So you can upload the local file to HDFS (using the Java API) under some path or folder and use that as the input to MapReduce.
