1

I need to parse a PDF document, and I have a Java program that does this. When parsing, I rely on the font information of the paragraphs in the PDF. I do not convert the PDF to text, because converting it to a text file would lose that font information, so I parse the PDF directly, with its font information, using Apache PDFBox. I load the PDF file with the following code:

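import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;

String inputFile = "/home/Desktop/CTT/bcreg20130702a.pdf";
File input = new File(inputFile);
PDDocument pd = PDDocument.load(input); // throws IOException if the file cannot be read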

Now I need to write a MapReduce program to parse PDF documents. I can't use a PDF file directly as input to the map() function of a MapReduce program. I used WholeFileInputFormat to pass the entire document as a single split, but it gives me a BytesWritable as the value and the file name as the key.

I also have that PDF stored as a SequenceFile.

How can I use PDFBox with this SequenceFile or with WholeFileInputFormat? The font information must be retained, because without it I can't parse my PDF.

3 Answers

3

You can create a SequenceFile to contain the PDF files. SequenceFile is a binary file format, so you can make each record in the SequenceFile one PDF. To do this, create a class derived from Writable that contains the PDF and any metadata you need. You can then use any Java PDF library, such as PDFBox, to manipulate the PDFs.
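
A rough sketch of such a record class follows. It mirrors the two methods of Hadoop's Writable contract, write(DataOutput) and readFields(DataInput), but uses only java.io so that it runs standalone; in a real job it would additionally declare "implements org.apache.hadoop.io.Writable". The class name PdfRecord, its fields, and the toBytes/fromBytes helpers are hypothetical and only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record type: one PDF plus its file name.
// In a Hadoop job this would also declare "implements org.apache.hadoop.io.Writable".
final class PdfRecord {
    private String fileName = "";
    private byte[] pdfBytes = new byte[0];

    // Hadoop instantiates Writables reflectively, so a no-arg constructor is kept.
    PdfRecord() {}

    PdfRecord(String fileName, byte[] pdfBytes) {
        this.fileName = fileName;
        this.pdfBytes = pdfBytes.clone();
    }

    // Same signature as Writable.write: metadata first, then length-prefixed raw bytes.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(fileName);
        out.writeInt(pdfBytes.length);
        out.write(pdfBytes);
    }

    // Same signature as Writable.readFields: read fields back in the order written.
    public void readFields(DataInput in) throws IOException {
        fileName = in.readUTF();
        pdfBytes = new byte[in.readInt()];
        in.readFully(pdfBytes);
    }

    String getFileName() { return fileName; }

    byte[] getPdfBytes() { return pdfBytes.clone(); }

    // Convenience helper for trying the class outside Hadoop.
    byte[] toBytes() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Convenience helper: rebuild a record from bytes produced by toBytes().
    static PdfRecord fromBytes(byte[] data) {
        PdfRecord record = new PdfRecord();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            record.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return record;
    }
}
```

In the mapper, the bytes from getPdfBytes() could be wrapped in a ByteArrayInputStream and handed to PDFBox (PDDocument.load(InputStream) in PDFBox 1.x and 2.x). Because the PDF is never converted to text, its font information stays available.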


1

You said that you are using your own custom InputFormat (WholeFileInputFormat). In it, use a PDDocument object instead of BytesWritable as the value passed to map(), and load the whole content of the PDF into the PDDocument in nextKeyValue() of WholeFileRecordReader (the custom reader). Also make sure that your isSplitable() returns false so that the whole PDF is loaded.
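
If the record reader is left returning BytesWritable, the same load can happen inside map() instead. One detail matters there: BytesWritable.getBytes() returns the backing array, which can be longer than the valid data, so only the first getLength() bytes should be handed to PDFBox. Below is a minimal standalone sketch of that step using only java.io; the class name PdfBytes and the method validRegion are hypothetical.

```java
import java.io.ByteArrayInputStream;

// Hypothetical helper: exposes only the valid part of a padded byte buffer.
final class PdfBytes {
    private PdfBytes() {}

    // backing:     what BytesWritable.getBytes() would return (may contain trailing padding)
    // validLength: what BytesWritable.getLength() would return
    static ByteArrayInputStream validRegion(byte[] backing, int validLength) {
        if (validLength < 0 || validLength > backing.length) {
            throw new IllegalArgumentException("validLength out of range: " + validLength);
        }
        // The stream ends after validLength bytes, so the padding is never read.
        return new ByteArrayInputStream(backing, 0, validLength);
    }
}
```

Inside map(), the call would look roughly like PdfBytes.validRegion(value.getBytes(), value.getLength()), with the resulting stream passed to PDDocument.load(InputStream) (PDFBox 1.x and 2.x). The document should be closed once the mapper has finished with it.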


-2

MapReduce needs an input path in HDFS. So you can upload the local file to HDFS (using the Java API) to some path or folder and use that as the input to the MapReduce job.
