I need to parse a PDF document, and I already have a Java program that does this. The parsing relies on the font information of the paragraphs in the PDF, so I can't convert the PDF to a text file first: that would lose the font information. Instead, I parse the PDF directly, with its font information, using Apache PDFBox. I load the PDF file with the following code:
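import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;

String inputFile = "/home/Desktop/CTT/bcreg20130702a.pdf";
File input = new File(inputFile);
PDDocument pd = PDDocument.load(input); // throws IOException if the file cannot be read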
Now I need to write a MapReduce program to parse PDF documents. I can't use a PDF file directly as input to the map() function in a MapReduce program. I used WholeFileInputFormat to pass the entire document as a single split, but it gives me the filename as the key and a BytesWritable as the value.
I also have that PDF stored in SequenceFile format.
How can I use PDFBox with this SequenceFile or with WholeFileInputFormat? The font information must be retained, because without it I can't parse my PDF.
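For reference, here is a minimal sketch of the direction I am considering. It assumes the value's backing array can be longer than its valid length (as I understand BytesWritable.getBytes() and getLength() to behave), so only the valid bytes are wrapped in a stream. The class and method names (PdfBytes, validBytes, asStream) are hypothetical, and the Hadoop and PDFBox calls are left out so the snippet compiles with the JDK alone.

```java
import java.io.ByteArrayInputStream;
import java.util.Arrays;

// Hypothetical helper; names are illustrative and not part of Hadoop or PDFBox.
final class PdfBytes {
    private PdfBytes() {}

    // Copies only the first validLength bytes, dropping any padding in the buffer.
    static byte[] validBytes(byte[] buffer, int validLength) {
        checkLength(buffer, validLength);
        return Arrays.copyOf(buffer, validLength);
    }

    // Wraps the first validLength bytes in a stream without copying them.
    static ByteArrayInputStream asStream(byte[] buffer, int validLength) {
        checkLength(buffer, validLength);
        return new ByteArrayInputStream(buffer, 0, validLength);
    }

    private static void checkLength(byte[] buffer, int validLength) {
        if (validLength < 0 || validLength > buffer.length) {
            throw new IllegalArgumentException("validLength out of range: " + validLength);
        }
    }
}
```

Inside map(), assuming the value is a BytesWritable, the idea would be to call PdfBytes.asStream(value.getBytes(), value.getLength()) and hand the resulting stream to PDDocument.load(InputStream), assuming the PDFBox version in use provides that overload. Since the raw PDF bytes are passed through unchanged, the font information should be preserved, but I have not verified this in a running job.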