Hi, I am using the external PDFBox library to parse PDF input files in a MapReduce job, but I am getting the following error.

Error: java.lang.ClassNotFoundException: org.apache.pdfbox.pdmodel.PDDocument
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at com.nielsen.grfe.processor.mapreduce.Pdfparser$PdfLineRecordReader.initialize(Pdfparser.java:109)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

I am using the following dependencies:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>1.8.10</version>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>fontbox</artifactId>
    <version>1.8.5</version>
</dependency>
  • @prashant khunt I have added the distributed cache in the code, but I still get the same error. Commented Dec 9, 2015 at 13:53

1 Answer

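The stack trace goes through YarnChild, so the class is missing in the task JVM on the cluster. Declaring the dependency in the pom only makes it available at compile time; the jar must also be on the classpath where the tasks run.

1) Place the PDFBox jar in the Hadoop lib folder as well, so that the library is available to Hadoop at runtime.

2) Restart the Hadoop cluster.

Or

1) Make the PDFBox library available to the Hadoop tasks by adding it to the distributed cache.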
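To check whether either option actually put the jar on the task classpath, a small diagnostic can help. The following is a minimal sketch using only the Java standard library; the class name `ClasspathCheck` and the method `isClassAvailable` are hypothetical names chosen for illustration and are not part of Hadoop or PDFBox.

```java
// Hypothetical helper (not part of Hadoop or PDFBox): reports whether a class
// can be found by a given class loader, without initializing the class.
class ClasspathCheck {

    /** Returns true if the named class can be located by the given loader. */
    public static boolean isClassAvailable(String className, ClassLoader loader) {
        try {
            // 'false' means: locate the class but do not run its static initializers.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        String name = "org.apache.pdfbox.pdmodel.PDDocument";
        System.out.println(name + " available: " + isClassAvailable(name, loader));
        // The classpath of the current JVM shows which jars were actually provided.
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
    }
}
```

Calling the same check at the start of PdfLineRecordReader.initialize and reading the task logs would show whether the task JVM sees the class, as opposed to the machine that submits the job.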


2 Comments

org.apache.pdfbox.pdmodel.PDDocument is not available in PDFBox 1.8.10.
According to the Javadocs, pdfbox.apache.org/docs/1.8.10/javadocs/org/apache/pdfbox/… the class is present in PDFBox 1.8.10. The class is also present in the jar file. Can you please paste the exception that you are getting?
