
Spark fails while reading data from MongoDB (ver 7.0) and DocumentDB (ver 4.0): the load into a DataFrame succeeds, but calling isEmpty() on the loaded Dataset<Row> throws an error. The SparkSession and DataFrameReader initialize fine; a call to any action such as isEmpty() then fails with the stack trace provided below. Versions of the libraries used: mongo-spark-connector_2.13 10.5.0 (the pom and server logs below show 10.4.1), spark-core_2.13 4.0.0, spark-sql_2.13 4.0.0, morphia-core 2.5.0.

Stacktrace:

org.apache.spark.SparkException: [INTERNAL_ERROR] The "isEmpty" action failed. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace. SQLSTATE: XX000
        at org.apache.spark.SparkException$.internalError(SparkException.scala:107) ~[spark-common-utils_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.execution.QueryExecution$.toInternalError(QueryExecution.scala:643) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:656) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.classic.Dataset.$anonfun$withAction$1(Dataset.scala:2232) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$8(SQLExecution.scala:162) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.execution.SQLExecution$.withSessionTagsApplied(SQLExecution.scala:268) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$7(SQLExecution.scala:124) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94) ~[spark-core_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.artifact.ArtifactManager.$anonfun$withResources$1(ArtifactManager.scala:112) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.artifact.ArtifactManager.withClassLoaderIfNeeded(ArtifactManager.scala:106) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.artifact.ArtifactManager.withResources(ArtifactManager.scala:111) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$6(SQLExecution.scala:124) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:291) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$1(SQLExecution.scala:123) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804) ~[spark-sql-api_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId0(SQLExecution.scala:77) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:233) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.classic.Dataset.withAction(Dataset.scala:2232) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.classic.Dataset.isEmpty(Dataset.scala:559) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at com.capitalone.embossing.grayhair.job.EmbGhrFileGenerationJob.createGrayHairFile(EmbGhrFileGenerationJob.java:120) ~[classes/:?]
        at com.capitalone.embossing.grayhair.job.EmbGhrFileGenerationJob.doImbFileCreationProcess(EmbGhrFileGenerationJob.java:91) ~[classes/:?]
        at com.capitalone.embossing.grayhair.EmbGhrJobLauncher.processGrayhairFileCreation(EmbGhrJobLauncher.java:68) ~[classes/:?]
        at com.capitalone.embossing.grayhair.EmbGhrJobLauncher.run(EmbGhrJobLauncher.java:38) ~[classes/:?]
        at com.capitalone.embossing.grayhair.EmbGhrFileCreation.executeJob(EmbGhrFileCreation.java:33) ~[classes/:?]
        at com.capitalone.embossing.grayhair.EmbGhrLauncher.perform(EmbGhrLauncher.java:55) ~[classes/:?]
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) ~[?:?]
        at java.base/java.lang.reflect.Method.invoke(Method.java:577) ~[?:?]
        at org.springframework.scheduling.support.ScheduledMethodRunnable.runInternal(ScheduledMethodRunnable.java:130) ~[spring-context-6.2.8.jar:6.2.8]
        at org.springframework.scheduling.support.ScheduledMethodRunnable.lambda$run$2(ScheduledMethodRunnable.java:124) ~[spring-context-6.2.8.jar:6.2.8]
        at io.micrometer.observation.Observation.observe(Observation.java:498) [micrometer-observation-1.14.7.jar:1.14.7]
        at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:124) [spring-context-6.2.8.jar:6.2.8]
        at org.springframework.scheduling.config.Task$OutcomeTrackingRunnable.run(Task.java:85) [spring-context-6.2.8.jar:6.2.8]
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) [spring-context-6.2.8.jar:6.2.8]
        at org.springframework.scheduling.concurrent.ReschedulingRunnable.run(ReschedulingRunnable.java:96) [spring-context-6.2.8.jar:6.2.8]
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.base/java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.lang.NullPointerException: Cannot invoke "org.apache.spark.sql.classic.SparkSession.sparkContext()" because the return value of "org.apache.spark.sql.execution.SparkPlan.session()" is null
        at org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:68) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.execution.CollectLimitExec.readMetrics$lzycompute(limit.scala:68) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.execution.CollectLimitExec.readMetrics(limit.scala:67) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.execution.CollectLimitExec.metrics$lzycompute(limit.scala:69) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.execution.CollectLimitExec.metrics(limit.scala:69) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.execution.SparkPlan.resetMetrics(SparkPlan.scala:147) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.classic.Dataset.$anonfun$withAction$2(Dataset.scala:2233) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:654) ~[spark-sql_2.13-4.0.0.jar:4.0.0]
        ... 37 more

Code used to read data from MongoDB into a Spark Dataset<Row>:

    import org.apache.spark.SparkConf;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    // Basic Spark configuration for the job.
    SparkConf conf = new SparkConf();
    conf.setAppName("EmbGhrJobScheduler")
        .setMaster(appConfig.getSparkMaster())
        .set("spark.app.id", "MongoSparkConnector")
        .set("spark.driver.memory", "12g")
        .set("spark.executor.memory", "24g");

    // Initializing the SparkSession.
    SparkSession session = SparkSession.builder().config(conf).getOrCreate();

    // Load the collection into a Dataset<Row> (the variable name is historical;
    // load() returns a Dataset, not a DataFrameReader). The connection URI
    // (spark.mongodb.read.connection.uri) is presumably configured elsewhere.
    // Note: the spark.mongodb.input.* prefixes below are legacy 3.x option names;
    // the 10.x connector documents spark.mongodb.read.* instead.
    Dataset<Row> dataFrameReader = session.read().format("mongodb")
                    .option("spark.mongodb.read.database", "embsgdb01")
                    .option("spark.mongodb.read.collection", "T_EMB_IMB_GRAYHAIR")
                    .option("spark.mongodb.write.collection", "T_EMB_IMB_GRAYHAIR")
                    .option("readPreference.name", "secondaryPreferred")
                    .option("spark.mongodb.input.tls.enabled", "true")
                    .option("spark.mongodb.input.tlsAllowInvalidHostnames", "true")
                    .option("spark.mongodb.input.inferSchema", "true")
                    .option("spark.mongodb.input.collection", "*")
                    .option("connectTimeoutMS", "15000") // 15 seconds
                    .option("socketTimeoutMS", "30000")  // 30 seconds
                    .option("maxTimeMS", "60000")
                    .option("spark.mongodb.input.partitioner", "com.mongodb.spark.sql.connector.read.partitioner.PaginateIntoPartitionsPartitioner")
                    .load();

    boolean empty = dataFrameReader.isEmpty(); // This is the line that triggers the error above.
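For comparison, a session-only probe (an assumption added for debugging, not part of the original job) can confirm whether isEmpty() fails in general or only on the connector-backed Dataset:

    // Hypothetical isolation check: an empty Dataset built from the session
    // itself, never touching the MongoDB connector.
    Dataset<Row> probe = session.range(0).toDF();
    // If this returns true without throwing while the MongoDB-backed Dataset
    // fails, the NPE is specific to the connector's scan/CollectLimitExec path.
    boolean probeEmpty = probe.isEmpty();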

MongoDB server logs captured while the mongo-spark-connector throws the error:


{"t":{"$date":"2025-08-08T20:57:32.935-04:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"127.0.0.1:50649","isLoadBalanced":false,"uuid":{"uuid":{"$uuid":"870cc5db-3955-4607-a9fa-fed4809f944a"}},"connectionId":211,"connectionCount":17}}
{"t":{"$date":"2025-08-08T20:57:32.940-04:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn211","msg":"client metadata","attr":{"remote":"127.0.0.1:50649","client":"conn211","negotiatedCompressors":[],"doc":{"driver":{"name":"mongo-java-driver|sync|mongo-spark-connector|source","version":"5.5.1|10.4.1"},"os":{"type":"Darwin","name":"Mac OS X","architecture":"x86_64","version":"15.6"},"platform":"Java/Oracle Corporation/18.0.1+10-24|Scala/2.13.16/Spark/4.0.0"}}}
{"t":{"$date":"2025-08-08T20:57:32.944-04:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"127.0.0.1:50650","isLoadBalanced":false,"uuid":{"uuid":{"$uuid":"feaa6bd6-43dd-438d-a406-b17efb66f069"}},"connectionId":212,"connectionCount":18}}
{"t":{"$date":"2025-08-08T20:57:32.945-04:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn212","msg":"client metadata","attr":{"remote":"127.0.0.1:50650","client":"conn212","negotiatedCompressors":[],"doc":{"driver":{"name":"mongo-java-driver|sync|mongo-spark-connector|source","version":"5.5.1|10.4.1"},"os":{"type":"Darwin","name":"Mac OS X","architecture":"x86_64","version":"15.6"},"platform":"Java/Oracle Corporation/18.0.1+10-24|Scala/2.13.16/Spark/4.0.0"}}}
{"t":{"$date":"2025-08-08T20:57:32.949-04:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"127.0.0.1:50651","isLoadBalanced":false,"uuid":{"uuid":{"$uuid":"b95cdfc1-fbe3-4fcc-8a82-2be54862fcf1"}},"connectionId":213,"connectionCount":19}}
{"t":{"$date":"2025-08-08T20:57:32.949-04:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn213","msg":"client metadata","attr":{"remote":"127.0.0.1:50651","client":"conn213","negotiatedCompressors":[],"doc":{"driver":{"name":"mongo-java-driver|sync|mongo-spark-connector|source","version":"5.5.1|10.4.1"},"os":{"type":"Darwin","name":"Mac OS X","architecture":"x86_64","version":"15.6"},"platform":"Java/Oracle Corporation/18.0.1+10-24|Scala/2.13.16/Spark/4.0.0"}}}
{"t":{"$date":"2025-08-08T20:57:32.951-04:00"},"s":"I",  "c":"NETWORK",  "id":6788700, "ctx":"conn213","msg":"Received first command on ingress connection since session start or auth handshake","attr":{"elapsedMillis":1}}



{"t":{"$date":"2025-08-08T20:57:38.427-04:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn213","msg":"Connection ended","attr":{"remote":"127.0.0.1:50651","isLoadBalanced":false,"uuid":{"uuid":{"$uuid":"b95cdfc1-fbe3-4fcc-8a82-2be54862fcf1"}},"connectionId":213,"connectionCount":18}}
{"t":{"$date":"2025-08-08T20:57:38.428-04:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn212","msg":"Connection ended","attr":{"remote":"127.0.0.1:50650","isLoadBalanced":false,"uuid":{"uuid":{"$uuid":"feaa6bd6-43dd-438d-a406-b17efb66f069"}},"connectionId":212,"connectionCount":17}}
{"t":{"$date":"2025-08-08T20:57:38.962-04:00"},"s":"W",  "c":"NETWORK",  "id":4615610, "ctx":"conn211","msg":"Failed to check socket connectivity","attr":{"error":{"code":9001,"codeName":"SocketException","errmsg":"Couldn't peek from underlying socket"}}}
{"t":{"$date":"2025-08-08T20:57:38.962-04:00"},"s":"I",  "c":"-",        "id":20883,   "ctx":"conn211","msg":"Interrupted operation as its client disconnected","attr":{"opId":661505}}
{"t":{"$date":"2025-08-08T20:57:38.963-04:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn211","msg":"Connection ended","attr":{"remote":"127.0.0.1:50649","isLoadBalanced":false,"uuid":{"uuid":{"$uuid":"870cc5db-3955-4607-a9fa-fed4809f944a"}},"connectionId":211,"connectionCount":16}}

pom.xml:

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.13</artifactId>
            <version>4.0.0</version>
            <exclusions>
                <exclusion>
                    <artifactId>log4j</artifactId>
                    <groupId>log4j</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>jackson-databind</artifactId>
                    <groupId>com.fasterxml.jackson.core</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>jackson-core</artifactId>
                    <groupId>com.fasterxml.jackson.core</groupId>
                </exclusion>
                <exclusion>
                    <groupId>org.apache.avro</groupId>
                    <artifactId>avro</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>xerces</groupId>
                    <artifactId>xercesImpl</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.tukaani</groupId>
                    <artifactId>xz</artifactId>
                </exclusion>
                <exclusion>
                    <artifactId>commons-httpclient</artifactId>
                    <groupId>commons-httpclient</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>jersey-container-servlet</artifactId>
                    <groupId>org.glassfish.jersey.containers</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>jersey-client</artifactId>
                    <groupId>org.glassfish.jersey.core</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>jersey-common</artifactId>
                    <groupId>org.glassfish.jersey.core</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>zookeeper</artifactId>
                    <groupId>org.apache.zookeeper</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>metrics-graphite</artifactId>
                    <groupId>io.dropwizard.metrics</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>osgi-resource-locator</artifactId>
                    <groupId>org.glassfish.hk2</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>jersey-media-jaxb</artifactId>
                    <groupId>org.glassfish.jersey.media</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>javax.activation-api</artifactId>
                    <groupId>javax.activation</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>jetty-util</artifactId>
                    <groupId>org.mortbay.jetty</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>javax.annotation-api</artifactId>
                    <groupId>javax.annotation</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>jackson-xc</artifactId>
                    <groupId>org.codehaus.jackson</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>jackson-jaxrs</artifactId>
                    <groupId>org.codehaus.jackson</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>javax.ws.rs-api</artifactId>
                    <groupId>javax.ws.rs</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>javax.inject</artifactId>
                    <groupId>org.glassfish.hk2.external</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>jaxb-api</artifactId>
                    <groupId>javax.xml.bind</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>activation</artifactId>
                    <groupId>javax.activation</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>javax.servlet-api</artifactId>
                    <groupId>javax.servlet</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>jackson-mapper-asl</artifactId>
                    <groupId>org.codehaus.jackson</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>jackson-core-asl</artifactId>
                    <groupId>org.codehaus.jackson</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>commons-codec</artifactId>
                    <groupId>commons-codec</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>jakarta.inject</artifactId>
                    <groupId>org.glassfish.hk2.external</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>jakarta.annotation-api</artifactId>
                    <groupId>jakarta.annotation</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>jakarta.ws.rs-api</artifactId>
                    <groupId>jakarta.ws.rs</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>netty-all</artifactId>
                    <groupId>io.netty</groupId>
                </exclusion>
                <exclusion>
                    <groupId>org.apache.commons</groupId>
                    <artifactId>commons-compress</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>commons-io</groupId>
                    <artifactId>commons-io</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>dev.morphia.morphia</groupId>
            <artifactId>morphia-core</artifactId>
            <version>2.5.0</version>

            <exclusions>
                <exclusion>
                    <groupId>org.mongodb</groupId>
                    <artifactId>mongo-java-driver</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.mongodb.spark</groupId>
            <artifactId>mongo-spark-connector_2.13</artifactId>
            <version>10.4.1</version>
            <exclusions>
                <exclusion>
                    <groupId>org.mongodb</groupId>
                    <artifactId>mongodb-driver-sync</artifactId>
                </exclusion>
            </exclusions>
        </dependency>


        <dependency>
            <groupId>org.mongodb</groupId>
            <artifactId>mongodb-driver-sync</artifactId>
            <version>5.5.1</version>
        </dependency>

        <dependency>
            <groupId>org.mongodb</groupId>
            <artifactId>mongodb-driver-core</artifactId>
            <version>5.5.1</version>
        </dependency>
        <dependency>
            <groupId>org.mongodb</groupId>
            <artifactId>bson</artifactId>
            <version>5.5.1</version>
        </dependency>

Comments:
  • Start simple: are you sure you can connect properly to this MongoDB instance, with the required authentication, outside of Spark? Second: try a simple Spark setup, five lines; just get the basic config, run df = session.read().format("mongodb").load() and do df.limit(2).show() or similar (a minimal sketch follows after these comments). Commented Aug 18 at 13:04
  • Connection to the MongoDB instance appears to be working: data from the MongoDB collections is also read fine when connected directly through Morphia and the mongo-java-driver. Thanks for the response; will start with a simple setup as suggested. Commented Aug 29 at 20:03
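Following the first comment's suggestion, here is a minimal sketch of the stripped-down test (the connection URI is a placeholder and local[*] is an assumption; all memory tuning and TLS options from the full job are deliberately omitted):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class MongoSparkSmokeTest {
        public static void main(String[] args) {
            // Bare session: no memory tuning, no custom app id, local master.
            SparkSession spark = SparkSession.builder()
                    .appName("MongoSparkSmokeTest")
                    .master("local[*]")
                    // Placeholder URI; substitute the real host, credentials,
                    // and TLS settings for the target cluster.
                    .config("spark.mongodb.read.connection.uri",
                            "mongodb://localhost:27017/embsgdb01.T_EMB_IMB_GRAYHAIR")
                    .getOrCreate();

            Dataset<Row> df = spark.read().format("mongodb").load();

            // A lightweight action first; if show() works but isEmpty() throws,
            // the failure is narrowed to the isEmpty()/CollectLimitExec path.
            df.limit(2).show();
            spark.stop();
        }
    }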
