
I am new to Spark and Scala. I am trying to create a DataFrame from a JSONArray. Below is my code:

import java.io.FileReader;

import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;

public class JsonParse {
    // Parse the file, pull the "d" object out of the root, and return its "results" array.
    public JSONArray actionItems() {
        JSONParser parser = new JSONParser();
        JSONArray results = null;
        try {
            JSONObject obj = (JSONObject) parser.parse(new FileReader("/data/home/actionitems.json"));
            JSONObject obj2 = (JSONObject) obj.get("d");
            results = (JSONArray) obj2.get("results");
            System.out.println(results);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return results;
    }
}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.hive.HiveContext

object driver {
  val parse = new JsonParse
  val conf = new SparkConf().setAppName("test")
  val sc = new SparkContext(conf)
  sc.setLogLevel("ERROR")
  val hiveContext = new HiveContext(sc)
  val sqlContext = new SQLContext(sc)

  def main(args: Array[String]): Unit = {
    // Serialize the parsed JSONArray to a string, wrap it in a one-element RDD,
    // and let Spark infer the DataFrame schema from it.
    val actionItemsRDD = sc.parallelize(Seq(parse.actionItems.toString))
    val df: DataFrame = hiveContext.read.json(actionItemsRDD)
    df.show
    println("number of records: " + df.count)
  }
}

The Java class JsonParse reads the JSON from a file and returns the JSONArray to the Scala object driver. In driver, I convert the JSON string to an RDD and then create the DataFrame using hiveContext.read.json(actionItemsRDD). I build with Maven and there are no build errors.

However, when I run the jar, I get this error: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.DataFrameReader.json(Lorg/apache/spark/rdd/RDD;)Lorg/apache/spark/sql/Dataset;

It throws the exception at the hiveContext.read.json line. I've done this before without any issues, and I am using the same dependencies as in my previous project. Below is my pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>json</groupId>
  <artifactId>test</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>${project.artifactId}</name>
  <build>
    <sourceDirectory>src</sourceDirectory>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.2</version>
        <executions>
          <execution>
            <id>scala-compile-first</id>
            <phase>process-resources</phase>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>

          <execution>
            <id>scala-test-compile</id>
            <phase>process-test-resources</phase>
            <goals>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.1</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <relocations>
                <relocation>
                  <pattern>org.apache.http</pattern>
                  <shadedPattern>org.shaded.apache.http</shadedPattern>
                </relocation>
              </relocations>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                    <excludes>
                        <exclude>META-INF/*.SF</exclude>
                        <exclude>META-INF/*.DSA</exclude>
                        <exclude>META-INF/*.RSA</exclude>
                    </excludes>
                </filter>
              </filters>
              <shadedArtifactAttached>true</shadedArtifactAttached>
              <shadedClassifierName>shaded</shadedClassifierName>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
  <dependencies>
        <dependency>
            <groupId>com.databricks</groupId>
            <artifactId>spark-csv_2.11</artifactId>
            <version>1.4.0</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive_2.10 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.10</artifactId>
            <version>1.6.0</version>                
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-api</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>1.6.0</version>             
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-api</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>jcl-over-slf4j</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.10.6</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.jodd/jodd -->
        <dependency>
            <groupId>org.jodd</groupId>
            <artifactId>jodd</artifactId>
            <version>3.4.0</version>
            <type>pom</type>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.json/json -->
        <dependency>
            <groupId>org.json</groupId>
            <artifactId>json</artifactId>
            <version>20170516</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.3</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpcore -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpcore</artifactId>
            <version>4.4.4</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.googlecode.json-simple/json-simple -->
        <dependency>
            <groupId>com.googlecode.json-simple</groupId>
            <artifactId>json-simple</artifactId>
            <version>1.1.1</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.threeten/threetenbp -->
        <dependency>
            <groupId>org.threeten</groupId>
            <artifactId>threetenbp</artifactId>
            <version>1.3.3</version>
        </dependency>
    </dependencies>
</project>

Not sure why this error is showing up and I'm not able to resolve it. Any help would be appreciated. Thank you!

1 Answer

First point - don't parse the data yourself. Spark has built-in support for JSON:

val df = spark.read.json("file:///data/home/actionitems.json")
val newDataset = df.select("d.results")
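
As a quick sanity check (a sketch, reusing df and newDataset from the snippet above), printing the inferred schema shows whether "d" came through as a struct with a "results" field:

df.printSchema()          // should list "d" as a struct containing "results"
newDataset.show(false)    // show(false) disables truncation of wide columns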

You can also use built-in functions such as from_json if you have JSON nested inside JSON ;)
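
For illustration, a minimal sketch of from_json, assuming Spark 2.1+ and a hypothetical string column rawJson holding the nested document; the schema below is a placeholder and would need to match the real structure:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// Hypothetical schema for the nested payload; adjust to the actual shape of "d".
val payloadSchema = new StructType()
  .add("results", ArrayType(StringType))

// "rawJson" is an assumed column name holding a JSON string inside each row.
val parsed = df.withColumn("payload", from_json(col("rawJson"), payloadSchema))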

If your JSON is not line-delimited - one object per line - use the multiLine option and set it to true; your Dataset will then have only one column, as sketched below.
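
A minimal sketch, assuming Spark 2.2+ (where the multiLine option is available):

// With multiLine, the whole file is parsed as one JSON document,
// so the top-level object becomes a single struct column "d".
val df = spark.read
  .option("multiLine", true)
  .json("file:///data/home/actionitems.json")

val results = df.select("d.results")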

Second point - it looks like you have the wrong version of Spark on your cluster, and because of that Spark can't find the expected method.
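
One quick way to confirm this (a sketch; sc is the SparkContext from the driver above) is to print the versions the job actually sees at runtime and compare them against the pom.xml:

// Compile-time dependencies say Spark 1.6.0 / Scala 2.10, but the cluster may differ.
println(s"Spark version on the cluster: ${sc.version}")
println(s"Scala version at runtime: ${scala.util.Properties.versionString}")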

Third point - it's better to update to at least Spark 2.2; it has many improvements.

Fourth point - you have mismatched Scala versions; all components should use the same Scala version. You declare 2.10 for some dependencies and 2.11 for others.


4 Comments

Thanks for the answer. I am parsing the JSON because the JSONArray I need to convert to a DataFrame is nested inside the JSONObject in the file; I extract the JSONArray from the parsed JSONObject. I have similar code in another application and it runs fine without any errors. This is a new application I am trying to develop, and I am not sure why it's giving me a problem when my older application has none; I am also using the same dependencies. As for updating to Spark 2.2, that's not possible for me.
@Hemanth The real answer is the second point. Still, the parsing can be done via Spark; I've modified the answer to provide an example.
I tried the code you posted, and it fails with this error: org.apache.spark.sql.AnalysisException: cannot resolve 'value.d.results' given input columns: [d];
@user6910411 Thanks, nice catch! I will add it to the list in a minute
