
I am new to Spark and Scala. I am trying to create a DataFrame from a JSONArray. Below is my code:

import java.io.FileReader;

import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;

public class JsonParse {
    // Parse the file, pull the "d" object out of the root, and return its "results" array.
    public JSONArray actionItems() {
        JSONParser parser = new JSONParser();
        JSONArray results = null;
        try {
            JSONObject obj = (JSONObject) parser.parse(new FileReader("/data/home/actionitems.json"));
            JSONObject obj2 = (JSONObject) obj.get("d");
            results = (JSONArray) obj2.get("results");
            System.out.println(results);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return results;
    }
}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.hive.HiveContext

object driver {
  val parse = new JsonParse
  val conf = new SparkConf().setAppName("test")
  val sc = new SparkContext(conf)
  sc.setLogLevel("ERROR")
  val hiveContext = new HiveContext(sc)
  val sqlContext = new SQLContext(sc)

  def main(args: Array[String]): Unit = {
    // Serialize the parsed JSONArray to a string, wrap it in a one-element RDD,
    // and let Spark infer the DataFrame schema from it.
    val actionItemsRDD = sc.parallelize(Seq(parse.actionItems.toString))
    val df: DataFrame = hiveContext.read.json(actionItemsRDD)
    df.show
    println("number of records: " + df.count)
  }
}

The Java class JsonParse reads the JSON from a file and returns the JSONArray to the Scala object driver. In driver, I convert the JSON string to an RDD and then create the DataFrame using hiveContext.read.json(actionItemsRDD). I build with Maven and there are no build errors.

However, when I run the jar, I get this error: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.DataFrameReader.json(Lorg/apache/spark/rdd/RDD;)Lorg/apache/spark/sql/Dataset;

It throws the exception at the hiveContext.read.json line. I've done this before without any issues, and I am using the same dependencies as in my previous project. Below is my pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>json</groupId>
  <artifactId>test</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>${project.artifactId}</name>
  <build>
    <sourceDirectory>src</sourceDirectory>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.2</version>
        <executions>
          <execution>
            <id>scala-compile-first</id>
            <phase>process-resources</phase>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>

          <execution>
            <id>scala-test-compile</id>
            <phase>process-test-resources</phase>
            <goals>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.1</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <relocations>
                <relocation>
                  <pattern>org.apache.http</pattern>
                  <shadedPattern>org.shaded.apache.http</shadedPattern>
                </relocation>
              </relocations>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                    <excludes>
                        <exclude>META-INF/*.SF</exclude>
                        <exclude>META-INF/*.DSA</exclude>
                        <exclude>META-INF/*.RSA</exclude>
                    </excludes>
                </filter>
              </filters>
              <shadedArtifactAttached>true</shadedArtifactAttached>
              <shadedClassifierName>shaded</shadedClassifierName>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
  <dependencies>
        <dependency>
            <groupId>com.databricks</groupId>
            <artifactId>spark-csv_2.11</artifactId>
            <version>1.4.0</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive_2.10 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.10</artifactId>
            <version>1.6.0</version>                
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-api</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>1.6.0</version>             
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-api</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>jcl-over-slf4j</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.10.6</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.jodd/jodd -->
        <dependency>
            <groupId>org.jodd</groupId>
            <artifactId>jodd</artifactId>
            <version>3.4.0</version>
            <type>pom</type>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.json/json -->
        <dependency>
            <groupId>org.json</groupId>
            <artifactId>json</artifactId>
            <version>20170516</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.3</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpcore -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpcore</artifactId>
            <version>4.4.4</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.googlecode.json-simple/json-simple -->
        <dependency>
            <groupId>com.googlecode.json-simple</groupId>
            <artifactId>json-simple</artifactId>
            <version>1.1.1</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.threeten/threetenbp -->
        <dependency>
            <groupId>org.threeten</groupId>
            <artifactId>threetenbp</artifactId>
            <version>1.3.3</version>
        </dependency>
    </dependencies>
</project>

Not sure why this error is showing up and I'm not able to resolve it. Any help would be appreciated. Thank you!

1 Answer

First point - don't parse the data yourself. Spark has built-in support for JSON:

val df = spark.read.json("file:///data/home/actionitems.json")
val newDataset = df.select("d.results")
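
As a quick sanity check (a sketch, reusing df and newDataset from the snippet above), printing the inferred schema shows whether "d" came through as a struct with a "results" field:

df.printSchema()          // should list "d" as a struct containing "results"
newDataset.show(false)    // show(false) disables truncation of wide columns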

You can also use built-in functions such as from_json if you have JSON nested inside JSON ;)
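
For illustration, a minimal sketch of from_json, assuming Spark 2.1+ and a hypothetical string column rawJson holding the nested document; the schema below is a placeholder and would need to match the real structure:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// Hypothetical schema for the nested payload; adjust to the actual shape of "d".
val payloadSchema = new StructType()
  .add("results", ArrayType(StringType))

// "rawJson" is an assumed column name holding a JSON string inside each row.
val parsed = df.withColumn("payload", from_json(col("rawJson"), payloadSchema))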

If your JSON is not line-delimited - one object per line - use the multiLine option and set it to true; your Dataset will then have only one column, as sketched below.
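
A minimal sketch, assuming Spark 2.2+ (where the multiLine option is available):

// With multiLine, the whole file is parsed as one JSON document,
// so the top-level object becomes a single struct column "d".
val df = spark.read
  .option("multiLine", true)
  .json("file:///data/home/actionitems.json")

val results = df.select("d.results")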

Second point - it looks like you have the wrong version of Spark on your cluster, and because of that Spark can't find the expected method.
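
One quick way to confirm this (a sketch; sc is the SparkContext from the driver above) is to print the versions the job actually sees at runtime and compare them against the pom.xml:

// Compile-time dependencies say Spark 1.6.0 / Scala 2.10, but the cluster may differ.
println(s"Spark version on the cluster: ${sc.version}")
println(s"Scala version at runtime: ${scala.util.Properties.versionString}")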

Third point - it's better to update to at least Spark 2.2; it has many improvements.

Fourth point - you have mismatched Scala versions; all components should use the same Scala version. You declare 2.10 for some dependencies and 2.11 for others.


4 Comments

Thanks for the answer. I am parsing the JSON because the JSONArray I need to convert to a DataFrame is nested inside the JSONObject in the file; I extract the JSONArray from the parsed JSONObject. I have similar code in another application and it runs fine without any errors. This is a new application I am trying to develop, and I am not sure why it's giving me a problem when my older application has none; I am also using the same dependencies. As for updating to Spark 2.2, that's not possible for me.
@Hemanth The real answer is the second point. Still, the parsing can be done via Spark; I've modified the answer to provide an example.
I tried the code you posted, and it fails with this error: org.apache.spark.sql.AnalysisException: cannot resolve 'value.d.results' given input columns: [d];
@user6910411 Thanks, nice catch! I will add it to the list in a minute
