  • Spark version: 1.6.2
  • Java version: 7

I have a List<String> of data, something like:

[[dev, engg, 10000], [karthik, engg, 20000]..]

I know the schema for this data:

name (String)
degree (String)
salary (Integer)

I tried:

JavaSparkContext jsc = new JavaSparkContext(sc);
JavaRDD<String> rdd = jsc.parallelize(data);
DataFrame df = sqlContext.read().json(rdd);
df.printSchema();
df.show(false);

Output:

root
 |-- _corrupt_record: string (nullable = true)


+-----------------------------+
|_corrupt_record              |
+-----------------------------+
|[dev, engg, 10000]           |
|[karthik, engg, 20000]       |
+-----------------------------+

This happens because the strings in my List<String> are not valid JSON.

Do I need to convert my data to proper JSON, or is there another way to do this?
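
For reference, I believe the json() reader would accept the data if each element were a valid JSON object (one object per string), something like the sketch below, but I would like to avoid converting everything to JSON:

List<String> jsonData = new ArrayList<String>();
jsonData.add("{\"name\": \"dev\", \"degree\": \"engg\", \"salary\": 10000}");
jsonData.add("{\"name\": \"karthik\", \"degree\": \"engg\", \"salary\": 20000}");
// the schema (name, degree, salary) would then be inferred
DataFrame jsonDf = sqlContext.read().json(jsc.parallelize(jsonData));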

  • Why don't you create a Java bean class containing those properties? Then you can have an ArrayList of beans and create the DataFrame using sqlContext.createDataFrame(List<?> data, Class<?> beanClass) — see the sketch after these comments. Commented Apr 26, 2017 at 12:58
  • @abaghel Creating a Java bean class is not possible for every set of data. Commented Apr 26, 2017 at 13:12
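
For reference, a minimal sketch of the bean approach suggested above, using a hypothetical Person bean (Spark derives the column names from the getters):

public static class Person implements java.io.Serializable {
    private String name;
    private String degree;
    private Integer salary;
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getDegree() { return degree; }
    public void setDegree(String degree) { this.degree = degree; }
    public Integer getSalary() { return salary; }
    public void setSalary(Integer salary) { this.salary = salary; }
}

// after parsing each "name, degree, salary" string into a Person:
// DataFrame df = sqlContext.createDataFrame(people, Person.class);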

3 Answers


You can create a DataFrame from the List<String>, then use selectExpr and split to get the desired DataFrame.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SQLContext;

public class SparkSample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkSample").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqc = new SQLContext(jsc);
        // sample data
        List<String> data = new ArrayList<String>();
        data.add("dev, engg, 10000");
        data.add("karthik, engg, 20000");
        // single-column DataFrame ("value") built from the strings
        DataFrame df = sqc.createDataset(data, Encoders.STRING()).toDF();
        df.printSchema();
        df.show();
        // split each string into name/degree/salary columns
        DataFrame df1 = df.selectExpr(
                "split(value, ',')[0] as name",
                "split(value, ',')[1] as degree",
                "split(value, ',')[2] as salary");
        df1.printSchema();
        df1.show();
    }
}

You will get the output below.

root
 |-- value: string (nullable = true)

+--------------------+
|               value|
+--------------------+
|    dev, engg, 10000|
|karthik, engg, 20000|
+--------------------+

root
 |-- name: string (nullable = true)
 |-- degree: string (nullable = true)
 |-- salary: string (nullable = true)

+-------+------+------+
|   name|degree|salary|
+-------+------+------+
|    dev|  engg| 10000|
|karthik|  engg| 20000|
+-------+------+------+

The sample data you provided contains extra spaces. If you want to remove the spaces and have salary typed as integer, you can use the trim and cast functions like below.

// requires: import static org.apache.spark.sql.functions.*;
df1 = df1.select(trim(col("name")).as("name"),
        trim(col("degree")).as("degree"),
        trim(col("salary")).cast("integer").as("salary"));


Alternatively, build a JavaRDD<Row> and apply an explicit schema via createDataFrame. This example splits each line on whitespace into a single array column, then feeds it to Spark ML's NGram transformer to build bigrams (jsc here is a JavaSparkContext from the enclosing scope):

DataFrame createNGramDataFrame(JavaRDD<String> lines) {
    JavaRDD<Row> rows = lines.map(new Function<String, Row>() {
        private static final long serialVersionUID = -4332903997027358601L;

        @Override
        public Row call(String line) throws Exception {
            // cast to Object so the String[] is stored as one array column
            // instead of being spread across varargs as separate columns
            return RowFactory.create((Object) line.split("\\s+"));
        }
    });
    StructType schema = new StructType(new StructField[] {
            new StructField("words",
                    DataTypes.createArrayType(DataTypes.StringType), false,
                    Metadata.empty()) });
    DataFrame wordDF = new SQLContext(jsc).createDataFrame(rows, schema);
    // build a bigram language model
    NGram transformer = new NGram().setInputCol("words")
            .setOutputCol("ngrams").setN(2);
    DataFrame ngramDF = transformer.transform(wordDF);
    ngramDF.show(10, false);
    return ngramDF;
}
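
A quick usage sketch, assuming jsc is the enclosing JavaSparkContext and the usual Spark SQL / Spark ML imports are in place:

JavaRDD<String> lines = jsc.parallelize(
        Arrays.asList("dev engg 10000", "karthik engg 20000"));
DataFrame ngrams = createNGramDataFrame(lines);
// ngrams contains: words (array<string>) and ngrams (array<string> of bigrams)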



The task can be completed without JSON, in Scala:

val data = List("dev, engg, 10000", "karthik, engg, 20000")
val initialRdd = sparkContext.parallelize(data)
val splittedRDD = initialRdd.map(current => {
  // split on commas and trim the stray spaces around each field
  val array = current.split(",").map(_.trim)
  (array(0), array(1), array(2))
})
import sqlContext.implicits._
val dataframe = splittedRDD.toDF("name", "degree", "salary")
dataframe.show()

Output is:

+-------+------+------+
|   name|degree|salary|
+-------+------+------+
|    dev|  engg| 10000|
|karthik|  engg| 20000|
+-------+------+------+

Note: (array(0), array(1), array(2)) is a Scala tuple (a Tuple3), which is what lets toDF assign the three column names.


  • In addition, the tutorial at spark.apache.org/docs/latest/… shows how you can manually define a schema, instead of using toDF, which isn't as reliable.
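
For reference, a minimal Java sketch of that programmatic-schema approach applied to this question's data (assuming jsc is a JavaSparkContext, data is the List<String> from the question, and the standard Spark SQL imports); note that salary can be given a real integer type this way:

StructType schema = new StructType(new StructField[] {
        new StructField("name", DataTypes.StringType, false, Metadata.empty()),
        new StructField("degree", DataTypes.StringType, false, Metadata.empty()),
        new StructField("salary", DataTypes.IntegerType, false, Metadata.empty()) });

JavaRDD<Row> rows = jsc.parallelize(data).map(new Function<String, Row>() {
    @Override
    public Row call(String line) throws Exception {
        String[] parts = line.split(",");
        return RowFactory.create(parts[0].trim(), parts[1].trim(),
                Integer.valueOf(parts[2].trim()));
    }
});

DataFrame typed = new SQLContext(jsc).createDataFrame(rows, schema);
typed.printSchema(); // salary: integer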
