I'm working on a new Spark project in Java. I have to read data from CSV files in which one column holds an array of floats, and I don't know how to get that column into my Dataset as an array.

I'm reading from this CSV:

[CSV data image][1] https://imgur.com/a/PdrMhev

And I'm trying to get the data in this way:

Dataset<Row> typedTrainingData = sparkSession.sql("SELECT CAST(IDp as String) IDp, CAST(Instt as String) Instt, CAST(dataVector as String) dataVector FROM TRAINING_DATA");

And I get this:

root
 |-- IDp: string (nullable = true)
 |-- Instt: string (nullable = true)
 |-- dataVector: string (nullable = true)

+-------+-------------+-----------------+
|    IDp|        Instt|       dataVector|
+-------+-------------+-----------------+
|    p01|      V11apps|-0.41,-0.04,0.1..|
|    p02|      V21apps|-1.50,-1.50,-1...|
+-------+-------------+-----------------+

As you can see in the schema, the array is read as a String, but I want it as an array. Any recommendations?

I want to run some MLlib machine learning algorithms on the loaded data, which is why I need the values as an array.

Thank you!

  • Could you show an example of your CSV file? Commented Dec 14, 2018 at 8:13
  • The CSV format doesn't support arrays, so you need to construct an array column from your dataVector string, e.g. using withColumn. Commented Dec 14, 2018 at 8:37
  • @BSeitkazin Yes, of course. I've edited the main post. Commented Dec 14, 2018 at 8:47
  • @BSeitkazin Stack Overflow doesn't let me post an image, so I've added a link that shows what my CSV looks like. Commented Dec 14, 2018 at 8:59

1 Answer

First, define your schema:

    StructType customStructType = new StructType();
    customStructType = customStructType.add("_c0", DataTypes.StringType, false);
    customStructType = customStructType.add("_c1", DataTypes.StringType, false);
    customStructType = customStructType.add("_c2", DataTypes.createArrayType(DataTypes.LongType), false);

Then map your DataFrame to the new schema:

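    Dataset<Row> newDF = oldDF.map((MapFunction<Row, Row>) row -> {
        // The comma-separated values sit in the third column, i.e. index 2
        // (the schema above only has columns 0, 1 and 2).
        String[] strings = row.getString(2).split(",");
        long[] result = new long[strings.length];
        for (int i = 0; i < strings.length; i++) {
            // Long.parseLong only accepts integer strings; decimal values such as
            // "-0.41" need Double.parseDouble and DataTypes.DoubleType instead.
            result[i] = Long.parseLong(strings[i]);
        }

        return RowFactory.create(row.getString(0), row.getString(1), result);
    }, RowEncoder.apply(customStructType));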

3 Comments

Thank you for your response! I tried this, but I get the following error: java.base/java.lang.String cannot be cast to java.base/java.lang.Long. What could be the cause? Thanks @Mahmoud!
Thank you again! But it doesn't work for me; it gives me this error: Caused by: java.lang.NumberFormatException: For input string: "-0.41" at java.base/java.lang.NumberFormatException.forInputString. I don't know what is happening. Have you tried this code?
You just need to convert your strings to Long. Use your own code for that; my answer was only meant to show how to read the DataFrame into a custom schema.
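
The NumberFormatException reported in the comments is what Long.parseLong does when given a decimal string such as "-0.41". Since the dataVector values in the question are floats, parsing them as doubles is the more likely fit. The following is a minimal sketch of that parsing step only, in plain Java with no Spark dependency. The class name VectorParser and the method parseVector are hypothetical names chosen for illustration; they are not part of Spark or of the answer above.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Spark or of the original answer.
class VectorParser {

    // Converts a comma-separated string such as "-0.41,-0.04,0.1" into a double[].
    // A null or blank input yields an empty array. A non-numeric or empty element
    // in the middle (e.g. "1,,2") still throws NumberFormatException.
    static double[] parseVector(String csv) {
        if (csv == null || csv.trim().isEmpty()) {
            return new double[0];
        }
        String[] parts = csv.split(",");
        double[] result = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            // Double.parseDouble accepts decimals, unlike Long.parseLong.
            result[i] = Double.parseDouble(parts[i].trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [-0.41, -0.04, 0.1]
        System.out.println(Arrays.toString(parseVector("-0.41,-0.04,0.1")));
    }
}
```

Inside the map lambda from the answer, this could presumably replace the parsing loop, for example VectorParser.parseVector(row.getString(2)), with the _c2 column declared as DataTypes.createArrayType(DataTypes.DoubleType) instead of LongType. The column index 2 is an assumption based on the three-column schema shown in the question, and this wiring has not been run against the asker's data.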
