Spark add column to dataframe when reading csv

Question

I have a CSV file with data shaped like this:

0,0;1,0;2,0;3,0;4,0;6,0;8,0;9,1
4,0;2,1;2,0;1,0;1,0;0,1;3,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;4,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;5,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;6,0;1,0;"BC"

I want to convert it into a DataFrame whose last column is named "value". I have already written this code in Scala:

val rawdf = spark.read.format("csv")
                 .option("header", "true")
                 .option("delimiter", ";")
                 .load(CSVPATH)

But rawdf.show(numRows = 4) gives me this result:

+---+---+---+---+---+---+---+---+
|0,0|1,0|2,0|3,0|4,0|6,0|8,0|9,1|
+---+---+---+---+---+---+---+---+
|4,0|2,1|2,0|1,0|1,0|0,1|3,0|1,0|
|4,0|2,1|2,0|1,0|1,0|0,1|4,0|1,0|
|4,0|2,1|2,0|1,0|1,0|0,1|5,0|1,0|
|4,0|2,1|2,0|1,0|1,0|0,1|6,0|1,0|
+---+---+---+---+---+---+---+---+

How can I add the last column in Spark? Should I just write it into the CSV file?

For the record, the different options that can be applied to a DataFrameReader: go to line 356
– Baptiste Merliot, Commented Aug 22, 2018 at 8:00

Simon · Accepted Answer · 2018-08-22 08:49:04Z

4

Here's a way to do it without changing the CSV file. You set the schema in your code:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// One field per header column, plus "X" for the unnamed last column
val schema = StructType(
    Array(
        StructField("0,0", StringType),
        StructField("1,0", StringType),
        StructField("2,0", StringType),
        StructField("3,0", StringType),
        StructField("4,0", StringType),
        StructField("6,0", StringType),
        StructField("8,0", StringType),
        StructField("9,1", StringType),
        StructField("X", StringType)
    )
)

// Column names now come from the explicit schema, not from the header line
val rawdf =
    spark.read.format("csv")
        .option("header", "true")
        .option("delimiter", ";")
        .schema(schema)
        .load("tmp.csv")
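If the header is long, typing every field by hand gets tedious. As a rough sketch, the same kind of schema could be derived from the header line itself, with one extra name appended for the unlabeled last column. The helper name `schemaFromHeader` is hypothetical, and the column name "value" is taken from the question; neither is part of Spark's API.

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper: build an all-string schema from the header line,
// appending one extra column name for the data column the header lacks.
def schemaFromHeader(headerLine: String, extraColumn: String): StructType = {
  val names = headerLine.split(";", -1).toSeq :+ extraColumn
  StructType(names.map(name => StructField(name, StringType)))
}

// Header line copied from the CSV in the question.
val derivedSchema = schemaFromHeader("0,0;1,0;2,0;3,0;4,0;6,0;8,0;9,1", "value")
```

The result can be passed to `.schema(...)` in the same way as the hand-written schema. The header line itself could be fetched with something like `spark.read.textFile(CSVPATH).first()`.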

edited Aug 22, 2018 at 8:49

answered Aug 22, 2018 at 8:32

Simon

2 Comments

Baptiste Merliot Over a year ago

It worked perfectly, thanks! Out of curiosity, what is the true parameter for?

Simon Over a year ago

The third argument is nullable. It's got a default value (true) so I'll remove that since it's not really relevant to this question.
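To illustrate the default mentioned in this comment: `StructField` is a case class whose third parameter, `nullable`, defaults to `true`, so the two-argument and three-argument forms compare equal. A small check, assuming Spark SQL is on the classpath (the value names are only for illustration):

```scala
import org.apache.spark.sql.types.{StringType, StructField}

// nullable is the third parameter of StructField and defaults to true.
val implicitNullable = StructField("X", StringType)
val explicitNullable = StructField("X", StringType, nullable = true)
val notNullable      = StructField("X", StringType, nullable = false)

// Case-class equality: the first two fields are the same value.
println(implicitNullable == explicitNullable) // true
println(notNullable.nullable)                 // false
```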

Constantine · Accepted Answer · 2018-08-22 08:09:59Z

0

Spark maps the data columns based on the number of header columns available when you set:

.option("header", "true")

You can resolve this issue in one of the two ways below:

1. Set header = false.
2. Add a header name for the last data column, or just add a semicolon (;) at the end of the header line.

e.g.:

0,0;1,0;2,0;3,0;4,0;6,0;8,0;9,1;
4,0;2,1;2,0;1,0;1,0;0,1;3,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;4,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;5,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;6,0;1,0;"BC"

OR

0,0;1,0;2,0;3,0;4,0;6,0;8,0;9,1;col_end
4,0;2,1;2,0;1,0;1,0;0,1;3,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;4,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;5,0;1,0;"BC"
4,0;2,1;2,0;1,0;1,0;0,1;6,0;1,0;"BC"
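If hand-editing the file is impractical, the header fix could be scripted before Spark parses the file. The sketch below works on lines already held in memory; the function name `patchHeader` and the column name "value" are assumptions made for illustration.

```scala
// Hypothetical helper: append a name for the missing last column to the
// header (first) line and leave the data lines untouched.
def patchHeader(lines: Seq[String], extraColumn: String): Seq[String] =
  lines match {
    case header +: data => (header + ";" + extraColumn) +: data
    case _              => lines // empty input: nothing to patch
  }

// Header and first data line from the question.
val original = Seq(
  "0,0;1,0;2,0;3,0;4,0;6,0;8,0;9,1",
  "4,0;2,1;2,0;1,0;1,0;0,1;3,0;1,0;\"BC\""
)

val patched = patchHeader(original, "value")
patched.foreach(println)
```

The patched lines would then be written to a new file, or handed to Spark as a `Dataset[String]` through the `csv(Dataset[String])` overload of `DataFrameReader`, with the same header and delimiter options as in the question.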

answered Aug 22, 2018 at 8:09

Constantine

3 Comments

Baptiste Merliot Over a year ago

So I have to change my csv file if I want a header with the last column?

Baptiste Merliot Over a year ago

Actually, setting header = false does not solve the issue: the first line of the CSV (the header) is missing one column, so Spark ignores the last column on the following lines.

Constantine Over a year ago

Yes, add a column to the header. It should fix it.

Anahcolus · Accepted Answer · 2018-08-22 08:45:01Z

If you don't know how many fields the data lines have, you can read the file as an RDD, parse the lines, and then create a schema to form a DataFrame, as below:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Read the data as an RDD and split each line; -1 keeps trailing empty fields
val rddData = spark.sparkContext.textFile(CSVPATH)
    .map(_.split(";", -1))

// Get the maximum number of fields per line and create the schema
val maxlength = rddData.map(_.length).max()
val schema = StructType((1 to maxlength).map(x => StructField(s"col_${x}", StringType, true)))

// Pad each row to maxlength with null where there is no data, then apply the schema
val rawdf = spark.createDataFrame(
    rddData.map(x => Row.fromSeq((0 until maxlength).map(index => if (index < x.length) x(index) else null))),
    schema)

rawdf.show(false)

which should give you

+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|0,0  |1,0  |2,0  |3,0  |4,0  |6,0  |8,0  |9,1  |null |
|4,0  |2,1  |2,0  |1,0  |1,0  |0,1  |3,0  |1,0  |"BC" |
|4,0  |2,1  |2,0  |1,0  |1,0  |0,1  |4,0  |1,0  |"BC" |
|4,0  |2,1  |2,0  |1,0  |1,0  |0,1  |5,0  |1,0  |"BC" |
|4,0  |2,1  |2,0  |1,0  |1,0  |0,1  |6,0  |1,0  |"BC" |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
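The padding step is easier to follow on plain Scala collections, without Spark. A minimal sketch using two sample lines from the question; `padTo` is swapped in here for the index-based lookup in the code above, and `null` marks the missing cell:

```scala
// Two sample lines: the header has 8 fields, the data line has 9.
val lines = Seq(
  "0,0;1,0;2,0;3,0;4,0;6,0;8,0;9,1",
  "4,0;2,1;2,0;1,0;1,0;0,1;3,0;1,0;\"BC\""
)

// A limit of -1 keeps trailing empty fields instead of dropping them.
val splitLines: Seq[Array[String]] = lines.map(_.split(";", -1))

// The widest row decides how many columns the schema needs.
val maxLength = splitLines.map(_.length).max

// Pad every shorter row with null up to maxLength.
val padded: Seq[Seq[String]] =
  splitLines.map(_.toSeq.padTo(maxLength, null: String))

padded.foreach(row => println(row.mkString("|")))
```

Each padded row then has the same width, which is what `Row.fromSeq` needs to match the generated schema.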

I hope the answer is helpful
