
I'm using Spark 1.5.3, and I'm trying to separate the even and odd columns of a Spark DataFrame:

String filePath = "/home/inputData.txt";
DataFrame inputDataFrame = sql.read().format("com.databricks.spark.csv") //No I18N
    .option("inferSchema", "true") //No I18N
    .option("delimiter", ",") //No I18N
    .option("header", "false") //No I18N
    .load(filePath);

inputDataFrame.show();

List<String> evenColumns = Arrays.asList("C0", "C2", "C4", "C6", "C8", "C10", "C12");
DataFrame oddDataFrame = inputDataFrame.na().drop(JavaConversions.asScalaBuffer(evenColumns));
DataFrame evenDataFrame = inputDataFrame.selectExpr(JavaConversions.asScalaBuffer(evenColumns));

evenDataFrame.show();
oddDataFrame.show();
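As an aside, the evenColumns list above is hard-coded against spark-csv's default C0…C13 names; when the column count is known, both name lists can be generated instead of typed out. A plain-Java sketch, no Spark required (the ColumnNames class and alternating helper are hypothetical names):

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnNames {
    // Column names "C<start>", "C<start+2>", ... for a frame with n columns,
    // matching spark-csv's default C0..C(n-1) naming when header=false.
    static List<String> alternating(int start, int n) {
        List<String> names = new ArrayList<>();
        for (int i = start; i < n; i += 2) {
            names.add("C" + i);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(alternating(0, 14)); // even columns: [C0, C2, C4, C6, C8, C10, C12]
        System.out.println(alternating(1, 14)); // odd columns:  [C1, C3, C5, C7, C9, C11, C13]
    }
}
```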

Output for the above code:

inputDataFrame:

+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| C0| C1| C2| C3| C4| C5| C6| C7| C8| C9|C10|C11|C12|C13|
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|  0|  0|  0|  0|  1|  0|  0|  0|  2|  3|  2|  2|  0|  5|
|  1|  5|  6|  0| 14|  0|  5|  0| 95|  2|120|  0|  0|  9|
|  1|  6|  1|  0|  3|  0|  4|  0| 21| 22| 11|  0|  0| 23|
|  1|  0|  1|  0|  1|  0|  4|  0|  1|  4|  2|  0|  0|  5|
|  1| 37|  9|  0| 19|  0| 31|  0| 87|  9|108|  0|  0|170|
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

evenDataFrame:

+---+---+---+---+---+---+---+
| C0| C2| C4| C6| C8|C10|C12|
+---+---+---+---+---+---+---+
|  0|  0|  1|  0|  2|  2|  0|
|  1|  6| 14|  5| 95|120|  0|
|  1|  1|  3|  4| 21| 11|  0|
|  1|  1|  1|  4|  1|  2|  0|
|  1|  9| 19| 31| 87|108|  0|
+---+---+---+---+---+---+---+

oddDataFrame:

+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| C0| C1| C2| C3| C4| C5| C6| C7| C8| C9|C10|C11|C12|C13|
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|  0|  0|  0|  0|  1|  0|  0|  0|  2|  3|  2|  2|  0|  5|
|  1|  5|  6|  0| 14|  0|  5|  0| 95|  2|120|  0|  0|  9|
|  1|  6|  1|  0|  3|  0|  4|  0| 21| 22| 11|  0|  0| 23|
|  1|  0|  1|  0|  1|  0|  4|  0|  1|  4|  2|  0|  0|  5|
|  1| 37|  9|  0| 19|  0| 31|  0| 87|  9|108|  0|  0|170|
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Dropping the evenColumns isn't performed on the DataFrame above. What am I doing wrong? The expected output is:

+---+---+---+---+---+---+---+
| C1| C3| C5| C7| C9|C11|C13|
+---+---+---+---+---+---+---+
|  0|  0|  0|  0|  3|  2|  5|
|  5|  0|  0|  0|  2|  0|  9|
|  6|  0|  0|  0| 22|  0| 23|
|  0|  0|  0|  0|  4|  0|  5|
| 37|  0|  0|  0|  9|  0|170|
+---+---+---+---+---+---+---+
  • Did you manage to solve the problem? Commented Jun 22, 2018 at 9:48
  • @Shaido.. Yes, I solved the problem, but it is a workaround approach. Commented Jun 28, 2018 at 9:07
  • Just curious, is there any reason the other answer (maybe I should note it's written by me) did not work in your case? Commented Jun 28, 2018 at 9:13
  • @Shaido I think your answer inputDataFrame.drop(JavaConversions.asScalaBuffer(evenColumns)); will work for Dataset<Row> in Spark 2.x Commented Jun 28, 2018 at 10:32

2 Answers


na().drop is used to drop rows containing null or NaN values. To drop columns regardless of what they contain, simply use drop(). In this case it should be:

inputDataFrame.drop(JavaConversions.asScalaBuffer(evenColumns));

This is an indirect/workaround approach:

public static DataFrame drop(DataFrame dataFrame, List<String> dropCol) {
    // Keep every column name except the ones to be dropped.
    List<String> colname = Arrays.stream(dataFrame.columns())
        .filter(col -> !dropCol.contains(col))
        .collect(Collectors.toList());
    return dataFrame.selectExpr(JavaConversions.asScalaBuffer(colname));
}
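The filtering step in this workaround is plain Java, so it can be exercised without a Spark runtime. This sketch (class and method names are hypothetical) substitutes a String[] for dataFrame.columns():

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class DropColumnsDemo {
    // Same filter as in the workaround: keep every column name not listed in dropCol.
    static List<String> keepAllExcept(String[] columns, List<String> dropCol) {
        return Arrays.stream(columns)
            .filter(col -> !dropCol.contains(col))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for dataFrame.columns() on the 14-column input frame.
        String[] columns = {"C0", "C1", "C2", "C3", "C4", "C5", "C6", "C7",
                            "C8", "C9", "C10", "C11", "C12", "C13"};
        List<String> evenColumns = Arrays.asList("C0", "C2", "C4", "C6", "C8", "C10", "C12");

        // Passing this result to selectExpr would yield the odd-column frame.
        System.out.println(keepAllExcept(columns, evenColumns)); // [C1, C3, C5, C7, C9, C11, C13]
    }
}
```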
