
I'm using Spark 1.5.3, and I'm trying to separate the even and odd columns of a Spark DataFrame:

String filePath = "/home/inputData.txt";
DataFrame inputDataFrame = sql.read().format("com.databricks.spark.csv") //No I18N
    .option("inferSchema", "true") //No I18N
    .option("delimiter", ",") //No I18N
    .option("header", "false") //No I18N
    .load(filePath);

inputDataFrame.show();

List<String> evenColumns = Arrays.asList("C0", "C2", "C4", "C6", "C8", "C10", "C12");
DataFrame oddDataFrame = inputDataFrame.na().drop(JavaConversions.asScalaBuffer(evenColumns));
DataFrame evenDataFrame = inputDataFrame.selectExpr(JavaConversions.asScalaBuffer(evenColumns));

evenDataFrame.show();
oddDataFrame.show();
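As an aside, the evenColumns list above is hard-coded against spark-csv's default C0…C13 names; when the column count is known, both name lists can be generated instead of typed out. A plain-Java sketch, no Spark required (the ColumnNames class and alternating helper are hypothetical names):

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnNames {
    // Column names "C<start>", "C<start+2>", ... for a frame with n columns,
    // matching spark-csv's default C0..C(n-1) naming when header=false.
    static List<String> alternating(int start, int n) {
        List<String> names = new ArrayList<>();
        for (int i = start; i < n; i += 2) {
            names.add("C" + i);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(alternating(0, 14)); // even columns: [C0, C2, C4, C6, C8, C10, C12]
        System.out.println(alternating(1, 14)); // odd columns:  [C1, C3, C5, C7, C9, C11, C13]
    }
}
```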

Output for the above code:

inputDataFrame:

+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| C0| C1| C2| C3| C4| C5| C6| C7| C8| C9|C10|C11|C12|C13|
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|  0|  0|  0|  0|  1|  0|  0|  0|  2|  3|  2|  2|  0|  5|
|  1|  5|  6|  0| 14|  0|  5|  0| 95|  2|120|  0|  0|  9|
|  1|  6|  1|  0|  3|  0|  4|  0| 21| 22| 11|  0|  0| 23|
|  1|  0|  1|  0|  1|  0|  4|  0|  1|  4|  2|  0|  0|  5|
|  1| 37|  9|  0| 19|  0| 31|  0| 87|  9|108|  0|  0|170|
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

evenDataFrame:

+---+---+---+---+---+---+---+
| C0| C2| C4| C6| C8|C10|C12|
+---+---+---+---+---+---+---+
|  0|  0|  1|  0|  2|  2|  0|
|  1|  6| 14|  5| 95|120|  0|
|  1|  1|  3|  4| 21| 11|  0|
|  1|  1|  1|  4|  1|  2|  0|
|  1|  9| 19| 31| 87|108|  0|
+---+---+---+---+---+---+---+

oddDataFrame:

+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| C0| C1| C2| C3| C4| C5| C6| C7| C8| C9|C10|C11|C12|C13|
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|  0|  0|  0|  0|  1|  0|  0|  0|  2|  3|  2|  2|  0|  5|
|  1|  5|  6|  0| 14|  0|  5|  0| 95|  2|120|  0|  0|  9|
|  1|  6|  1|  0|  3|  0|  4|  0| 21| 22| 11|  0|  0| 23|
|  1|  0|  1|  0|  1|  0|  4|  0|  1|  4|  2|  0|  0|  5|
|  1| 37|  9|  0| 19|  0| 31|  0| 87|  9|108|  0|  0|170|
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Dropping the evenColumns isn't performed on the DataFrame above. What am I doing wrong? The expected output is:

+---+---+---+---+---+---+---+
| C1| C3| C5| C7| C9|C11|C13|
+---+---+---+---+---+---+---+
|  0|  0|  0|  0|  3|  2|  5|
|  5|  0|  0|  0|  2|  0|  9|
|  6|  0|  0|  0| 22|  0| 23|
|  0|  0|  0|  0|  4|  0|  5|
| 37|  0|  0|  0|  9|  0|170|
+---+---+---+---+---+---+---+
  • Did you manage to solve the problem? Commented Jun 22, 2018 at 9:48
  • @Shaido.. Yes, I solved the problem, but it is a workaround approach. Commented Jun 28, 2018 at 9:07
  • Just curious, is there any reason the other answer (maybe I should note it's written by me) did not work in your case? Commented Jun 28, 2018 at 9:13
  • @Shaido I think your answer inputDataFrame.drop(JavaConversions.asScalaBuffer(evenColumns)); will work for Dataset<Row> in Spark 2.x Commented Jun 28, 2018 at 10:32

2 Answers


na().drop is used to drop rows containing null or NaN values. To drop columns regardless of what they contain, simply use drop(). In this case it should be:

inputDataFrame.drop(JavaConversions.asScalaBuffer(evenColumns));

This is an indirect/workaround approach:

public static DataFrame drop(DataFrame dataFrame, List<String> dropCol) {
    // Keep every column name except the ones to be dropped.
    List<String> colname = Arrays.stream(dataFrame.columns())
        .filter(col -> !dropCol.contains(col))
        .collect(Collectors.toList());
    return dataFrame.selectExpr(JavaConversions.asScalaBuffer(colname));
}
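The filtering step in this workaround is plain Java, so it can be exercised without a Spark runtime. This sketch (class and method names are hypothetical) substitutes a String[] for dataFrame.columns():

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class DropColumnsDemo {
    // Same filter as in the workaround: keep every column name not listed in dropCol.
    static List<String> keepAllExcept(String[] columns, List<String> dropCol) {
        return Arrays.stream(columns)
            .filter(col -> !dropCol.contains(col))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for dataFrame.columns() on the 14-column input frame.
        String[] columns = {"C0", "C1", "C2", "C3", "C4", "C5", "C6", "C7",
                            "C8", "C9", "C10", "C11", "C12", "C13"};
        List<String> evenColumns = Arrays.asList("C0", "C2", "C4", "C6", "C8", "C10", "C12");

        // Passing this result to selectExpr would yield the odd-column frame.
        System.out.println(keepAllExcept(columns, evenColumns)); // [C1, C3, C5, C7, C9, C11, C13]
    }
}
```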
