I have a DataFrame in Spark and I would like to replace the values of several columns based on a simple rule: if the value ends with "_P", replace it with "1"; if it ends with "_N", replace it with "-1". There are multiple columns that need the same replacement. I also need to do a cast at the end.
Comment (Samy Dindane, Sep 13, 2016): What did you try and why didn't it work?
Comment (HHH, Sep 13, 2016): I tried df.na.replace(columns, Map("[a-zA-Z0-9]_P" -> "1", "[a-zA-Z0-9]_N" -> "-1")). It is not working though.
1 Answer
You can do it with expressions like when('column.endsWith("_P"), lit("1")).when.... The same could be achieved using regexp_replace. Here's an example using when:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{when, lit}

val myDf = sc.parallelize(Array(
  ("foo_P", "bar_N", "123"),
  ("foo_N", "bar_Y", "123"),
  ("foo", "bar", "123"),
  ("foo_Y", "bar_XX", "123")
)).toDF("col1", "col2", "col3")

val colsToReplace = Seq("col1", "col2")

// Builds the when/otherwise expression for one column; you can append a
// .cast(...) here if you need the cast you mentioned.
val castValues = (colName: String) => {
  val col = new Column(colName)
  when(col.endsWith("_P"), lit("1"))
    .when(col.endsWith("_N"), lit("-1"))
    .otherwise(col)
    .as(colName)
}

val selectExprs = myDf.columns.diff(colsToReplace).map(new Column(_)) ++ colsToReplace.map(castValues)
myDf.select(selectExprs:_*).show
/*
+----+-----+------+
|col3| col1|  col2|
+----+-----+------+
| 123|    1|    -1|
| 123|   -1| bar_Y|
| 123|  foo|   bar|
| 123|foo_Y|bar_XX|
+----+-----+------+
*/
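For completeness, the regexp_replace route mentioned above could look roughly like this (a sketch reusing the myDf and colsToReplace values from the snippet above; the ".*" prefix plus the "$" anchor make the pattern cover the whole value, so the entire string is replaced):

```scala
import org.apache.spark.sql.functions.{col, regexp_replace}

// Fold over the target columns, rewriting each one in place.
// Values that match neither pattern pass through unchanged.
val replaced = colsToReplace.foldLeft(myDf) { (df, c) =>
  df.withColumn(c,
    regexp_replace(regexp_replace(col(c), ".*_P$", "1"), ".*_N$", "-1"))
}
```

If you then need the cast from the question, you could chain something like col(c).cast("int") afterwards; note that any leftover non-numeric values (e.g. "foo") would become null at that point.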
EDIT
By the way, regarding your comment about what you tried: the df.na functions are meant to work on rows containing NULL values, so even if what you tried worked, it would only affect rows containing nulls. Apart from that, replace doesn't work with regular expressions; at least it didn't the last time I checked.
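To illustrate that last point: na.replace matches exact literal values only, so a pattern like "[a-zA-Z0-9]_P" is searched for as a literal string. Something like the following would work, but only for values you enumerate explicitly (a sketch assuming the myDf from the answer above):

```scala
// Each map key must be the exact cell value, not a regex.
val literal = myDf.na.replace(Seq("col1", "col2"), Map("foo_P" -> "1", "foo_N" -> "-1"))
```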
Cheers