
I have the DataFrame below.

id,code
1,GSTR
2,GSTR
3,NA
4,NA
5,NA

Here GSTR is only an example; the value may change and can be anything. I want to replace NA with the other string present in the same column, which in this case is GSTR. I tried to use UDFs, but because the string is not known in advance, I could not figure out how to do it.

Note: the code column will only ever contain two distinct strings. One is "NA" and the other can be anything; in our case it is GSTR.

Expected output:

1,GSTR
2,GSTR
3,GSTR
4,GSTR
5,GSTR
  • Will the code column always have only 2 values, 'NA' and 'some string'? Commented Jan 5, 2018 at 10:17
  • Yes suresh, always. Commented Jan 5, 2018 at 10:19

1 Answer

We can take the distinct string other than NA and use it:

>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([(1,'GSTR'),(2,'GSTR'),(3,'NA'),(4,'NA'),(5,'NA')],['id','code'])
>>> df.show()
+---+----+
| id|code|
+---+----+
|  1|GSTR|
|  2|GSTR|
|  3|  NA|
|  4|  NA|
|  5|  NA|
+---+----+
>>> rstr = df.where(df.code != 'NA')[['code']].first().code
>>> df.withColumn('code',F.lit(rstr)).show()
+---+----+
| id|code|
+---+----+
|  1|GSTR|
|  2|GSTR|
|  3|GSTR|
|  4|GSTR|
|  5|GSTR|
+---+----+

Hope this helps.


4 Comments

Thanks for your input. GSTR can be anywhere, not only in the first position. Can you do anything for that?
@AshSr, code will have only two values and we keep only the non-NA rows, which leaves only GSTR. All of those rows contain GSTR, so taking just the first one gives us the string dynamically.
Okay suresh, suppose I have a few more columns like code1 and code2 holding the same type of data. Should I code this for each and every column? Can't we make that dynamic?
