PySpark dataframe to_json() function

Question

I have a dataframe like the one below:

>>> df.show(10,False)
+-----+----+---+------+
|id   |name|age|salary|
+-----+----+---+------+
|10001|alex|30 |75000 |
|10002|bob |31 |80000 |
|10003|deb |31 |80000 |
|10004|john|33 |85000 |
|10005|sam |30 |75000 |
+-----+----+---+------+

I convert each entire row of df into one new column, "jsonCol":

>>> newDf1 = df.withColumn("jsonCol", to_json(struct([df[x] for x in df.columns])))
>>> newDf1.show(10,False)
+-----+----+---+------+--------------------------------------------------------+
|id   |name|age|salary|jsonCol                                                 |
+-----+----+---+------+--------------------------------------------------------+
|10001|alex|30 |75000 |{"id":"10001","name":"alex","age":"30","salary":"75000"}|
|10002|bob |31 |80000 |{"id":"10002","name":"bob","age":"31","salary":"80000"} |
|10003|deb |31 |80000 |{"id":"10003","name":"deb","age":"31","salary":"80000"} |
|10004|john|33 |85000 |{"id":"10004","name":"john","age":"33","salary":"85000"}|
|10005|sam |30 |75000 |{"id":"10005","name":"sam","age":"30","salary":"75000"} |
+-----+----+---+------+--------------------------------------------------------+

Instead of converting the entire row into a JSON string as in the step above, I need to include only some of the columns, depending on each field's value. The command below uses a sample condition.

But once I use the when function, the column names (keys) in the resulting JSON string are lost. I get positional names (col1, col2, ...) instead of the actual column names (keys):

>>> newDf2 = df.withColumn("jsonCol", to_json(struct([ when(col(x)!="  ",df[x]).otherwise(None) for x in df.columns])))
>>> newDf2.show(10,False)
+-----+----+---+------+---------------------------------------------------------+
|id   |name|age|salary|jsonCol                                                  |
+-----+----+---+------+---------------------------------------------------------+
|10001|alex|30 |75000 |{"col1":"10001","col2":"alex","col3":"30","col4":"75000"}|
|10002|bob |31 |80000 |{"col1":"10002","col2":"bob","col3":"31","col4":"80000"} |
|10003|deb |31 |80000 |{"col1":"10003","col2":"deb","col3":"31","col4":"80000"} |
|10004|john|33 |85000 |{"col1":"10004","col2":"john","col3":"33","col4":"85000"}|
|10005|sam |30 |75000 |{"col1":"10005","col2":"sam","col3":"30","col4":"75000"} |
+-----+----+---+------+---------------------------------------------------------+

I need to use the when function but still get the actual column names (keys) in the result, as in newDf1. Can someone help me out?

Anahcolus · Accepted Answer · 2018-04-03 02:30:24Z

9

You passed conditional expressions to the struct function as its columns. Those expressions do not carry the original column names, so the struct fields are named col1, col2, and so on. That's why you need alias to restore the names:

from pyspark.sql import functions as F
newDf2 = df.withColumn("jsonCol", F.to_json(F.struct([F.when(F.col(x)!="  ",df[x]).otherwise(None).alias(x) for x in df.columns])))
newDf2.show(truncate=False)
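
To make the effect of the alias easier to check, here is a minimal sketch under some assumptions. The helper names add_json_col and expected_json, the blank parameter, and the sample row with a blank age are all hypothetical and not from the question. The sketch also assumes that to_json leaves out fields whose value is null, which is the default behaviour as far as I know. PySpark is imported inside the function, so the plain-Python model at the bottom runs even without Spark installed.

```python
import json


def add_json_col(df, columns, blank="  "):
    """Return df with a 'jsonCol' column built from the non-blank values of `columns`.

    Hypothetical helper: PySpark is imported lazily, so nothing Spark-related
    runs unless this function is actually called with a DataFrame.
    """
    from pyspark.sql import functions as F

    fields = [
        # when() without otherwise() yields null for rows that fail the condition.
        # alias(x) restores the original name; without it the field is col1, col2, ...
        F.when(F.col(x) != blank, F.col(x)).alias(x)
        for x in columns
    ]
    return df.withColumn("jsonCol", F.to_json(F.struct(*fields)))


def expected_json(row, columns, blank="  "):
    """Plain-Python model of one row's jsonCol, assuming null fields are omitted."""
    kept = {x: row[x] for x in columns if row[x] is not None and row[x] != blank}
    # Compact separators mimic the no-whitespace JSON shown in the question.
    return json.dumps(kept, separators=(",", ":"))


# Hypothetical row: same shape as the question's data, but with a blank age.
row = {"id": "10001", "name": "alex", "age": "  ", "salary": "75000"}
print(expected_json(row, ["id", "name", "age", "salary"]))
```

With a real DataFrame, add_json_col(df, df.columns) would play the role of the newDf2 expression above. The model only illustrates which keys you should expect to see; it is not a substitute for running the Spark code on your own data.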

edited Apr 3, 2018 at 2:30

answered Apr 2, 2018 at 2:16
