PySpark DataFrame explode string column

Question

I am looking for an efficient way to explode the rows of the PySpark DataFrame df_input into columns. I don't understand the format '@{name...}' and don't know where to start decoding it. Thanks for your help!

df_input = sqlContext.createDataFrame(
    [
        (1, '@{name= Hans; age= 45}'),
        (2, '@{name= Jeff; age= 15}'),
        (3, '@{name= Elona; age= 23}'),
    ],
    ('id', 'firstCol'),
)

Expected result:

+---+-----+---+
| id| name|age|
+---+-----+---+
|  1| Hans| 45|
|  2| Jeff| 15|
|  3|Elona| 23|
+---+-----+---+

What data types do you see when you use df.printSchema() on your real dataframe?
– ZygD, commented Jun 14, 2022 at 15:15

blackbishop · Accepted Answer · 2022-06-14 16:05:05Z

2

Convert the string into a map with the str_to_map function, explode the map into key/value rows, then pivot the keys into columns:

from pyspark.sql import functions as F

df = df_input.selectExpr(
    "id",
    # Strip '@', '{' and '}', then split pairs on '; ' and keys from values on '= '
    # so that neither keys nor values keep a leading space.
    "explode(str_to_map(regexp_replace(firstCol, '[@{}]', ''), '; ', '= '))"
).groupby("id").pivot("key").agg(F.first("value"))

df.show()
#+---+---+-----+
#| id|age| name|
#+---+---+-----+
#|  1| 45| Hans|
#|  2| 15| Jeff|
#|  3| 23|Elona|
#+---+---+-----+
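To see why the str_to_map expression works, it can help to trace the same steps on a single value in plain Python. The sketch below is only an illustration: parse_record is a hypothetical helper, not part of PySpark, and it assumes every value looks like '@{key= value; key= value}' with no ';' or '=' inside the values.

```python
import re


def parse_record(raw: str) -> dict:
    """Mimic the str_to_map steps on one value (hypothetical helper, not a PySpark API)."""
    # Drop the '@', '{' and '}' wrapper characters, as regexp_replace does.
    body = re.sub(r"[@{}]", "", raw)
    result = {}
    # Pairs are separated by ';' and keys from values by '='.
    for pair in body.split(";"):
        key, _, value = pair.partition("=")
        # Strip the spaces that follow each delimiter.
        result[key.strip()] = value.strip()
    return result


rows = [
    (1, "@{name= Hans; age= 45}"),
    (2, "@{name= Jeff; age= 15}"),
    (3, "@{name= Elona; age= 23}"),
]
parsed = [(row_id, parse_record(raw)) for row_id, raw in rows]
for row_id, fields in parsed:
    print(row_id, fields["name"], fields["age"])
```

Every parsed value is still a string here, which matches what the map-based approach produces in Spark; a cast would be needed if age should be numeric.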


Matt Andruff · Accepted Answer · 2022-06-14 15:45:35Z

1

from pyspark.sql.functions import regexp_extract

df_input.select(
    df_input.id,
    regexp_extract(
        df_input.firstCol,
        r'\s(.*);',  # find a whitespace character, then capture everything up to the ';'
        1,  # return capture group 1
    ).alias("name"),
    regexp_extract(
        df_input.firstCol,
        r'\s.*\s(.*)}',  # skip to the last whitespace, then capture everything up to the '}'
        1,  # return capture group 1
    ).alias("age"),
).show()
+---+-----+---+
| id| name|age|
+---+-----+---+
|  1| Hans| 45|
|  2| Jeff| 15|
|  3|Elona| 23|
+---+-----+---+
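The two patterns can be checked outside Spark with Python's re module, which handles these particular patterns the same way as the Java regex engine that Spark uses. This is a rough sketch: extract is a hypothetical stand-in for regexp_extract and, like it, returns an empty string when nothing matches.

```python
import re


def extract(text: str, pattern: str, group: int) -> str:
    """Hypothetical stand-in for regexp_extract: '' when there is no match."""
    match = re.search(pattern, text)
    return match.group(group) if match else ""


# First whitespace, then everything up to the last ';' (the '.*' is greedy).
NAME_PATTERN = r"\s(.*);"
# Text between the last whitespace and the closing '}'.
AGE_PATTERN = r"\s.*\s(.*)}"

sample = "@{name= Hans; age= 45}"
name = extract(sample, NAME_PATTERN, 1)
age = extract(sample, AGE_PATTERN, 1)
print(name, age)
```

Both patterns depend on the field order and on there being exactly two fields. With a third 'key= value' pair, the greedy '.*' in the name pattern would capture everything up to the last ';'.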
