Extract strings from a heavily nested DataFrame

Question

The structure of my DataFrame is dynamic and heavily nested. I would like to extract the string values of all fields named text, at whatever depth they occur.

df.select(df['back'])
DataFrame[back: struct<attrs:struct<version:int>,content:array<struct<content:array<struct<content:array<struct<type:string,content:array<struct<type:string,text:string>>>>,text:string,type:string>>,type:string>>,type:string>]

df.select(df['back']).show()
|[[1], [[[[, Vernetzte Komponenten innerhalb eines Netzwerks mit verschiedenen Hardware-Rechnern realisiert Ja., text]], paragraph]], doc]    

df.select(df['back']).collect()
Row(back=Row(attrs=Row(version=1), content=[Row(content=[Row(content=None, text='Vernetzte Komponenten innerhalb eines Netzwerks mit verschiedenen Hardware-Rechnern realisiert Ja.', type='text')], type='paragraph')], type='doc'))

Kafels · Accepted Answer · 2021-08-10 13:00:28Z


Since the DataFrame's structure is dynamic, I recommend using a UDF to iterate over the nested column:

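import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, StringType

# Declare the return type explicitly: a bare udf() defaults to StringType,
# so the new column would be a string rather than an array of strings.
@f.udf(returnType=ArrayType(StringType()))
def extract_texts(elements):
  texts = []
  # Work on a copy; an empty list also covers a null 'back.content'
  queue = list(elements or [])

  # Breadth-first walk over the nested 'content' arrays
  while queue:
    element = queue.pop(0)
    if 'content' in element and element['content'] is not None:
      queue.extend(element['content'])

    # Skip levels where 'text' is missing or null
    if 'text' in element and element['text'] is not None:
      texts.append(element['text'])

  return texts

new_df = (df
          .withColumn('texts', extract_texts('back.content')))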

new_df.show()
# +--------------------+--------------------+
# |                back|               texts|
# +--------------------+--------------------+
# |{{1}, [{[{null, V...|[Vernetzte Kompon...|
# +--------------------+--------------------+
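One thing to be aware of: the loop in extract_texts takes elements from the front of the list and appends children to the end, so it walks the tree breadth-first. Texts therefore come back grouped by nesting depth, not necessarily in reading order. If document order matters, a depth-first variant is one option. The following is only a sketch in plain Python on dicts, so it can be tried without Spark; the name collect_texts and the sample data are hypothetical and not part of the original answer. It uses the same `in` and `[]` lookups as the UDF above, so the body should carry over to Row objects, though that is an assumption worth checking on your data.

```python
def collect_texts(nodes):
    """Collect non-null 'text' values in document (depth-first, pre-order) order.

    Hypothetical helper: `nodes` is a list of dict-like elements that may
    carry a 'text' field and a nested 'content' list.
    """
    texts = []
    # Reverse so that popping from the end yields the first node first
    stack = list(reversed(nodes or []))

    while stack:
        node = stack.pop()

        # Record this node's own text before descending into its children
        if 'text' in node and node['text'] is not None:
            texts.append(node['text'])

        children = node['content'] if 'content' in node else None
        if children:
            # Push children reversed so they are visited left to right
            stack.extend(reversed(children))

    return texts


# Made-up sample with text at two different depths
sample = [
    {'type': 'paragraph', 'content': [
        {'type': 'text', 'text': 'first', 'content': None},
        {'type': 'list', 'content': [{'type': 'text', 'text': 'nested'}]},
        {'type': 'text', 'text': 'last'},
    ]},
]

result = collect_texts(sample)
print(result)
```

On this sample a breadth-first walk would place 'last' before 'nested', because 'nested' sits one level deeper; the depth-first version keeps the order in which the texts appear in the structure. It also leaves its input list unmodified.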
