Pyspark dataframe with XML column and multiple values inside: Extract columns out of it

Question

I have a PySpark dataframe in which one column contains XML. The XML in each row looks like the example below; some rows have 2 entries, some 3 or 4:

Example of one row entry:

<?xml version="1.0" encoding="utf-8"?> <goals> <goal id="445" name="xxxy" sex="F" /> <goal id="2468" name="qwerzui" sex="F" /> <goal id="4334" name="foo" sex="M" /> <goal id="15" name="fooh" sex="F" /> </goals>

I need to parse out the values for goal id, name and sex and create columns from them.

Since one XML document can have several entries, it is difficult to generate a fixed number of columns from it. My idea was to create one column per attribute (that is, add 3 columns to the dataframe), each holding a list of values.

In this example, the PySpark dataframe would be extended with these columns:

goal id	name	sex
[445,2468,4334,15]	[xxxy,qwerzui,foo,fooh]	[F,F,M,F]

I was thinking of a UDF that iterates through the XML column and creates the columns from it. Does it make sense to do it this way, also with later analysis in mind? Having lists in columns is not really common.

I tried it via:

import xml.etree.ElementTree as ET

root = ET.fromstring(string)

With the following I can access the values inside, but I am not able to put this into a proper UDF to expand my PySpark dataframe.

for child in root:
  print(child.tag, child.attrib)
  
for child in root:
  print(child.attrib['name'],child.attrib['sex'])

Unfortunately, the other solutions on Stack Overflow did not help me, so I am hoping for a solution to my problem.

mck · Accepted Answer · 2020-12-19 14:34:40Z

Use xpath. There is no need for a UDF, and it should give better performance.

df2 = df.selectExpr(
    ["xpath(col, 'goals/goal/@%s') as %s" % (c,c) for c in ['id', 'name', 'sex']]
)

df2.show(20,0)
+---------------------+--------------------------+------------+
|id                   |name                      |sex         |
+---------------------+--------------------------+------------+
|[445, 2468, 4334, 15]|[xxxy, qwerzui, foo, fooh]|[F, F, M, F]|
+---------------------+--------------------------+------------+

If you want to add them as new columns, do

df2 = df.selectExpr('*',
    *["xpath(col, 'goals/goal/@%s') as %s" % (c,c) for c in ['id', 'name', 'sex']]
)
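Note that `col` in the snippets above stands for the name of the column that holds the XML string, so substitute your own column name. A minimal sketch of building the same expressions for an arbitrary column follows. The helper name `xpath_exprs` and the column name `xml_data` are made up for illustration, and the Spark call is shown only as a comment because it needs an existing dataframe.

```python
def xpath_exprs(xml_col, attrs, path="goals/goal"):
    """Build one Spark SQL xpath() expression per attribute.

    Each expression produces an array<string> column named after the attribute.
    Backticks quote the identifiers in case they contain unusual characters.
    """
    return [
        "xpath(`%s`, '%s/@%s') as `%s`" % (xml_col, path, a, a)
        for a in attrs
    ]


# "xml_data" is a hypothetical column name; use the name of your XML column.
exprs = xpath_exprs("xml_data", ["id", "name", "sex"])
for e in exprs:
    print(e)

# Hypothetical usage, assuming df has a string column called xml_data:
# df2 = df.selectExpr("*", *exprs)
```

Keeping the expression building separate from the Spark call makes it easy to print the generated SQL and check the column name before running it.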

edited Dec 19, 2020 at 14:34

answered Dec 19, 2020 at 13:50


3 Comments

Meiiso Over a year ago

This is really much better than a UDF. Is it possible to attach df2 directly to df? Alternatively, I could append the new columns from df2 to df via withColumn. Thanks! @mck

Nikunj Kakadiya Over a year ago

What is col here that is passed to xpath? I am getting the error: cannot resolve col given input columns :

mck Over a year ago

It is the column that contains the XML.

balderman · Accepted Answer · 2020-12-19 13:49:18Z

The code below generates the 3 lists

import xml.etree.ElementTree as ET

XML = '''<?xml version="1.0" encoding="utf-8"?>
<goals>
    <goal id="445" name="xxxy" sex="F" />
    <goal id="2468" name="qwerzui" sex="F" />
    <goal id="4334" name="foo" sex="M" />
    <goal id="15" name="fooh" sex="F" />
</goals>
'''
final = []
attributes = ['id', 'name', 'sex']
root = ET.fromstring(XML)
for attrib in attributes:
    final.append([g.attrib[attrib] for g in root.findall('goal')])
print(final)

output

[['445', '2468', '4334', '15'], ['xxxy', 'qwerzui', 'foo', 'fooh'], ['F', 'F', 'M', 'F']]
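To use this parsing logic on a dataframe column, it can be wrapped in a plain function that takes one XML string and returns the three lists. The sketch below is one possible way to do that, not a definitive implementation: `parse_goals`, `make_goals_udf` and the column name `xml_data` are illustrative names, and the UDF wrapper assumes pyspark is installed (it is only imported when the wrapper is called). As the xpath answer points out, the built-in function avoids UDF overhead, so this is best treated as a fallback.

```python
import xml.etree.ElementTree as ET

ATTRIBUTES = ["id", "name", "sex"]


def parse_goals(xml_string):
    """Return one list per attribute, in ATTRIBUTES order, over all <goal> tags."""
    if xml_string is None:
        return [[] for _ in ATTRIBUTES]
    # strip() guards against whitespace before the XML declaration.
    root = ET.fromstring(xml_string.strip())
    goals = root.findall("goal")
    # .get() yields None for a missing attribute instead of raising KeyError.
    return [[g.get(a) for g in goals] for a in ATTRIBUTES]


def make_goals_udf():
    """Wrap parse_goals as a Spark UDF returning array<array<string>>."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    return udf(parse_goals, ArrayType(ArrayType(StringType())))


sample = (
    '<?xml version="1.0" encoding="utf-8"?> <goals> '
    '<goal id="445" name="xxxy" sex="F" /> '
    '<goal id="15" name="fooh" sex="F" /> </goals>'
)
print(parse_goals(sample))

# Hypothetical usage, assuming df has a string column called xml_data:
# goals = make_goals_udf()(df["xml_data"])
# df2 = df.select("*", goals[0].alias("id"),
#                 goals[1].alias("name"), goals[2].alias("sex"))
```

Because `parse_goals` is an ordinary Python function, it can be tested on sample strings without a Spark session before it is registered as a UDF.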
