I have a PySpark DataFrame in which one column contains XML. The XML in each row looks like the example below; some rows have 2 entries, others 3 or 4:

Example of one row entry:

<?xml version="1.0" encoding="utf-8"?> <goals> <goal id="445" name="xxxy" sex="F" /> <goal id="2468" name="qwerzui" sex="F" /> <goal id="4334" name="foo" sex="M" /> <goal id="15" name="fooh" sex="F" /> </goals>

I need to parse the goal id, name and sex values out of it and create columns from them.

Since an XML document can have several entries, it is difficult to generate a fixed number of columns from it. My idea was to create one column per attribute (that is, add 3 columns to the DataFrame), with each column holding a list.

In this example, the PySpark DataFrame would be extended with these columns:

goal id              name                      sex
[445,2468,4334,15]   [xxxy,qwerzui,foo,fooh]   [F,F,M,F]

I was thinking of a UDF that iterates over the XML column and creates the new columns from it. Does it make sense to do it this way, given that I also want to analyze the data later? Lists in columns are not that common.

I tried it with:

import xml.etree.ElementTree as ET

root = ET.fromstring(string)

With the following I can access the values inside, but I am not able to turn it into a proper UDF that expands my PySpark DataFrame.

for child in root:
    print(child.tag, child.attrib)

# The example XML has the attributes id, name and sex (there is no 'age').
for child in root:
    print(child.attrib['id'], child.attrib['sex'])

Unfortunately, the other solutions on Stack Overflow did not help, so I am hoping for a solution to my problem.

2 Answers

Use Spark SQL's xpath function. It needs no UDF and should perform better.

df2 = df.selectExpr(
    ["xpath(col, 'goals/goal/@%s') as %s" % (c,c) for c in ['id', 'name', 'sex']]
)

df2.show(20,0)
+---------------------+--------------------------+------------+
|id                   |name                      |sex         |
+---------------------+--------------------------+------------+
|[445, 2468, 4334, 15]|[xxxy, qwerzui, foo, fooh]|[F, F, M, F]|
+---------------------+--------------------------+------------+

If you want to add them as new columns next to the existing ones, use:

df2 = df.selectExpr('*',
    *["xpath(col, 'goals/goal/@%s') as %s" % (c,c) for c in ['id', 'name', 'sex']]
)
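
In the snippets above, `col` stands for the name of the column that holds the XML, so it has to be replaced with the real column name. A minimal sketch of building the same expression strings for an arbitrary column follows; the helper `build_xpath_exprs` and the column name `xml_data` are hypothetical, chosen only for illustration:

```python
def build_xpath_exprs(xml_col, attrs):
    """Return one Spark SQL expression string per attribute.

    xml_col is the name of the DataFrame column containing the XML
    (hypothetical here); attrs are the <goal> attributes to extract.
    """
    return [
        "xpath(%s, 'goals/goal/@%s') as %s" % (xml_col, attr, attr)
        for attr in attrs
    ]


# 'xml_data' is an assumed column name; use the one from your DataFrame.
exprs = build_xpath_exprs("xml_data", ["id", "name", "sex"])
for expr in exprs:
    print(expr)
```

The resulting strings can then be passed to selectExpr in the same way as above, for example `df.selectExpr('*', *exprs)`.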

3 Comments

This is much better than a UDF. Is it possible to attach df2 directly to df? Alternatively, I could append the new columns from df2 to df via withColumn. Thanks! @mck
What is col here that is passed to xpath? I am getting the error: cannot resolve col given input columns :
It is the column that contains the XML.

The code below generates the 3 lists (one per attribute) with plain Python:

import xml.etree.ElementTree as ET

XML = '''<?xml version="1.0" encoding="utf-8"?>
<goals>
    <goal id="445" name="xxxy" sex="F" />
    <goal id="2468" name="qwerzui" sex="F" />
    <goal id="4334" name="foo" sex="M" />
    <goal id="15" name="fooh" sex="F" />
</goals>
'''
attributes = ['id', 'name', 'sex']
root = ET.fromstring(XML)
# Build one list per attribute, in the order given in `attributes`.
final = [[g.attrib[attrib] for g in root.findall('goal')] for attrib in attributes]
print(final)

Output:

[['445', '2468', '4334', '15'], ['xxxy', 'qwerzui', 'foo', 'fooh'], ['F', 'F', 'M', 'F']]
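
Since the question asks for a UDF, the same logic can be put into a plain function that takes one XML string and returns the three lists. This is only a sketch: the function name `parse_goals` and the choice to return empty lists for missing or malformed XML are assumptions, not something stated in the question.

```python
import xml.etree.ElementTree as ET

ATTRIBUTES = ("id", "name", "sex")


def parse_goals(xml_string):
    """Return one list per attribute, in ATTRIBUTES order.

    Assumption: null, empty or malformed XML yields empty lists
    instead of raising, so a single bad row does not fail the job.
    """
    if not xml_string:
        return [[] for _ in ATTRIBUTES]
    try:
        root = ET.fromstring(xml_string.strip())
    except ET.ParseError:
        return [[] for _ in ATTRIBUTES]
    goals = root.findall("goal")
    # .get() returns None when an attribute is missing on a <goal>.
    return [[g.get(attr) for g in goals] for attr in ATTRIBUTES]


sample = (
    '<?xml version="1.0" encoding="utf-8"?> <goals> '
    '<goal id="445" name="xxxy" sex="F" /> '
    '<goal id="15" name="fooh" sex="F" /> </goals>'
)
ids, names, sexes = parse_goals(sample)
print(ids, names, sexes)
```

In PySpark, a function like this could be wrapped with `pyspark.sql.functions.udf` and an array-of-arrays-of-strings return type, but the xpath approach from the other answer avoids the overhead of a Python UDF.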
