
I have created a UDF that takes an XML string, a namespace dictionary, an XPath expression, and the key for the key/value pair within the XML, and returns an array of values to be exploded later using a withColumn(col, explode(col)).
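
The explode step I have in mind would look roughly like this once the array column exists (column names here are only illustrative):

from pyspark.sql.functions import col, explode

# one output row per element of the array column produced by the UDF
df = df.withColumn('testname', explode(col('testnames')))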

I am now trying to apply this function to a DataFrame column containing XML strings (PySpark on Databricks) and create a new column with the returned arrays.

So far I have used this post as the idea for my original approach and read this post on passing a whole row to a withColumn.

I expect my problem is either with how I am passing the column to the function or with how many arguments my function takes.


My function:

from pyspark.sql.functions import udf, struct
from pyspark.sql.types import *
import xml.etree.ElementTree as ET

def valuelist(xml, path, nsmap, key):
    # Collect the requested attribute from every element
    # matching the XPath expression.
    empty = []
    tree = ET.fromstring(xml)
    for value in tree.findall(path, nsmap):
        empty.append(value.get(key))
    return empty

xmlvalue = udf(valuelist, ArrayType(StringType(), True))

Application of the function:

namespaces = {'c' : 'urn:IEEE-1671:2010:Common',
              'sc' : 'urn:IEEE-1636.99:2013:SimicaCommon',
              'tr' : 'urn:IEEE-1636.1:2013:TestResults',
              'trc' : 'urn:IEEE-1636.1:2013:TestResultsCollection',
              'ts' : 'www.ni.com/TestStand/ATMLTestResults/3.0'}
key = 'name'
path = './/tr:Test'

xml = df.withColumn('testnames', xmlvalue('activitydetail', path, namespaces, key)).limit(10)

The XML string is ~44000 characters so I will not include it in the post. I have already prototyped the function in a separate script using one XML record from the dataframe.


Edit: The function works if I only pass the column to the function; my original mistake was capitalizing ET.fromString when it should be ET.fromstring. I still don't know why I can't pass multiple parameters, though.

1 Answer


The problem seems to be that you are passing plain strings and constants to the UDF, which you cannot do without using the lit function: the UDF's arguments must be Columns, and a bare string is interpreted as a column name, which is why 'activitydetail' works but path, namespaces, and key do not.

In Spark 2.2 there are two ways to add a constant value as a column in a DataFrame:

1) Using lit

2) Using typedLit

The difference between the two is that typedLit can also handle parameterized Scala types, e.g. List, Seq, and Map (typedLit is part of the Scala API).

To keep it simple: you need a Column (one created with lit is an option, but not the only one) whenever the JVM counterpart expects a column and there is no internal conversion in the Python wrapper, or when you want to call a Column-specific method.

Example:

from datetime import datetime
from pyspark.sql.functions import lit

# lit() wraps a Python constant in a Column that Spark can use
df = df.withColumn("Today's Date", lit(datetime.now()))