I have created a UDF that takes an XML string, a namespace dictionary, an XPath expression, and an attribute key, and returns an array of values that I plan to explode later with withColumn(col, explode(col)).
I am now trying to apply this function, using PySpark in Databricks, to a dataframe column containing XML strings and create a new column holding the returned arrays.
So far I have based my original approach on this post and have read this post on passing a whole row to withColumn.
I suspect my problem is either with how I am passing the column to the function or with how many arguments my function takes.
My function:
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import *
import xml.etree.ElementTree as ET
def valuelist(xml, path, nsmap, key):
    # Parse the XML string and collect the chosen attribute from every
    # element matched by the XPath expression
    empty = []
    tree = ET.fromstring(xml)
    for value in tree.findall(path, nsmap):
        empty.append(value.get(key))
    return empty

xmlvalue = udf(valuelist, ArrayType(StringType(), True))
Application of the function:
namespaces = {'c': 'urn:IEEE-1671:2010:Common',
              'sc': 'urn:IEEE-1636.99:2013:SimicaCommon',
              'tr': 'urn:IEEE-1636.1:2013:TestResults',
              'trc': 'urn:IEEE-1636.1:2013:TestResultsCollection',
              'ts': 'www.ni.com/TestStand/ATMLTestResults/3.0'}
key = 'name'
path = './/tr:Test'

xml = df.withColumn('testnames', xmlvalue('activitydetail', path, namespaces, key)).limit(10)
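Once testnames is populated, the plan is to explode each array into individual rows, something like this (the exploded variable and the testname output column name are just placeholders):

from pyspark.sql.functions import explode

# each element of the testnames array becomes its own row
exploded = xml.withColumn('testname', explode('testnames'))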
The XML string is ~44,000 characters, so I will not include it in the post. I have already prototyped the function in a separate script using one XML record from the dataframe.
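The prototype looked roughly like this, with a short made-up stand-in in place of the real XML (just to show the kind of elements and attribute I am after):

sample_xml = ('<trc:TestResultsCollection '
              'xmlns:trc="urn:IEEE-1636.1:2013:TestResultsCollection" '
              'xmlns:tr="urn:IEEE-1636.1:2013:TestResults">'
              '<tr:Test name="Voltage Test"/>'
              '<tr:Test name="Current Test"/>'
              '</trc:TestResultsCollection>')

print(valuelist(sample_xml, './/tr:Test', {'tr': 'urn:IEEE-1636.1:2013:TestResults'}, 'name'))
# ['Voltage Test', 'Current Test']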
Edit: The function works if I only pass the column to it; I had capitalized fromString when it should be fromstring. I still don't know why I can't pass multiple parameters, though.
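In other words, a single-parameter version along these lines, with the path, namespaces, and key hard-coded inside the function, does run (names here are placeholders; it reuses the imports and the namespaces dict from above):

def valuelist_testnames(xml):
    # same logic as valuelist, but with path, nsmap, and key baked in
    tree = ET.fromstring(xml)
    return [value.get('name') for value in tree.findall('.//tr:Test', namespaces)]

xmlvalue_testnames = udf(valuelist_testnames, ArrayType(StringType(), True))

xml = df.withColumn('testnames', xmlvalue_testnames('activitydetail')).limit(10)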