
I have created a UDF that takes an XML string, a namespace dictionary, an XPath expression, and the key for the key/value pair within the XML, and returns an array of values to be exploded later using a withColumn(col, explode(col)).
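
The explode step I have in mind would look roughly like this once the array column exists (column names here are only illustrative):

from pyspark.sql.functions import col, explode

# one output row per element of the array column produced by the UDF
df = df.withColumn('testname', explode(col('testnames')))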

I am now trying to apply this function to a DataFrame column containing XML strings (PySpark on Databricks) and create a new column with the returned arrays.

So far I have used this post as the idea for my original approach and read this post on passing a whole row to a withColumn.

I expect my problem is either with how I am passing the column to the function or with how many arguments my function takes.


My function:

from pyspark.sql.functions import udf, struct
from pyspark.sql.types import *
import xml.etree.ElementTree as ET

def valuelist(xml, path, nsmap, key):
    # Collect the requested attribute from every element
    # matching the XPath expression.
    empty = []
    tree = ET.fromstring(xml)
    for value in tree.findall(path, nsmap):
        empty.append(value.get(key))
    return empty

xmlvalue = udf(valuelist, ArrayType(StringType(), True))

Application of the function:

namespaces = {'c' : 'urn:IEEE-1671:2010:Common',
              'sc' : 'urn:IEEE-1636.99:2013:SimicaCommon',
              'tr' : 'urn:IEEE-1636.1:2013:TestResults',
              'trc' : 'urn:IEEE-1636.1:2013:TestResultsCollection',
              'ts' : 'www.ni.com/TestStand/ATMLTestResults/3.0'}
key = 'name'
path = './/tr:Test'

xml = df.withColumn('testnames', xmlvalue('activitydetail', path, namespaces, key)).limit(10)

The XML string is ~44000 characters so I will not include it in the post. I have already prototyped the function in a separate script using one XML record from the dataframe.


Edit: The function works if I only pass the column to the function; my original mistake was capitalizing ET.fromString when it should be ET.fromstring. I still don't know why I can't pass multiple parameters, though.

1 Answer


The problem seems to be that you are passing plain strings and constants to the UDF, which you cannot do without using the lit function: the UDF's arguments must be Columns, and a bare string is interpreted as a column name, which is why 'activitydetail' works but path, namespaces, and key do not.

In Spark 2.2 there are two ways to add a constant value as a column in a DataFrame:

1) Using lit

2) Using typedLit

The difference between the two is that typedLit can also handle parameterized Scala types, e.g. List, Seq, and Map (typedLit is part of the Scala API).

To keep it simple: you need a Column (one created with lit is an option, but not the only one) whenever the JVM counterpart expects a column and there is no internal conversion in the Python wrapper, or when you want to call a Column-specific method.

Example:

from datetime import datetime
from pyspark.sql.functions import lit

# lit() wraps a Python constant in a Column that Spark can use
df = df.withColumn("Today's Date", lit(datetime.now()))