I have a Spark (Python) DataFrame with two columns: a user ID and an array of arrays, which Spark shows as a wrapped array like so:
[WrappedArray(9, 10, 11, 12), WrappedArray(20, 21, 22, 23, 24, 25, 26)]
As plain nested lists, this would be:
[[9, 10, 11, 12], [20, 21, 22, 23, 24, 25, 26]]
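
For reference, here is a minimal way to build a DataFrame with this shape (the column names `user_id` and `arrays` are just placeholders for my real ones):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder data with the same shape as my real DataFrame:
# one user ID column and one array-of-arrays column.
df = spark.createDataFrame(
    [(1, [[9, 10, 11, 12], [20, 21, 22, 23, 24, 25, 26]])],
    ["user_id", "arrays"],
)

df.printSchema()
# prints roughly:
# root
#  |-- user_id: long (nullable = true)
#  |-- arrays: array (nullable = true)
#  |    |-- element: array (containsNull = true)
#  |    |    |-- element: long (containsNull = true)
```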
I want to perform operations on each of the sub-arrays, for example take a third list and check whether any of its values appear in the first sub-array, but I can't seem to find solutions for PySpark 2.0 (only older, Scala-specific solutions like this and this).
How does one access (and in general work with) wrapped arrays? What is an efficient way to do what I described above?
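
For concreteness, something like the sketch below is what I have in mind (with `lookup` standing in for the third list), but I'm not sure whether a plain Python UDF is the right or an efficient approach here:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# The "third list" I want to compare against the first sub-array.
lookup = [10, 30]

def first_subarray_overlaps(arrays):
    # I assume the column arrives here as a plain list of lists,
    # so ordinary Python set operations should work.
    if not arrays:
        return False
    return bool(set(lookup) & set(arrays[0]))

overlaps_udf = udf(first_subarray_overlaps, BooleanType())

df.withColumn("overlaps_first", overlaps_udf("arrays")).show()
```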