0

I have an Array column in PySpark which looks something like this

Parsed

[/level1/level2, /level2/level3/level4, /level1/level2]
[/level1/level2, /level2/level3]
[/level1/level2, /level1/level2/level3, /level1/level2]
[/level1/level2, /level1/level2/level3, /level1/level2]

I want something like this

Parsed
[level1, level2, level1]
[level1, level2]
[level1, level1, level1]

Sample row

Parsed
[/president/vicepresident/seniormanager, /president/seniormanager, /manager/teamlead]

Expected output

Parsed
[ president, president, manager]

Basically, I want only the first level in each element of array

My approach

df.filter(col("Parsed")
    .rlike("/Level/")
  ).show()
2
  • share sample row and expected out, spark version... etc Commented Jul 27, 2022 at 13:59
  • is it a nested json or csv ,please write all necessary it is very difficult? Commented Jul 27, 2022 at 14:04

2 Answers 2

1

Here's an approach in .

# input data
data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['arr_col'])

# +-------------------------------------------------------+
# |arr_col                                                |
# +-------------------------------------------------------+
# |[/level1/level2, /level2/level3/level4, /level1/level2]|
# |[/level1/level2, /level2/level3]                       |
# |[/level1/level2, /level1/level2/level3, /level1/level2]|
# |[/level1/level2, /level1/level2/level3, /level1/level2]|
# +-------------------------------------------------------+

data_sdf. \
    withColumn('rn', func.row_number().over(wd.orderBy(func.lit(1)))). \
    withColumn('arr_col_exploded', func.explode('arr_col')). \
    withColumn('arr_col_exp_split', func.split('arr_col_exploded', '\/')). \
    withColumn('first_element', func.col('arr_col_exp_split')[1]). \
    groupBy('rn', 'arr_col'). \
    agg(func.collect_list('first_element').alias('first_element_arr')). \
    show(truncate=False)

# +---+-------------------------------------------------------+------------------------+
# |rn |arr_col                                                |first_element_arr       |
# +---+-------------------------------------------------------+------------------------+
# |1  |[/level1/level2, /level2/level3/level4, /level1/level2]|[level1, level2, level1]|
# |2  |[/level1/level2, /level2/level3]                       |[level1, level2]        |
# |3  |[/level1/level2, /level1/level2/level3, /level1/level2]|[level1, level1, level1]|
# |4  |[/level1/level2, /level1/level2/level3, /level1/level2]|[level1, level1, level1]|
# +---+-------------------------------------------------------+------------------------+

The rn is to help in grouping, if there are duplicate input arrays. The idea is to explode the input array and then split the exploded elements which creates an array of the elements that were delimited by '/'. Once split, we can pull out the second element (which is actually the first element) as the first will be a null (due to the first '/'). Finally, use collect_list to create an array of the first elements.

Sign up to request clarification or add additional context in comments.

Comments

1

You can use pd.DataFrame.explode to explode the rows containing list of string then using pd.Series.str.split you can get the first data and then we can group the data back on index by using pd.DataFrame.groupby.

dict_ = { 'levels' : [['/level1/level2', '/level2/level3/level4', '/level1/level2'],
['/level1/level2', '/level2/level3'],
['/level1/level2', '/level1/level2/level3', '/level1/level2'],
['/level1/level2', '/level1/level2/level3', '/level1/level2']]}

df = pd.DataFrame(dict_)

df['levels'].explode().str.split('/', expand=True)[1].reset_index().groupby('index')[1].apply(list)

Output

which gives us the expected output

0    [level1, level2, level1]
1            [level1, level2]
2    [level1, level1, level1]
3    [level1, level1, level1]
Name: 1, dtype: object

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.