
I can use neither PySpark nor Scala; I can only write SQL code. I have a table with two columns, item_id and name.

item_id  name
1        name1
1        name2
1        name3
2        name4
2        name5

I want to generate a result where the names for each item_id are concatenated:

item_id  names
1        name1-name2-name3
2        name4-name5

How do I create such a table with Spark SQL?
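
For reference, the sample data can be reproduced as a temporary view; the name tableName below is just a placeholder:

-- Set up the sample data as a temp view (the view name is illustrative)
CREATE OR REPLACE TEMPORARY VIEW tableName AS
SELECT * FROM VALUES
  (1, 'name1'),
  (1, 'name2'),
  (1, 'name3'),
  (2, 'name4'),
  (2, 'name5')
AS t(item_id, name);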


3 Answers


The beauty of Spark SQL is that once you have a solution in any of the supported languages (Scala, Java, Python, R or SQL), you can usually figure out the other variants.

The following SQL statement seems to do what you ask for:

SELECT item_id, array_join(collect_list(name), '-') as names 
FROM tableName
GROUP BY item_id

In spark-shell it gives the following result:

scala> sql("select item_id, array_join(collect_list(name), '-') as names from so group by item_id").show
+-------+-----------------+
|item_id|            names|
+-------+-----------------+
|      1|name1-name2-name3|
|      2|      name4-name5|
+-------+-----------------+
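
One caveat worth noting: collect_list does not guarantee the order of the collected values, so the names may come back in any order once the data has been shuffled across partitions. If a deterministic (here alphabetical) order matters, sort_array can sort the array before joining; a sketch against the same table:

SELECT item_id, array_join(sort_array(collect_list(name)), '-') AS names
FROM tableName
GROUP BY item_id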

You can try the below (PySpark):

from pyspark.sql.functions import array_join, collect_list

(
    df.orderBy('name')  # pre-sort; note this order is not guaranteed to survive the shuffle
    .groupBy('item_id')
    .agg(
        # collect the names per item_id and join them with '-'
        array_join(collect_list('name'), '-').alias('names')
    )
)

1 Comment

I can only use SQL

You can use the Spark DataFrame's groupBy and agg methods together with the concat_ws function:

import org.apache.spark.sql.functions._
import spark.implicits._  // for the $"..." column syntax

df.groupBy($"item_id").agg(concat_ws("-", collect_list($"name")).alias("names")).show()

This groups the rows by item_id and aggregates the name values in each group by concatenating them with a - separator.
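
Since the question is restricted to SQL, the same concat_ws approach should also carry over to a plain Spark SQL statement (again assuming the table is named tableName):

SELECT item_id, concat_ws('-', collect_list(name)) AS names
FROM tableName
GROUP BY item_id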

1 Comment

I can only use SQL
