
I want to convert a Spark DataFrame into another DataFrame in a specific manner, as follows:

I have this Spark DataFrame:

+---------+------------+
|protocol |   count    |
+---------+------------+
|      TCP|    8231    |
|     ICMP|    7314    |
|      UDP|    5523    |
|     IGMP|    4423    |
|      EGP|    2331    |
+---------+------------+

And I want to turn it into:

+----------------------------------------------------------+
|Aggregated                                                |
+----------------------------------------------------------+
|{tcp: 8231, icmp: 7314, udp: 5523, igmp: 4423, egp: 2331} |
+----------------------------------------------------------+

The aggregated column can be a list, a map, or a string. Is this possible with DataFrame functions, or do I need to create my own UDF to aggregate this?

  • Do you want to use all the rows in the dataframe when doing this? Since there are not that many it would probably be easier to collect the data and use pure Scala to convert it. Commented May 17, 2018 at 8:40
  • @Shaido Yes, all of the rows; and the example given is not all the rows there will be, since the rows will keep increasing over time. Commented May 17, 2018 at 8:43

3 Answers


pivot and toJSON will give you what you need:

import org.apache.spark.sql.functions.first

df.groupBy().pivot("protocol").agg(first("count")).toJSON.show(false)
// +----------------------------------------------------------+                    
// |value                                                     |
// +----------------------------------------------------------+
// |{"EGP":2331,"ICMP":7314,"IGMP":4423,"TCP":8231,"UDP":5523}|
// +----------------------------------------------------------+



Since you want to aggregate all rows into a single one, and there do not seem to be many rows to begin with, you can collect the dataframe to the driver and use pure Scala code to convert it into the format you want.

The following will give you an Array[String]:

import spark.implicits._ // needed for the (String, Int) encoder used by .as

val res = df.as[(String, Int)].collect.map { case (protocol, count) => protocol + ": " + count }

To convert it into a single string, simply do:

val str = res.mkString("{", ", ", "}")
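The driver-side formatting is plain Scala, so it can be sketched without a SparkSession. In the following self-contained sketch, the rows array is a hypothetical stand-in for the result of df.as[(String, Int)].collect; the protocol is lower-cased to match the output shown in the question:

```scala
// Hypothetical stand-in for df.as[(String, Int)].collect
val rows = Array(("TCP", 8231), ("ICMP", 7314), ("UDP", 5523), ("IGMP", 4423), ("EGP", 2331))

// Format each pair, lower-casing the protocol to match the desired output
val res = rows.map { case (protocol, count) => s"${protocol.toLowerCase}: $count" }

// Join with braces around the whole thing and ", " between entries
val str = res.mkString("{", ", ", "}")
// str == "{tcp: 8231, icmp: 7314, udp: 5523, igmp: 4423, egp: 2331}"
```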



Concatenate the columns in the dataframe to create a new column:

import org.apache.spark.sql.functions.{concat, lit, collect_list}
import spark.implicits._

val concat_df = df.withColumn("concat", concat($"protocol", lit(" : "), $"count"))

To aggregate it into a single row as a list, you can do this:

val agg_df = concat_df.groupBy().agg(collect_list("concat").as("aggregated"))
agg_df.show

If you want to get the data as a string instead of a dataframe, you can collect it as follows:

concat_df.select("concat").collect.map(x => x.get(0)).mkString("{", ",", "}")
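Since the question also allows a map shape, note that once the data is collected to the driver it can just as easily become a plain Scala Map instead of a string. A minimal pure-Scala sketch (rows is a hypothetical stand-in for the collected protocol/count pairs):

```scala
// Hypothetical stand-in for the collected (protocol, count) pairs
val rows = Array(("TCP", 8231), ("ICMP", 7314), ("UDP", 5523), ("IGMP", 4423), ("EGP", 2331))

// Build a Map keyed by the lower-cased protocol name
val asMap: Map[String, Int] = rows.map { case (p, c) => p.toLowerCase -> c }.toMap
```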

