2
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

simpleData = [("James", "Sales", "NY", 90000, 34, 10000),
                ("Michael", "Sales", "NY", 86000, 56, 20000),
                ("Robert", "Sales", "CA", 81000, 30, 23000),
                ("Maria", "Finance", "CA", 90000, 24, 23000),
                ("Raman", "Finance", "CA", 99000, 40, 24000),
                ("Scott", "Finance", "NY", 83000, 36, 19000),
                ("Jen", "Finance", "NY", 79000, 53, 15000),
                ("Jeff", "Marketing", "CA", 80000, 25, 18000),
                ("Kumar", "Marketing", "NY", 91000, 50, 21000)
                ]
schema = ["employee_name", "department",
            "state", "salary", "age", "bonus"]
df = spark.createDataFrame(data=simpleData, schema=schema)
data = df.groupBy("department").count() \
    .select(col("department").alias("name"), col("count").alias("value")) \
    .toJSON().collect()
print(data)

spark.stop()

When I ran the code it gives an array string:

[
 '{"name":"Sales","value":3}', 
 '{"name":"Finance","value":4}', 
 '{"name":"Marketing","value":2}'
]

but I don't want an array string, I want an array object to send to frontend

[
 {"name":"Sales","value":3}, 
 {"name":"Finance","value":4}, 
 {"name":"Marketing","value":2}
]

Can anyone help me?

2
  • What happens if you call .collect before calling .toJSON ? Commented Feb 8, 2022 at 5:30
  • It throws an exception AttributeError: 'list' object has no attribute 'toJSON' Commented Feb 8, 2022 at 5:53

2 Answers 2

2

Remove the toJSON() and use the following list comprehension

[d.asDict() for d in data] 
Sign up to request clarification or add additional context in comments.

Comments

0

You need to deserialize the json on the driver side:

import json
deserialized_data = [json.loads(s) for s in data]

print(deserialized_data)
print(deserialized_data[0]["name"])

If you add the above lines after the collect, then deserialized_data will be a list of dictionary objects.

Output of the two prints with your provided data:

[{'name': 'Finance', 'value': 4}, {'name': 'Sales', 'value': 3}, {'name': 'Marketing', 'value': 2}]
Finance

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.