I'm using Spark 2.4.3 and want to do Structured Streaming with data from a Kafka source. The following code works so far:
from pyspark.sql import SparkSession
from ast import literal_eval
spark = SparkSession.builder \
    .appName("streamer") \
    .getOrCreate()
# Create DataFrame representing the stream
dsraw = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test") \
    .option("startingOffsets", """{"test":{"0":2707422}}""") \
    .load()
# Convert Kafka stream to something readable
ds = dsraw.selectExpr("CAST(value AS STRING)")
# Do query on the raw data
rawQuery = dsraw \
    .writeStream \
    .queryName("qraw") \
    .format("memory") \
    .start()
raw = spark.sql("select * from qraw")
# Do query on the converted data
dsQuery = ds \
    .writeStream \
    .queryName("qds") \
    .format("memory") \
    .start()
sdf = spark.sql("select * from qds")
# I have to access raw otherwise I get errors...
raw.select("value").show()
sdf.show()
# Make the JSON payload accessible
sdf2 = sdf.rdd.map(lambda val: literal_eval(val['value']))
print(sdf2.first())
But I really wonder whether the conversion in the next-to-last line is the most useful/fastest option. Do you have other ideas? Can I stay with (Spark) DataFrames instead of dropping down to the RDD API?
The output of the script is:
+--------------------+
| value|
+--------------------+
|{
"Signal": "[...|
|{
"Signal": "[...|
+--------------------+
only showing top 20 rows
{'Signal': '[1234]', 'Value': 0.0, 'Timestamp': '2019-08-27T13:51:43.7146327Z'}
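One thing I noticed while writing this: `literal_eval` only happens to work because this particular payload contains no JSON-specific literals. For real JSON, `json.loads` would be the safer choice if I stay on the RDD path. A minimal stand-alone comparison (not tied to Spark, record copied from the output above):

```python
import json
from ast import literal_eval

record = '{"Signal": "[1234]", "Value": 0.0, "Timestamp": "2019-08-27T13:51:43.7146327Z"}'

# Both parse this particular record to the same dict...
assert literal_eval(record) == json.loads(record)

# ...but JSON-specific literals (true/false/null) break literal_eval:
json.loads('{"ok": true}')  # fine, gives {'ok': True}
try:
    literal_eval('{"ok": true}')
except ValueError:
    print("literal_eval cannot parse JSON booleans")
```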