
This is the current code:

from pyspark.sql import SparkSession

spark_session = SparkSession\
    .builder\
    .appName("test")\
    .getOrCreate()

lines = spark_session\
    .readStream\
    .format("socket")\
    .option("host", "127.0.0.1")\
    .option("port", 9998)\
    .load()

The 'lines' DataFrame looks like this:
+-------------+
|    value    |
+-------------+
|     a,b,c   |
+-------------+

But I want it to look like this:
+---+---+---+
| a | b | c |
+---+---+---+

I tried using the 'split()' method, but it didn't do what I want: it only splits each string into an array inside a single column, not into multiple columns.

What should I do?


3 Answers


Split the value column, then create new columns by accessing the array index directly, with element_at() (available from Spark 2.4), or with getItem():


from pyspark.sql.functions import col, element_at, split

lines.withColumn("tmp", split(col("value"), ",")).\
    withColumn("col1", col("tmp")[0]).\
    withColumn("col2", col("tmp").getItem(1)).\
    withColumn("col3", element_at(col("tmp"), 3)).\
    drop("tmp", "value").\
    show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|   a|   b|   c|
#+----+----+----+
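Note the mixed index bases in the chain above: direct indexing and getItem() are 0-based, while element_at() is 1-based. A plain-Python sketch of the three accessors (illustrative only, not the Spark API):

```python
tmp = "a,b,c".split(",")  # the array that split(col("value"), ",") produces

col1 = tmp[0]      # like col("tmp")[0]             -- 0-based
col2 = tmp[1]      # like col("tmp").getItem(1)     -- 0-based
col3 = tmp[3 - 1]  # like element_at(col("tmp"), 3) -- 1-based
print(col1, col2, col3)  # a b c
```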

3 Comments

Thank you for your answer, but what is the 'col()' method? I haven't used it. Sorry, I'm new to this.
@NickWick, col() is used to access a column of a DataFrame, i.e. lines.value, lines["value"], and col("value") all refer to the value column in PySpark.
Thanks for your careful explanation. I understand now
0

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark_session = SparkSession\
    .builder\
    .appName("test")\
    .getOrCreate()

lines = spark_session\
    .readStream\
    .format("socket")\
    .option("host", "127.0.0.1")\
    .option("port", 9998)\
    .load()

split_col = f.split(lines['value'], ",")
df = lines.withColumn('col1', split_col.getItem(0))
df = df.withColumn('col2', split_col.getItem(1))
df = df.withColumn('col3', split_col.getItem(2))

# show() is not supported on streaming DataFrames; write to the console sink instead
df.writeStream.format("console").start().awaitTermination()

Comments

0

In case your rows have different numbers of delimiters, and not just 3 per row, you can use the below:

Input:

+-------+
|value  |
+-------+
|a,b,c  |
|d,e,f,g|
+-------+

Solution

import pyspark.sql.functions as F

# Keep only the commas in each value; the maximum comma count
# tells us how many columns the widest row needs (commas + 1)
max_size = df.select(F.max(F.length(F.regexp_replace('value', '[^,]', '')))).first()[0]
out = df.select([F.split("value", ',')[x].alias(f"Col{x+1}") for x in range(max_size + 1)])

Output

out.show()

+----+----+----+----+
|Col1|Col2|Col3|Col4|
+----+----+----+----+
|   a|   b|   c|null|
|   d|   e|   f|   g|
+----+----+----+----+
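The comma-counting trick above can be mirrored in plain Python to see what it does (the `rows` list is just the sample input from this answer, not part of any API):

```python
import re

rows = ["a,b,c", "d,e,f,g"]

# Same idea as F.regexp_replace('value', '[^,]', ''): keep only the commas;
# the maximum comma count gives the number of columns minus one
max_size = max(len(re.sub(r"[^,]", "", v)) for v in rows)  # 3

# Split each row and pad with None so every row has max_size + 1 fields,
# matching the nulls Spark produces for missing trailing columns
table = [(v.split(",") + [None] * (max_size + 1))[: max_size + 1] for v in rows]
print(table)  # [['a', 'b', 'c', None], ['d', 'e', 'f', 'g']]
```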

Comments
