
This is the current code:

from pyspark.sql import SparkSession

spark_session = SparkSession\
    .builder\
    .appName("test")\
    .getOrCreate()

lines = spark_session\
    .readStream\
    .format("socket")\
    .option("host", "127.0.0.1")\
    .option("port", 9998)\
    .load()

The 'lines' DataFrame looks like this:
+-------------+
|    value    |
+-------------+
|     a,b,c   |
+-------------+

But I want it to look like this:
+---+---+---+
| a | b | c |
+---+---+---+

I tried using the 'split()' method, but it didn't do what I want: it only splits each string into an array inside a single column, not into multiple columns.

What should I do?


3 Answers


Split the value column, then create new columns by accessing the array index directly, with element_at() (available from Spark 2.4), or with getItem():


from pyspark.sql.functions import col, element_at, split

lines.withColumn("tmp", split(col("value"), ",")).\
    withColumn("col1", col("tmp")[0]).\
    withColumn("col2", col("tmp").getItem(1)).\
    withColumn("col3", element_at(col("tmp"), 3)).\
    drop("tmp", "value").\
    show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|   a|   b|   c|
#+----+----+----+
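Note the mixed index bases in the chain above: direct indexing and getItem() are 0-based, while element_at() is 1-based. A plain-Python sketch of the three accessors (illustrative only, not the Spark API):

```python
tmp = "a,b,c".split(",")  # the array that split(col("value"), ",") produces

col1 = tmp[0]      # like col("tmp")[0]             -- 0-based
col2 = tmp[1]      # like col("tmp").getItem(1)     -- 0-based
col3 = tmp[3 - 1]  # like element_at(col("tmp"), 3) -- 1-based
print(col1, col2, col3)  # a b c
```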

3 Comments

Thank you for your answer, but what is the 'col()' method? I haven't used it. Sorry, I'm new to this.
@NickWick, col() is used to access a column of a DataFrame, i.e. lines.value, lines["value"], and col("value") all refer to the value column in PySpark.
Thanks for your careful explanation. I understand now
0

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark_session = SparkSession\
    .builder\
    .appName("test")\
    .getOrCreate()

lines = spark_session\
    .readStream\
    .format("socket")\
    .option("host", "127.0.0.1")\
    .option("port", 9998)\
    .load()

split_col = f.split(lines['value'], ",")
df = lines.withColumn('col1', split_col.getItem(0))
df = df.withColumn('col2', split_col.getItem(1))
df = df.withColumn('col3', split_col.getItem(2))

# show() is not supported on streaming DataFrames; write to the console sink instead
df.writeStream.format("console").start().awaitTermination()

Comments

0

In case your rows have different numbers of delimiters, and not just 3 per row, you can use the below:

Input:

+-------+
|value  |
+-------+
|a,b,c  |
|d,e,f,g|
+-------+

Solution

import pyspark.sql.functions as F

# Keep only the commas in each value; the maximum comma count
# tells us how many columns the widest row needs (commas + 1)
max_size = df.select(F.max(F.length(F.regexp_replace('value', '[^,]', '')))).first()[0]
out = df.select([F.split("value", ',')[x].alias(f"Col{x+1}") for x in range(max_size + 1)])

Output

out.show()

+----+----+----+----+
|Col1|Col2|Col3|Col4|
+----+----+----+----+
|   a|   b|   c|null|
|   d|   e|   f|   g|
+----+----+----+----+
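The comma-counting trick above can be mirrored in plain Python to see what it does (the `rows` list is just the sample input from this answer, not part of any API):

```python
import re

rows = ["a,b,c", "d,e,f,g"]

# Same idea as F.regexp_replace('value', '[^,]', ''): keep only the commas;
# the maximum comma count gives the number of columns minus one
max_size = max(len(re.sub(r"[^,]", "", v)) for v in rows)  # 3

# Split each row and pad with None so every row has max_size + 1 fields,
# matching the nulls Spark produces for missing trailing columns
table = [(v.split(",") + [None] * (max_size + 1))[: max_size + 1] for v in rows]
print(table)  # [['a', 'b', 'c', None], ['d', 'e', 'f', 'g']]
```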

Comments
