I have the below pyspark dataframe.

Job_name start_time status
A        09:00:00   Not started
B        09:30:00   Completed
C        09:30:00   Running

I am trying to create a list - from above dataframe like below.

Expected output:

lst = ["job A Not started at 09:00:00", "job B Completed at 09:30:00", "job C Running at 09:30:00"]

Is there a way to convert pyspark dataframe to a list like above ?

  • Does this topic help you? Commented Sep 21, 2021 at 11:47
  • It's similar, but not exactly the same. I require all columns of a row to populate a single output list. Commented Sep 21, 2021 at 11:49
  • You could build an additional column with the concatenation of the existing ones, then use the method above. Commented Sep 21, 2021 at 11:51

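The approach in the last comment — concatenate the columns, then collect — can be sketched in plain Python on the collected rows. This is a sketch under the assumption that `df.collect()` returns rows in the column order shown in the question (`Job_name`, `start_time`, `status`):

```python
# Simulated result of df.collect() for the DataFrame in the question.
# Each tuple mirrors a Row in column order: Job_name, start_time, status.
rows = [
    ("A", "09:00:00", "Not started"),
    ("B", "09:30:00", "Completed"),
    ("C", "09:30:00", "Running"),
]

# Concatenate the columns of each row into a single string.
lst = ["job {} {} at {}".format(name, status, start)
       for name, start, status in rows]

# lst == ['job A Not started at 09:00:00',
#         'job B Completed at 09:30:00',
#         'job C Running at 09:30:00']
```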
4 Answers


Using your DF, rearranged slightly:

df = df.select('Job_name','status','start_time')

>>> df.show()
+--------+-----------+----------+
|Job_name|     status|start_time|
+--------+-----------+----------+
|       A|Not started|  09:00:00|
|       B|  Completed|  09:30:00|
|       C|    Running|  09:30:00|
+--------+-----------+----------+

Try the below, using a list comprehension:

li = [' '.join(('Job', row[0], row[1], 'at', row[2])) for row in df.collect()]

>>> li
['Job A Not started at 09:00:00',
 'Job B Completed at 09:30:00',
 'Job C Running at 09:30:00']


You can try with df.rdd.collect():

print(['job {} {} at {}'.format(row['Job_name'], row['status'], row['start_time']) for row in df.rdd.collect()])

should do what you're looking for.


You should combine the required fields using Spark beforehand, then extract the results.

Data Preparation

input_str = """
A|09:00:00   |Not started|
B|09:30:00   |Completed|
C|09:30:00   |Running
""".split("|")

input_values = list(map(lambda x: x.strip() if x.strip() != 'null' else None, input_str))

cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "Job_name start_time status".split()))
            
n = len(input_values)
n_cols = 3

input_list = [tuple(input_values[i:i+n_cols]) for i in range(0,n,n_cols)]

sparkDF = sql.createDataFrame(input_list, cols)

sparkDF.show()

+--------+----------+-----------+
|Job_name|start_time|     status|
+--------+----------+-----------+
|       A|  09:00:00|Not started|
|       B|  09:30:00|  Completed|
|       C|  09:30:00|    Running|
+--------+----------+-----------+

Collect

[x['output'] for x in 
    sparkDF.select(F.concat(
                    F.lit("job "),F.col('Job_name'),F.lit(" "),F.col("status"),F.lit(" at "),F.col('start_time')
                    ).alias("output")
              ).collect()

]

['job A Not started at 09:00:00',
 'job B Completed at 09:30:00',
 'job C Running at 09:30:00']

1 Comment

Unfortunately, I don't think this is the output that the OP expects.

This can be achieved using the concat_ws and collect_list functions:

from pyspark.sql import functions as f

df = df.withColumn('ncol',f.concat_ws(" ",f.lit('Job'),f.col('Job_name'),f.col('status'),f.lit("at"),f.col('start_time')))
df.show(truncate=False)

+--------+----------+-----------+-----------------------------+
|Job_name|start_time|status     |ncol                         |
+--------+----------+-----------+-----------------------------+
|A       |09:00:00  |Not started|Job A Not started at 09:00:00|
|B       |09:30:00  |Completed  |Job B Completed at 09:30:00  |
|C       |09:30:00  |Running    |Job C Running at 09:30:00    |
+--------+----------+-----------+-----------------------------+

df.select(f.collect_list('ncol')).head()[0]

['Job B Completed at 09:30:00','Job C Running at 09:30:00','Job A Not started at 09:00:00']
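Note that collect_list does not guarantee row order, which is why the list above comes back as B, C, A. If order matters, one option (an assumption on my part, not part of the original answer) is to sort the collected list locally, since every string here starts with "Job &lt;name&gt;":

```python
# Output as collect_list may return it (row order is not guaranteed).
collected = [
    'Job B Completed at 09:30:00',
    'Job C Running at 09:30:00',
    'Job A Not started at 09:00:00',
]

# Lexicographic sort restores job-name order for these strings,
# because each one begins with "Job " followed by the job name.
ordered = sorted(collected)
```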
