I have the below pyspark dataframe.

Job_name start_time status
A        09:00:00   Not started
B        09:30:00   Completed
C        09:30:00   Running

I am trying to create a list - from above dataframe like below.

Expected output:

lst = ["job A Not started at 09:00:00", "job B Completed at 09:30:00", "job C Running at 09:30:00"]

Is there a way to convert pyspark dataframe to a list like above ?

  • Does this topic help you? Commented Sep 21, 2021 at 11:47
  • It's similar, but not exactly the same. I require all columns of a row to populate a single output list. Commented Sep 21, 2021 at 11:49
  • You could build an additional column with the concatenation of the existing ones, then use the method above. Commented Sep 21, 2021 at 11:51

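The approach in the last comment — concatenate the columns, then collect — can be sketched in plain Python on the collected rows. This is a sketch under the assumption that `df.collect()` returns rows in the column order shown in the question (`Job_name`, `start_time`, `status`):

```python
# Simulated result of df.collect() for the DataFrame in the question.
# Each tuple mirrors a Row in column order: Job_name, start_time, status.
rows = [
    ("A", "09:00:00", "Not started"),
    ("B", "09:30:00", "Completed"),
    ("C", "09:30:00", "Running"),
]

# Concatenate the columns of each row into a single string.
lst = ["job {} {} at {}".format(name, status, start)
       for name, start, status in rows]

# lst == ['job A Not started at 09:00:00',
#         'job B Completed at 09:30:00',
#         'job C Running at 09:30:00']
```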
4 Answers


Using your DF, rearranged slightly:

df = df.select('Job_name','status','start_time')

>>> df.show()
+--------+-----------+----------+
|Job_name|     status|start_time|
+--------+-----------+----------+
|       A|Not started|  09:00:00|
|       B|  Completed|  09:30:00|
|       C|    Running|  09:30:00|
+--------+-----------+----------+

Try the below, using a list comprehension:

li = [' '.join(('Job', row[0], row[1], 'at', row[2])) for row in df.collect()]

>>> li
['Job A Not started at 09:00:00',
 'Job B Completed at 09:30:00',
 'Job C Running at 09:30:00']


You can try with df.rdd.collect():

print(['job {} {} at {}'.format(row['Job_name'], row['status'], row['start_time']) for row in df.rdd.collect()])

should do what you're looking for.


You should combine the required fields using Spark beforehand, then extract the results.

Data Preparation

input_str = """
A|09:00:00   |Not started|
B|09:30:00   |Completed|
C|09:30:00   |Running
""".split("|")

input_values = list(map(lambda x: x.strip() if x.strip() != 'null' else None, input_str))

cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "Job_name start_time status".split()))
            
n = len(input_values)
n_cols = 3

input_list = [tuple(input_values[i:i+n_cols]) for i in range(0,n,n_cols)]

sparkDF = sql.createDataFrame(input_list, cols)

sparkDF.show()

+--------+----------+-----------+
|Job_name|start_time|     status|
+--------+----------+-----------+
|       A|  09:00:00|Not started|
|       B|  09:30:00|  Completed|
|       C|  09:30:00|    Running|
+--------+----------+-----------+

Collect

[x['output'] for x in 
    sparkDF.select(F.concat(
                    F.lit("job "),F.col('Job_name'),F.lit(" "),F.col("status"),F.lit(" at "),F.col('start_time')
                    ).alias("output")
              ).collect()

]

['job A Not started at 09:00:00',
 'job B Completed at 09:30:00',
 'job C Running at 09:30:00']

1 Comment

Unfortunately, I don't think this is the output that the OP expects.

This can be achieved using the concat_ws and collect_list functions:

from pyspark.sql import functions as f

df = df.withColumn('ncol',f.concat_ws(" ",f.lit('Job'),f.col('Job_name'),f.col('status'),f.lit("at"),f.col('start_time')))
df.show(truncate=False)

+--------+----------+-----------+-----------------------------+
|Job_name|start_time|status     |ncol                         |
+--------+----------+-----------+-----------------------------+
|A       |09:00:00  |Not started|Job A Not started at 09:00:00|
|B       |09:30:00  |Completed  |Job B Completed at 09:30:00  |
|C       |09:30:00  |Running    |Job C Running at 09:30:00    |
+--------+----------+-----------+-----------------------------+

df.select(f.collect_list('ncol')).head()[0]

['Job B Completed at 09:30:00','Job C Running at 09:30:00','Job A Not started at 09:00:00']
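Note that collect_list does not guarantee row order, which is why the list above comes back as B, C, A. If order matters, one option (an assumption on my part, not part of the original answer) is to sort the collected list locally, since every string here starts with "Job &lt;name&gt;":

```python
# Output as collect_list may return it (row order is not guaranteed).
collected = [
    'Job B Completed at 09:30:00',
    'Job C Running at 09:30:00',
    'Job A Not started at 09:00:00',
]

# Lexicographic sort restores job-name order for these strings,
# because each one begins with "Job " followed by the job name.
ordered = sorted(collected)
```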
