My objective is to fetch the values of a column from a PySpark DataFrame into a variable, as a list if possible.

Expected output = ["a", "b", "c", ... ]

I tried:

[
  col.__getitem__("x")
  for col in data.select("x").collect()
]

But it gives a list of Row objects.

Output : [Row(x='a'), Row(x='b'), Row(x='c'), ...]

I don't want to use collect, and I don't need Row objects.

I tried another method:

data.select(f.collect_list("x")).collect()

This is slightly better than the earlier version, but I get:

Output = [Row(collect_list(x)=['a', 'b', 'c', ...])]

Thanks in advance and Happy new year!

  • Are you using Azure Databricks or AWS? Commented Jan 2, 2024 at 13:18
  • @DileeprajnarayanThumula Sorry to ask, but why does it depend on the cloud? BTW, it's Azure. Commented Jan 2, 2024 at 13:28

1 Answer

I tried three different solutions:

df.select(f.collect_list("x").alias("temp")).first()["temp"] 
Time taken : 32.43s

df.select("x").rdd.flatMap(lambda x:x).collect()
Time taken : 13.19s

[col.__getitem__("x") for col in df.select("x").collect()]
Time taken : 22.77s

Even though it still uses collect, the second (RDD-based) solution was faster than the others. P.S. df.count ~ 116M
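
For anyone who wants to repeat the comparison on their own data, here is a minimal sketch of how such timings could be taken. The helper names `timed` and `compare_approaches` are hypothetical and not part of PySpark, and `df` is assumed to be an existing DataFrame with a column "x". Numbers will depend on cluster size, partitioning, and whether the data is already cached, so treat any single run as a rough indication only.

```python
import time


def timed(fn):
    """Call fn() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def compare_approaches(df, column="x"):
    """Time the three approaches from this answer on an existing DataFrame.

    Assumption: `df` is a pyspark.sql.DataFrame that contains `column`.
    PySpark is imported lazily so `timed` can be used without it.
    """
    from pyspark.sql import functions as f

    approaches = {
        "collect_list + first": lambda: df.select(
            f.collect_list(column).alias("temp")
        ).first()["temp"],
        "rdd.flatMap + collect": lambda: df.select(column)
        .rdd.flatMap(lambda row: row)
        .collect(),
        "collect + comprehension": lambda: [
            row[column] for row in df.select(column).collect()
        ],
    }
    # Map each approach name to the seconds it took; results are discarded.
    return {name: timed(fn)[1] for name, fn in approaches.items()}
```

Calling `compare_approaches(df)` should return a dict of seconds per approach. All three still bring the whole column to the driver, so with ~116M rows the driver needs enough memory to hold the resulting list.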
