I'm new to PySpark and don't yet have a full overview of the available methods. I want to get the unique values of a single column of a PySpark DataFrame. This approach doesn't work:

F.array_distinct(my_spark_df.my_column).???

Whatever function I try in place of ??? (toPandas(), collect(), display(), etc.), I get:

TypeError: 'Column' object is not callable

I also found this thread, which is similar, but it didn't help in my case because I want to select only the distinct values before collecting them.

2 Answers

Directly after posting my question, I had another idea and it worked :)

It seems I was on the wrong track. The column functions are the wrong approach here: they return Column expressions, which have no toPandas() or collect() methods. Instead, we need to stay at the DataFrame level and do the operations there, where toPandas() is available:

my_spark_df.select("my_column").distinct().toPandas()

If you just want distinct values for my_column you can try:

my_spark_df.select('my_column').distinct().collect()

This returns a list of Row objects.

You can get a plain list of values with:

rows = my_spark_df.select('my_column').distinct().collect()
distinct_vals = [row['my_column'] for row in rows]

1 Comment

This gives a list of Row objects. I actually don't know what to do with them either.
