
I need to fetch an array field from my dataframe and assign it to a variable for further processing. I am using the collect() function, but it's not working properly.

Input dataframe:

Department   Language
[A, B, C]    English
[]           Spanish

How can I fetch these values and assign them to variables like below:

English = [A,B,C]

Spanish = []

  • Fetch array from dataframe? You can use my_variable = df1.my_column. Commented Nov 22, 2022 at 19:39
  • Please post more details about your code, and the expected output. The example you give at the end is hard to interpret, or translate into code. Commented Nov 22, 2022 at 19:41
  • I want to assign an element in my dataframe to a variable. It seems to work with collect()[0][0] if the array is not null. If the array is null, I get a "tuple index out of range" error. Commented Nov 22, 2022 at 20:05
  • I want the code to fetch the array element and assign it to a variable as a list. Even if the array is empty, I need to get an empty list. Please share your thoughts. Commented Nov 22, 2022 at 20:43
  • my_variable = [df1.my_column[i]] if i<len(df1.my_column) else [] that would fetch an element as a list, or return an empty one. Is that the target application? Commented Nov 22, 2022 at 21:01

2 Answers


The simplest solution I came up with is just extracting the data with collect and explicitly assigning it to predefined variables, like so:

from pyspark.sql.types import StringType, ArrayType, StructType, StructField

schema = StructType([
    StructField("Department", ArrayType(StringType()), True),
    StructField("Language", StringType(), True)
])

df = spark.createDataFrame([(["A", "B", "C"], "English"), ([], "Spanish")], schema)

rows = df.collect()  # collect once; each element is a Row carrying both columns
English = rows[0]["Department"]
Spanish = rows[1]["Department"]
print(f"English: {English}, Spanish: {Spanish}")

# English: ['A', 'B', 'C'], Spanish: []
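
If you'd rather not hard-code the row positions, a small variation on the same idea (my own sketch, using the df above rather than anything from the original answer) keys the collected rows by Language, which also gives Spanish its empty list:

# Build a Language -> Department lookup from the collected rows.
departments = {row["Language"]: row["Department"] for row in df.collect()}

English = departments["English"]   # ['A', 'B', 'C']
Spanish = departments["Spanish"]   # []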

Comments


EDIT: I completely brain-farted and missed that this was a PySpark question.

The code below might still be helpful if you convert your PySpark DataFrame to pandas, which for your situation might not be as ridiculous as it sounds. If the table is too big to fit in a pandas DataFrame, then it's too big to store all the arrays in a variable anyway. You can probably use .filter() and .select() to shrink it first.
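
For what it's worth, a minimal sketch of that shrink-then-convert idea (my own addition, assuming the question's column names and an active SparkSession) could look like:

from pyspark.sql import functions as F

# Keep only the rows and columns you need before converting to pandas.
small_pdf = (
    df.filter(F.col("Language") == "English")
      .select("Department")
      .toPandas()
)
english = small_pdf["Department"].iloc[0] if not small_pdf.empty else []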

Old Answer:


The best way to approach this really depends on the complexity of your dataframe. Here are two ways:

# To recreate your dataframe
import pandas as pd

df = pd.DataFrame({
    'Department': [['A', 'B', 'C']],
    'Language': 'English'
})

df.loc[df.Language == 'English']
# Will return all rows where Language is English. If you only want Department then:

df.loc[df.Language == 'English'].Department
# This will return a Series containing your list. If you are always expecting a single match, add [0] as in:

df.loc[df.Language == 'English'].Department[0]
# Which will return only your list.

# The alternate method below isn't great but might be preferable in some circumstances,
# and also only if you expect a single match from any query.

department_lookup = df[['Language', 'Department']].set_index('Language').to_dict()['Department']

department_lookup['English']
# Returns your list

# This will make a dictionary where 'Language' is the key and 'Department' is the value.
# It is more work to set up and only works for a two-column relationship, but you might
# prefer working with dictionaries depending on the use case.

If you're having datatype issues, it may have to do with how the DataFrame is being loaded rather than how you're accessing it. Pandas loves to convert lists to strings.


# If I saved and reloaded the df like so:
df.to_csv("the_df.csv")
df = pd.read_csv("the_df.csv")

# Then we would see that the dtype has become a string, as in "[A, B, C]" rather than ["A", "B", "C"].

# We can typically correct this by giving pandas a method for converting the incoming string to a list.
# This is done with the 'converters' argument, which takes a dictionary where the keys are column names
# and the values are functions, as such:

df = pd.read_csv("the_df.csv", converters={"Department": lambda x: x.strip("[]").split(", ")})

# df['Department'] should now contain lists rather than strings

It's important to note that the lambda function is only reliable if Python has converted a Python list into a string in order to store the dataframe. Converting a list string back into a list has been addressed here.
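
In case it helps, one common approach to that conversion (my own addition, not taken from the linked discussion) is ast.literal_eval, which assumes the stored strings are valid Python literals such as "['A', 'B', 'C']":

import ast
import pandas as pd

def parse_list_cell(value):
    # Parse a string like "['A', 'B', 'C']" back into a Python list;
    # fall back to an empty list for blanks or malformed values.
    try:
        return ast.literal_eval(value) if value else []
    except (ValueError, SyntaxError):
        return []

df = pd.read_csv("the_df.csv", converters={"Department": parse_list_cell})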

5 Comments

Thank you for your response. Can I get an equivalent approach in PySpark?
I had a brain fart, missed the PySpark tag, and didn't process that you used collect().
It seems to work with collect()[0][0] if the array is not null. If the array is null, I get a "tuple index out of range" error. Any thoughts on how to fix this?
I want the code to fetch the array element and assign it to a variable as a list. Even if the array is empty, I need to get an empty list.
It has been a while since I used PySpark, so I don't want to offer specific snippets which may not work, but it seems to me like your issue may be best solved by filling null cells with [] before collecting.
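
A minimal sketch of that last suggestion (my own addition, assuming the schema from the other answer, where Department is array<string> and may contain nulls):

from pyspark.sql import functions as F

# Replace null Department values with an empty array<string> so that
# collect() always yields a Python list (possibly empty), never None.
df_filled = df.withColumn(
    "Department",
    F.coalesce(F.col("Department"), F.array().cast("array<string>"))
)

spanish = df_filled.filter(F.col("Language") == "Spanish").collect()[0]["Department"]
# spanish == []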
