
I need to fetch an array field from my dataframe and assign it to a variable for further processing. I am using the collect() function, but it's not working properly.

Input dataframe:

Department   Language
[A, B, C]    English
[]           Spanish

How can I fetch these values and assign them to variables like below:

English = [A,B,C]

Spanish = []

  • Fetch array from dataframe? You can use my_variable = df1.my_column. Commented Nov 22, 2022 at 19:39
  • Please post more details about your code, and the expected output. The example you give at the end is hard to interpret, or translate into code. Commented Nov 22, 2022 at 19:41
  • I want to assign an element in my dataframe to a variable. It seems to work with collect()[0][0] if the array is not null. If the array is null, I get a "tuple index out of range" error. Commented Nov 22, 2022 at 20:05
  • I want the code to fetch the array element and assign it to a variable as a list. Even if the array is empty, I need to get an empty list. Please share your thoughts. Commented Nov 22, 2022 at 20:43
  • my_variable = [df1.my_column[i]] if i<len(df1.my_column) else [] that would fetch an element as a list, or return an empty one. Is that the target application? Commented Nov 22, 2022 at 21:01

2 Answers


The simplest solution I came up with is just extracting the data with collect and explicitly assigning it to predefined variables, like so:

from pyspark.sql.types import StringType, ArrayType, StructType, StructField

schema = StructType([
    StructField("Department", ArrayType(StringType()), True),
    StructField("Language", StringType(), True)
])

df = spark.createDataFrame([(["A", "B", "C"], "English"), ([], "Spanish")], schema)

rows = df.collect()  # collect once; each element is a Row carrying both columns
English = rows[0]["Department"]
Spanish = rows[1]["Department"]
print(f"English: {English}, Spanish: {Spanish}")

# English: ['A', 'B', 'C'], Spanish: []
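
If you'd rather not hard-code the row positions, a small variation on the same idea (my own sketch, using the df above rather than anything from the original answer) keys the collected rows by Language, which also gives Spanish its empty list:

# Build a Language -> Department lookup from the collected rows.
departments = {row["Language"]: row["Department"] for row in df.collect()}

English = departments["English"]   # ['A', 'B', 'C']
Spanish = departments["Spanish"]   # []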

Comments


EDIT: I completely brain-farted and missed that this was a PySpark question.

The code below might still be helpful if you convert your PySpark DataFrame to pandas, which for your situation might not be as ridiculous as it sounds. If the table is too big to fit in a pandas DataFrame, then it's too big to store all the arrays in a variable anyway. You can probably use .filter() and .select() to shrink it first.
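
For what it's worth, a minimal sketch of that shrink-then-convert idea (my own addition, assuming the question's column names and an active SparkSession) could look like:

from pyspark.sql import functions as F

# Keep only the rows and columns you need before converting to pandas.
small_pdf = (
    df.filter(F.col("Language") == "English")
      .select("Department")
      .toPandas()
)
english = small_pdf["Department"].iloc[0] if not small_pdf.empty else []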

Old Answer:


The best way to approach this really depends on the complexity of your dataframe. Here are two ways:

# To recreate your dataframe
import pandas as pd

df = pd.DataFrame({
    'Department': [['A', 'B', 'C']],
    'Language': 'English'
})

df.loc[df.Language == 'English']
# Will return all rows where Language is English. If you only want Department then:

df.loc[df.Language == 'English'].Department
# This will return a Series containing your list. If you are always expecting a single match, add [0] as in:

df.loc[df.Language == 'English'].Department[0]
# Which will return only your list.

# The alternate method below isn't great but might be preferable in some circumstances,
# and also only if you expect a single match from any query.

department_lookup = df[['Language', 'Department']].set_index('Language').to_dict()['Department']

department_lookup['English']
# Returns your list

# This will make a dictionary where 'Language' is the key and 'Department' is the value.
# It is more work to set up and only works for a two-column relationship, but you might
# prefer working with dictionaries depending on the use case.

If you're having datatype issues, it may have to do with how the DataFrame is being loaded rather than how you're accessing it. Pandas loves to convert lists to strings.


# If I saved and reloaded the df like so:
df.to_csv("the_df.csv")
df = pd.read_csv("the_df.csv")

# Then we would see that the dtype has become a string, as in "[A, B, C]" rather than ["A", "B", "C"].

# We can typically correct this by giving pandas a method for converting the incoming string to a list.
# This is done with the 'converters' argument, which takes a dictionary where the keys are column names
# and the values are functions, as such:

df = pd.read_csv("the_df.csv", converters={"Department": lambda x: x.strip("[]").split(", ")})

# df['Department'] should now contain lists rather than strings

It's important to note that the lambda function is only reliable if Python has converted a Python list into a string in order to store the dataframe. Converting a list string back into a list has been addressed here.
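
In case it helps, one common approach to that conversion (my own addition, not taken from the linked discussion) is ast.literal_eval, which assumes the stored strings are valid Python literals such as "['A', 'B', 'C']":

import ast
import pandas as pd

def parse_list_cell(value):
    # Parse a string like "['A', 'B', 'C']" back into a Python list;
    # fall back to an empty list for blanks or malformed values.
    try:
        return ast.literal_eval(value) if value else []
    except (ValueError, SyntaxError):
        return []

df = pd.read_csv("the_df.csv", converters={"Department": parse_list_cell})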

5 Comments

Thank you for your response. Can I get an equivalent approach in PySpark?
I had a brain fart, missed the PySpark tag, and didn't process that you used collect().
It seems to work with collect()[0][0] if the array is not null. If the array is null, I get a "tuple index out of range" error. Any thoughts on how to fix this?
I want the code to fetch the array element and assign it to a variable as a list. Even if the array is empty, I need to get an empty list.
It has been a while since I used PySpark, so I don't want to offer specific snippets which may not work, but it seems to me like your issue may be best solved by filling null cells with [] before collecting.
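
A minimal sketch of that last suggestion (my own addition, assuming the schema from the other answer, where Department is array<string> and may contain nulls):

from pyspark.sql import functions as F

# Replace null Department values with an empty array<string> so that
# collect() always yields a Python list (possibly empty), never None.
df_filled = df.withColumn(
    "Department",
    F.coalesce(F.col("Department"), F.array().cast("array<string>"))
)

spanish = df_filled.filter(F.col("Language") == "Spanish").collect()[0]["Department"]
# spanish == []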
