How to extract array element from array column?

Question

I'm working with a dataset available here: https://www.kaggle.com/datasets/lehaknarnauli/spotify-datasets?select=artists.csv. I want to extract the first element of each array in the genres column. For example, given ['pop', 'rock'], I'd like to extract 'pop'. I tried several approaches, but none of them works and I don't know why.

Here is my code:

import pandas as pd

df = pd.read_csv('artists.csv')

# approach 1
df['top_genre'] = df['genres'].str[0]
# Error: 'str' object has no attribute 'str'

# approach 2
df = df.assign(top_genre = lambda x: df['genres'].str[0])
# The result is single bracket '[' in each row. Seems like index=0 refers to first character of a string, not first array element.

# approach 3
df['top_genre'] = df['genres'].apply(lambda x: '[]' if not x else x[0])
# The result is single bracket '[' in each row. Seems like index=0 refers to first character of a string, not first array element.

Why don't these approaches work, and how can I make this work?

N_Z · Accepted Answer · 2022-12-27 19:28:53Z

2

Another way to do it:

import json
df["top_genre"]=df["genres"].apply(lambda x: None if x == '[]' else json.loads(x)[0])
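One caveat, assuming the column stores Python-style list strings with single quotes such as "['pop', 'rock']" (as the example in the question suggests): json.loads only accepts double-quoted strings, so it raises json.JSONDecodeError on such values. Below is a minimal sketch of a helper that tries JSON first and falls back to ast.literal_eval. The name first_genre and the sample rows are made up for illustration.

```python
import json
from ast import literal_eval

import pandas as pd


def first_genre(raw):
    """Return the first item of a list stored as text, or None if it is empty."""
    try:
        items = json.loads(raw)        # handles '["pop", "rock"]'
    except json.JSONDecodeError:
        items = literal_eval(raw)      # handles "['pop', 'rock']"
    return items[0] if items else None


# Made-up sample values, not taken from the dataset.
df = pd.DataFrame({"genres": ["['pop', 'rock']", '["jazz"]', "[]"]})
df["top_genre"] = df["genres"].apply(first_genre)
print(df["top_genre"].tolist())
```

The fallback keeps the one-liner's behavior for valid JSON while also covering single-quoted lists.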

edited Dec 27, 2022 at 19:28

answered Dec 27, 2022 at 19:16

N_Z


Akshay Sehgal · Accepted Answer · 2022-12-27 19:05:47Z

1

Your genres column does not seem to contain actual lists, but strings that look like lists, such as "['a', 'b']". You need to evaluate each string to convert it back into a list object. eval would do that, but for safety reasons it's better to use ast.literal_eval.
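A quick way to check this diagnosis is to look at the type of a single cell. The sketch below uses a small in-memory frame whose values are made up to mimic what read_csv produces; it is not the real dataset.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical sample: after read_csv, each genres cell is a plain str.
df = pd.DataFrame({"genres": ["['pop', 'rock']", "[]"]})

first_cell = df["genres"].iloc[0]
print(type(first_cell))   # a str, not a list
print(first_cell[0])      # indexing a str gives its first character

# literal_eval only evaluates Python literals (lists, strings, numbers, ...),
# so unlike eval it does not run arbitrary code found in the data.
parsed = literal_eval(first_cell)
print(type(parsed))       # now a list
print(parsed[0])          # first list element
```

This is why approaches 2 and 3 in the question return '[': index 0 refers to the first character of the string.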

Using a converter while reading the dataset

One way is to apply a converter while loading the dataset, using the converters parameter of read_csv. The advantage of this method is that a single dictionary can hold multiple transformations and type casts, and the same dictionary can be applied to a large number of similar files if needed.

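from ast import literal_eval

import pandas as pd

# Parse each genres cell into a real list while reading the file
df = pd.read_csv('/path_do_data/artists.csv',
                 converters={'genres': literal_eval})
# First element of each list (NaN where the list is empty)
df['genres'].str[0]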

0                        NaN
1                        NaN
2                        NaN
3                        NaN
4                        NaN
                 ...        
1104344                  NaN
1104345    deep acoustic pop
1104346                  NaN
1104347                  NaN
1104348                  NaN
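To try the converters approach without downloading the dataset, the following sketch reads a few made-up rows from an in-memory CSV via io.StringIO. Only the column name genres comes from the question; the artist names and the layout are assumptions.

```python
import io
from ast import literal_eval

import pandas as pd

# Made-up rows standing in for artists.csv.
csv_text = '''name,genres
artist_a,"['pop', 'rock']"
artist_b,[]
artist_c,"['deep acoustic pop', 'mississippi indie']"
'''

df = pd.read_csv(io.StringIO(csv_text), converters={"genres": literal_eval})

# Each cell is now a real list, so .str[0] indexes into the list
# and yields NaN for empty lists.
df["top_genre"] = df["genres"].str[0]
print(df[["name", "top_genre"]])
```

Rows with an empty list come back as NaN, which matches the many NaN values in the output above.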

Using the apply method on a column

Another way to solve this is to load the file as usual and then convert the strings with literal_eval via apply. This needs extra lines of code to overwrite the existing columns, but it works just as well; it is just a bit redundant in my opinion.

from ast import literal_eval

import pandas as pd

df = pd.read_csv('/path_do_data/artists.csv')
# Overwrite the string column with parsed lists
df['genres'] = df['genres'].apply(literal_eval)
df['genres'].str[0]

0                        NaN
1                        NaN
2                        NaN
3                        NaN
4                        NaN
                 ...        
1104344                  NaN
1104345    deep acoustic pop
1104346                  NaN
1104347                  NaN

answered Dec 27, 2022 at 19:05

Akshay Sehgal


6 Comments

Akshay Sehgal Over a year ago

Glad to help, feel free to mark it if it was helpful!

mustafa00 Over a year ago

One more thing: even with this solution I still can't extract the first element easily. E.g. for deep acoustic pop, the element deep can't be retrieved using df['genres'].str[0]. I would have to use a more complex function to do this. How can I convert the genres column to a list of arrays?

Akshay Sehgal Over a year ago

"deep acoustic pop" is the first genre in the list of genres for that row, ['deep acoustic pop', 'mississippi indie'], so the code is working as expected. I double-checked. You can just use df['genres'].str[0] to get the first element / genre from the list of genres in each row.

Akshay Sehgal Over a year ago

do try and let me know if any issues.

Akshay Sehgal Over a year ago

You can simply chain the str methods like this: df['genres'].str[0].str.split().str[0]. Avoid using apply, as it's very slow compared to the vectorized str methods. This will work on the rows that have some text, but may fail on NaN values. So you may need to do the df['genres'].str[0] first, then fill the NaN values, and then try the df['genres'].str.split().str[0] again. I am afk right now, so give me some time to share a solution.
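As a quick check of the chained call, here is a sketch on a small made-up frame of already-parsed lists. In this case the missing values pass through the str methods as NaN rather than raising; that is an observation about this mixed column, not a guarantee for an all-NaN column or for every pandas version.

```python
import pandas as pd

# Made-up rows: parsed lists, including an empty one.
df = pd.DataFrame(
    {"genres": [["deep acoustic pop", "mississippi indie"], [], ["pop"]]}
)

first_genre = df["genres"].str[0]             # NaN for the empty list
first_word = first_genre.str.split().str[0]   # NaN is passed through
print(first_word.tolist())
```

If a placeholder is preferred over NaN, first_word.fillna(...) can be applied afterwards.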
