Assign int to strings in a column of lists in pandas

Question

I have a Pandas dataframe that contains a column with lists of strings.

>>> df.head()

   genre
0  [Comedy,  Supernatural,  Romance]
1  [Comedy,  Parody,  Romance]
2  [Comedy]
3  [Comedy,  Drama,  Romance,  Fantasy]
4  [Comedy,  Drama,  Romance]

How can I assign each distinct value in the lists a unique integer ID that stays the same across the whole column? For example:

>>> df.head()

   genre
0  [1,  2,  3]
1  [1,  4,  3]
2  [1]
3  [1,  5,  3,  6]
4  [1,  5,  3]

cs95 · Accepted Answer · 2020-11-15 22:27:32Z

3

The complication here is that we're dealing with a column of lists. We can improve performance a bit by exploding the column first, so that each list element gets its own row. Then apply pd.factorize to the exploded values and group them back into one list per original row:

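# One row per list element; the original row label repeats in the index.
v = df['genre'].explode()
# Integer codes in order of first appearance, shifted to start at 1.
v[:] = pd.factorize(v)[0] + 1
# Collect the codes back into one list per original row.
df['genre2'] = v.groupby(level=0).agg(list)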

df
                               genre        genre2
0    [Comedy, Supernatural, Romance]     [1, 2, 3]
1          [Comedy, Parody, Romance]     [1, 4, 3]
2                           [Comedy]           [1]
3  [Comedy, Drama, Romance, Fantasy]  [1, 5, 3, 6]
4           [Comedy, Drama, Romance]     [1, 5, 3]
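
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The sample frame is rebuilt by hand from the question's printout (assuming plain strings with no stray whitespace), and the names exploded, codes, uniques, ids, and genre_to_id are my own. Unlike the snippet above, it builds a new Series instead of assigning into v in place, which is a stylistic choice and not a required change.

```python
import pandas as pd

# Sample data reconstructed from the question (assumed: clean strings).
df = pd.DataFrame({
    "genre": [
        ["Comedy", "Supernatural", "Romance"],
        ["Comedy", "Parody", "Romance"],
        ["Comedy"],
        ["Comedy", "Drama", "Romance", "Fantasy"],
        ["Comedy", "Drama", "Romance"],
    ]
})

# One row per list element; the original row label repeats in the index.
exploded = df["genre"].explode()

# codes are 0-based integers in order of first appearance;
# uniques holds the distinct genres in that same order.
codes, uniques = pd.factorize(exploded)

# Shift to 1-based ids and keep the exploded index so rows can be regrouped.
ids = pd.Series(codes + 1, index=exploded.index)
df["genre2"] = ids.groupby(level=0).agg(list)

# Lookup table from genre name to id, derived from the same factorization.
genre_to_id = {genre: i for i, genre in enumerate(uniques, start=1)}

print(df)
print(genre_to_id)
```

One thing to watch for: as far as I can tell, an empty list explodes to NaN, which factorize codes as -1 by default, so such a row would come back as [0] here and not as [].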


Cainã Max Couto da Silva · Accepted Answer · 2020-11-15 23:47:56Z

2

First, build a dictionary that maps each unique genre to an ID:

uniq_genres = df.genre.explode().unique()
dict_genres = {genre: i for i, genre in enumerate(uniq_genres, start=1)}
print(dict_genres)
{'Comedy': 1, 'Supernatural': 2, 'Romance': 3, 'Parody': 4, 'Drama': 5, 'Fantasy': 6}

Then use that dictionary to map each genre to its ID:

df.assign(genre_id = df.genre.apply(lambda x: [dict_genres[genre] for genre in x]))

Output:

                               genre      genre_id
0    [Comedy, Supernatural, Romance]     [1, 2, 3]
1          [Comedy, Parody, Romance]     [1, 4, 3]
2                           [Comedy]           [1]
3  [Comedy, Drama, Romance, Fantasy]  [1, 5, 3, 6]
4           [Comedy, Drama, Romance]     [1, 5, 3]
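
Here is a self-contained sketch of the same dictionary approach, in case it helps. The sample frame is rebuilt by hand from the question, and id_to_genre and decoded are names I made up for illustration. It stores the result of assign back into df (assign returns a new DataFrame and leaves the original untouched) and inverts the dictionary to show that the IDs can be turned back into genre names.

```python
import pandas as pd

# Sample data reconstructed from the question (assumed: clean strings).
df = pd.DataFrame({
    "genre": [
        ["Comedy", "Supernatural", "Romance"],
        ["Comedy", "Parody", "Romance"],
        ["Comedy"],
        ["Comedy", "Drama", "Romance", "Fantasy"],
        ["Comedy", "Drama", "Romance"],
    ]
})

# Distinct genres in order of first appearance, numbered from 1.
uniq_genres = df["genre"].explode().unique()
dict_genres = {genre: i for i, genre in enumerate(uniq_genres, start=1)}

# assign returns a new frame, so rebind df to keep the new column.
df = df.assign(
    genre_id=df["genre"].apply(lambda x: [dict_genres[g] for g in x])
)

# Hypothetical reverse lookup: id back to genre name.
id_to_genre = {i: genre for genre, i in dict_genres.items()}
decoded = df["genre_id"].apply(lambda ids: [id_to_genre[i] for i in ids])

print(df)
```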


nosuchthingasmagic · Accepted Answer · 2020-11-15 22:43:24Z

You can set up a global dictionary that maps each value to its numerical id. If a value is already in the dictionary, reuse its id; if it isn't, increment the largest id so far and store the new value under it:

d = {} # Dictionary to assign numerical ids
maxV = 0 # Max numerical id in the dictionary

def assignId(x):
    global maxV  # d is only mutated, so it needs no global declaration
    lst = []
    for item in x:
        if item not in d:
            # Unseen value: increment the largest numerical id
            # and add it to the dictionary.
            maxV += 1
            d[item] = maxV
        # Get numerical id from the dictionary.
        lst.append(d[item])
    return lst

If I apply this to the df using:

df['genre_ids'] = df['genre'].apply(assignId)

I get:

                               genre     genre_ids
0    [Comedy, Supernatural, Romance]     [1, 2, 3]
1          [Comedy, Parody, Romance]     [1, 4, 3]
2                           [Comedy]           [1]
3  [Comedy, Drama, Romance, Fantasy]  [1, 5, 3, 6]
4           [Comedy, Drama, Romance]     [1, 5, 3]

with this dictionary d:

{'Comedy': 1,
 'Supernatural': 2,
 'Romance': 3,
 'Parody': 4,
 'Drama': 5,
 'Fantasy': 6}
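
If the module-level globals are a concern, one possible variation (my own sketch, not part of the answer above) keeps the same first-seen numbering but lets a defaultdict hand out the next id. The sample frame is rebuilt by hand from the question, and genre_ids_map is a hypothetical name.

```python
from collections import defaultdict
from itertools import count

import pandas as pd

# Sample data reconstructed from the question (assumed: clean strings).
df = pd.DataFrame({
    "genre": [
        ["Comedy", "Supernatural", "Romance"],
        ["Comedy", "Parody", "Romance"],
        ["Comedy"],
        ["Comedy", "Drama", "Romance", "Fantasy"],
        ["Comedy", "Drama", "Romance"],
    ]
})

# Looking up a missing key calls the factory, which returns the next
# integer from the counter: 1, 2, 3, ...
genre_ids_map = defaultdict(count(1).__next__)

# Rows are visited in order, so ids follow order of first appearance.
df["genre_ids"] = df["genre"].apply(
    lambda genres: [genre_ids_map[g] for g in genres]
)

print(df)
print(dict(genre_ids_map))
```

Because a defaultdict creates an entry on any lookup of a missing key, it is probably worth converting it with dict(genre_ids_map) once the ids are assigned, so that later lookups of unknown genres raise KeyError and do not silently mint new ids.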
