3

I have a Pandas dataframe that contains a column with lists of strings.

>>> df.head()

   genre
0  [Comedy,  Supernatural,  Romance]
1  [Comedy,  Parody,  Romance]
2  [Comedy]
3  [Comedy,  Drama,  Romance,  Fantasy]
4  [Comedy,  Drama,  Romance]

How can I assign each value in the lists a unique ID that is consistent across the whole column?

>>> df.head()

   genre
0  [1,  2,  3]
1  [1,  4,  3]
2  [1]
3  [1,  5,  3,  6]
4  [1,  5,  3]

3 Answers

3

The complication here is that we're dealing with a column of lists. We can improve performance a bit by exploding the column first, so that each list element gets its own row. Then use factorize to assign the IDs and group the rows back into the original list format:

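import pandas as pd

v = df['genre'].explode()                     # one row per list element
v[:] = pd.factorize(v)[0] + 1                 # 0-based codes -> 1-based IDs
df['genre2'] = v.groupby(level=0).agg(list)   # regroup by original row index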

df
                               genre        genre2
0    [Comedy, Supernatural, Romance]     [1, 2, 3]
1          [Comedy, Parody, Romance]     [1, 4, 3]
2                           [Comedy]           [1]
3  [Comedy, Drama, Romance, Fantasy]  [1, 5, 3, 6]
4           [Comedy, Drama, Romance]     [1, 5, 3]
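
To see why the `+ 1` is there, it may help to look at what `pd.factorize` returns on its own. The following is a minimal sketch on a flat list (the sample values are taken from the first two rows of the question; the variable names are only illustrative):

```python
import pandas as pd

# Flattened genres from the first two rows of the question's data.
flat = ["Comedy", "Supernatural", "Romance", "Comedy", "Parody", "Romance"]

# factorize returns integer codes plus the unique values,
# both in order of first appearance.
codes, uniques = pd.factorize(flat)

# Codes start at 0, so shift by one to get IDs that start at 1.
ids = codes + 1

print(ids.tolist())      # [1, 2, 3, 1, 4, 3]
print(list(uniques))     # ['Comedy', 'Supernatural', 'Romance', 'Parody']
```

Keeping `uniques` around gives you a way to look up which genre an ID refers to: ID `n` corresponds to `uniques[n - 1]`.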

2

Build a dictionary that maps each unique genre to an ID:

uniq_genres = df.genre.explode().unique()
dict_genres = {genre:i+1 for i,genre in enumerate(uniq_genres)}
print(dict_genres)
{'Comedy': 1, 'Supernatural': 2, 'Romance': 3, 'Parody': 4, 'Drama': 5, 'Fantasy': 6}

Then use that dictionary to map each genre to its ID:

df.assign(genre_id = df.genre.apply(lambda x: [dict_genres[genre] for genre in x]))

Output:

                               genre      genre_id
0    [Comedy, Supernatural, Romance]     [1, 2, 3]
1          [Comedy, Parody, Romance]     [1, 4, 3]
2                           [Comedy]           [1]
3  [Comedy, Drama, Romance, Fantasy]  [1, 5, 3, 6]
4           [Comedy, Drama, Romance]     [1, 5, 3]
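
One possible benefit of the dictionary approach is that the mapping is easy to invert. The sketch below rebuilds the question's sample data and decodes the IDs back to genre names as a sanity check; the names `genre_to_id`, `id_to_genre` and `decoded` are assumptions introduced for this illustration, not part of the answer above:

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({"genre": [
    ["Comedy", "Supernatural", "Romance"],
    ["Comedy", "Parody", "Romance"],
    ["Comedy"],
    ["Comedy", "Drama", "Romance", "Fantasy"],
    ["Comedy", "Drama", "Romance"],
]})

# Forward mapping: genre -> ID, numbered from 1 in order of first appearance.
uniq_genres = df["genre"].explode().unique()
genre_to_id = {genre: i for i, genre in enumerate(uniq_genres, start=1)}

# Inverse mapping: ID -> genre.
id_to_genre = {i: genre for genre, i in genre_to_id.items()}

# Encode every list, then decode it again.
df["genre_id"] = df["genre"].apply(lambda gs: [genre_to_id[g] for g in gs])
decoded = df["genre_id"].apply(lambda ids: [id_to_genre[i] for i in ids])

# The round trip should reproduce the original lists.
print(decoded.tolist() == df["genre"].tolist())
```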

0

You can set up a global dictionary to keep track of the IDs assigned so far. For each value, reuse its ID if it is already in the dictionary; otherwise increment the largest ID and assign that to the new value:

d = {} # Dictionary to assign numerical ids
maxV = 0 # Max numerical id in the dictionary

def assignId(x):
    lst = []
    global d, maxV
    for item in x:
        if item in d:
            # Get numerical id from the dictionary.
            lst.append(d[item])
        else:
            # Increment the largest numerical id in the dictionary
            # and add it to the dictionary.
            maxV += 1
            d[item] = maxV
            lst.append(maxV)
    return lst

If I apply this to the df using:

df['genre_ids'] = df['genre'].apply(assignId)

I get:

                               genre     genre_ids
0    [Comedy, Supernatural, Romance]     [1, 2, 3]
1          [Comedy, Parody, Romance]     [1, 4, 3]
2                           [Comedy]           [1]
3  [Comedy, Drama, Romance, Fantasy]  [1, 5, 3, 6]
4           [Comedy, Drama, Romance]     [1, 5, 3]

with this dictionary d:

{'Comedy': 1,
 'Supernatural': 2,
 'Romance': 3,
 'Parody': 4,
 'Drama': 5,
 'Fantasy': 6}
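
If you would rather avoid the `global` statement, the same idea can be sketched with a closure that owns its dictionary. This is only one possible variant; `make_id_assigner` and `assign` are hypothetical names introduced here, and plain lists stand in for the DataFrame column:

```python
def make_id_assigner():
    """Return an assigner function together with the dictionary it fills."""
    ids = {}

    def assign(items):
        # setdefault returns the existing ID, or stores and returns the
        # next one (current size + 1) when the item has not been seen yet.
        return [ids.setdefault(item, len(ids) + 1) for item in items]

    return assign, ids


assign, ids = make_id_assigner()

rows = [
    ["Comedy", "Supernatural", "Romance"],
    ["Comedy", "Parody", "Romance"],
    ["Comedy"],
]
encoded = [assign(row) for row in rows]

print(encoded)  # [[1, 2, 3], [1, 4, 3], [1]]
print(ids)      # {'Comedy': 1, 'Supernatural': 2, 'Romance': 3, 'Parody': 4}
```

With a DataFrame, the equivalent call would be `df['genre'].apply(assign)`. Each call to `make_id_assigner()` starts from an empty dictionary, so running the code twice does not carry over IDs from an earlier run.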
