Based on this post on Stack Overflow, I tried the value_counts function like this:

df2 = df1.join(df1.genres.str.split(",").apply(pd.value_counts).fillna(0))

and it works fine, except that my data has 22 unique genres, yet after the split I get 42 columns, which of course are not unique. Data example:

     Action  Adventure   Casual  Design & Illustration   Early Access    Education   Free to Play    Indie   Massively Multiplayer   Photo Editing   RPG     Racing  Simulation  Software Training   Sports  Strategy    Utilities   Video Production    Web Publishing Accounting  Action  Adventure   Animation & Modeling    Audio Production    Casual  Design & Illustration   Early Access    Education   Free to Play    Indie   Massively Multiplayer   Photo Editing   RPG Racing  Simulation  Software Training   Sports  Strategy    Utilities   Video Production    Web Publishing  nan
0   nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 1.0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan

(I have pasted the header and the first row only.)

I have a feeling that the problem is caused by my original data. My column (genres) held list-like strings that included the brackets,

for example [Action,Indie], so when Python read it, it treated [Action, Action and Action] as different values, and the output had 303 different values. So what I did was this:

new = []  # cleaned genre strings, one per row
for i in df1['genres'].tolist():
    if str(i) != 'nan':
        # drop the leading '[' and trailing ']'
        i = i[1:-1]
        new.append(i)
    else:
        new.append('nan')
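A small sketch of why the duplicates appear, using made-up strings (the sample values are assumptions, not rows from the real data): splitting on "," alone keeps any surrounding spaces, so the same genre can end up under several labels.

```python
# Made-up sample values; the real column contents may differ.
raw = ["[Action, Indie]", "[Indie,Action]", "[Action ]"]

labels = set()
for value in raw:
    # Drop the surrounding brackets, then split on commas only.
    for token in value[1:-1].split(","):
        labels.add(token)

# Whitespace variants are counted as different labels here.
print(sorted(labels))

# Stripping each token collapses the variants into the real genres.
cleaned = sorted({token.strip() for token in labels})
print(cleaned)
```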
  • You can try: if pd.notnull(i): Commented Dec 4, 2015 at 13:41
  • Can you show me your input data df1, 5 - 6 rows? Commented Dec 4, 2015 at 14:00
  • But I think you can use: print(df['genres'].str.get_dummies(sep=',')) Commented Dec 4, 2015 at 14:02
  • Ok, I have found the problem, but I am not sure how to solve it. My header data, meaning the genres, has issues with spaces: Action appears as [space]Action, Action, and Action[space]. Commented Dec 5, 2015 at 15:57
  • You can remove these spaces with the strip() function. Commented Dec 5, 2015 at 16:01

1 Answer

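You have to strip the leading [ and trailing ] from the genres column with str.strip, and then replace the spaces with an empty string using str.replace. Note that this also removes the spaces inside genre names, so Free to Play becomes FreetoPlay: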

import pandas as pd

df = pd.read_csv('test/Copy of AppCrawler.csv', sep="\t")


df['genres'] = df['genres'].str.strip('[]')
df['genres'] = df['genres'].str.replace(' ', '')

df = df.join(df.genres.str.split(",").apply(pd.value_counts).fillna(0))

# temporarily display up to 30 rows and 60 columns
with pd.option_context('display.max_rows', 30, 'display.max_columns', 60):
    print(df)
    # (output removed for clarity)
print(df.columns)
Index([u'Unnamed: 0', u'appid', u'currency', u'final_price', u'genres',
       u'initial_price', u'is_free', u'metacritic', u'release_date',
       u'Accounting', u'Action', u'Adventure', u'Animation&Modeling',
       u'AudioProduction', u'Casual', u'Design&Illustration', u'EarlyAccess',
       u'Education', u'FreetoPlay', u'Indie', u'MassivelyMultiplayer',
       u'PhotoEditing', u'RPG', u'Racing', u'Simulation', u'SoftwareTraining',
       u'Sports', u'Strategy', u'Utilities', u'VideoProduction',
       u'WebPublishing'],
      dtype='object')
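If the spaces inside names such as Free to Play should be kept, one possible variation is to trim only the spaces around the commas and then use str.get_dummies, which was suggested in the comments. This is a minimal sketch on made-up rows, not the real file; the sample values and the names cleaned, dummies and result are assumptions for illustration.

```python
import pandas as pd

# Made-up rows standing in for the real data.
df = pd.DataFrame({"genres": ["[Action, Indie]", "[Free to Play,Action ]", None]})

cleaned = (
    df["genres"]
    .str.strip("[]")                              # drop the outer brackets
    .str.strip()                                  # trim spaces at both ends
    .str.replace(r"\s*,\s*", ",", regex=True)     # trim spaces around commas only
)

# One 0/1 indicator column per genre; missing values give an all-zero row.
dummies = cleaned.str.get_dummies(sep=",")
result = df.join(dummies)
print(result)
```

Unlike replacing every space, this leaves multi-word genre names readable as column headers.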

2 Comments

Just what I needed! I don't understand what you are doing with the "with" statement. Couldn't you just print df?
Maybe In 19 explains it better.
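On the "with" question, a short sketch of what pd.option_context does: the display options are changed only inside the block and are restored afterwards, so a wide frame prints in full without altering the global settings. The variable names below are illustrative only.

```python
import pandas as pd

# Remember the current setting so it can be compared later.
default_rows = pd.get_option("display.max_rows")

with pd.option_context("display.max_rows", 30, "display.max_columns", 60):
    # Inside the block the temporary values are in effect.
    rows_inside = pd.get_option("display.max_rows")

# After the block the previous setting is back automatically.
rows_after = pd.get_option("display.max_rows")
print(default_rows, rows_inside, rows_after)
```

A plain print of df works as well; it simply uses whatever display limits are currently set, which may truncate the columns.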
