2

I have a dataframe that looks like this:

df=pd.read_csv('https://raw.githubusercontent.com/amanaroratc/hello-world/master/ask_git.csv')

    Channel_ID  Video_Category_Name score_pct
0   UC--bUZc5c9WseZNqGR6KLxA    Autos & Vehicles    0.213702
1   UC--bUZc5c9WseZNqGR6KLxA    Entertainment   0.786298
2   UC-B1L3oT81XgeeGh6S12qgQ    People & Blogs  1.000000
3   UC-N_7HFKrSsYxCSA_kfdRSA    People & Blogs  0.137261
4   UC-N_7HFKrSsYxCSA_kfdRSA    Pets & Animals  0.862739
... ... ... ...
819 UCzsNLZ9GrGXRjt0QmvWFm2Q    Entertainment   0.945243
820 UCzsNLZ9GrGXRjt0QmvWFm2Q    Film & Animation    0.002046
821 UCzsNLZ9GrGXRjt0QmvWFm2Q    Music   0.002797
822 UCzsNLZ9GrGXRjt0QmvWFm2Q    News & Politics 0.000433
823 UCzsNLZ9GrGXRjt0QmvWFm2Q    People & Blogs  0.000358

There are 15 distinct values in Video_Category_Name:

df.Video_Category_Name.unique()

gives

array(['Autos & Vehicles', 'Entertainment', 'People & Blogs',
       'Pets & Animals', 'Howto & Style', 'Education', 'Gaming', 'Music',
       'Comedy', 'Travel & Events', 'Science & Technology',
       'Nonprofits & Activism', 'Sports', 'Film & Animation',
       'News & Politics'], dtype=object)
 In [3]: iwantthis
  Out[3]:
     Channel_ID  Autos & Vehicles Entertainment People & Blogs ...
  0  UC--bUZc5c9WseZNqGR6KLxA  0.213702 0.786298 0 ...
  1  UC-B1L3oT81XgeeGh6S12qgQ  0        0        1.0000 ...

How do I create a column for each of these 15 and fill value from score_pct (0 if it does not exist)? Not sure how to use unstack/melt/pivot or something else

4
  • Would be really helpful if you provided input and desired output as code or in copy-pasteable form. You'd already have an answer from me if you didn't expect me to type in the data from your picture by hand. Commented Mar 4, 2022 at 9:18
  • You can check out this post for help with asking pandas questions. Commented Mar 4, 2022 at 9:20
  • @timgeb fixed, with link. Thanks for your input Commented Mar 4, 2022 at 9:32
  • That df has 824 rows. The post I linked to suggests 5. You still did not provide the output you want. Commented Mar 4, 2022 at 9:34

1 Answer 1

2

I think pivot() is the right function for your problem. It takes the categorial values of Video_Category_Name and creates new columns, which are filled with the value of score_pct. Non existing values are replaced by zero with `filna(0):

df = df.pivot(index='Channel_ID', columns='Video_Category_Name', values='score_pct').fillna(0).reset_index()

Output:

Video_Category_Name Channel_ID  Autos & Vehicles    Comedy  Education   Entertainment   Film & Animation    Gaming  Howto & Style   Music   News & Politics Nonprofits & Activism   People & Blogs  Pets & Animals  Science & Technology    Sports  Travel & Events
0   UC--bUZc5c9WseZNqGR6KLxA    0.213702    0.0 0.0 0.786298    0.0 0.0 0.0 0.0 0.0 0.0 0.000000    0.000000    0.0 0.0 0.0
1   UC-B1L3oT81XgeeGh6S12qgQ    0.000000    0.0 0.0 0.000000    0.0 0.0 0.0 0.0 0.0 0.0 1.000000    0.000000    0.0 0.0 0.0
2   UC-N_7HFKrSsYxCSA_kfdRSA    0.000000    0.0 0.0 0.000000    0.0 0.0 0.0 0.0 0.0 0.0 0.137261    0.862739    0.0 0.0 0.0
3   UC-T4JheeuNl2DVg-B-v7McA    0.000000    0.0 0.0 0.000000    0.0 0.0 1.0 0.0 0.0 0.0 0.000000    0.000000    0.0 0.0 0.0
4   UC-WG1VP4am6NaUtANEJxRQw    0.000000    0.0 0.0 0.000000    0.0 0.0 0.0 0.0 0.0 0.0 1.000000    0.000000    0.0 0.0 0.0

Edit 1: As the comment mentions this only works if Channel_ID is unique. If it is not (or to be safe) you can also include the index in the pivot operation. Afterwards restore the index again:

df = df.reset_index().pivot(index=['index', 'Channel_ID'], columns='Video_Category_Name', values='score_pct').fillna(0).reset_index(level=1)

Edit 2: The Video_Category_Name in the dataframe is only the label of the columns and should not change anything. However, you can easily remove it with this line:

df = df.rename_axis(None, axis=1) 

For the solution of Edit 1 you might also want to remove the index name, which can be done with the same operation and a different axis:

df = df.rename_axis(None, axis=0)
Sign up to request clarification or add additional context in comments.

4 Comments

Note that this will raise a ValueError if there are any index, columns combinations with multiple values. Does not happen for OP's sample data though.
@JANO, how do I remove that "Video_Category_Name" above the indices though?
@timgeb You`re right! I added a solution that uses the index as well.
@AmanArora I added a second edit with the line that removes "Video_Category_Name"

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.