
Now, my dataset looks like this:

tconst  Actor1  Actor2  Actor3  Actor4  Actor5  Actor6  Actor7  Actor8  Actor9  Actor10
0   tt0000001   NaN GreaterEuropean, WestEuropean, French   GreaterEuropean, British    NaN NaN NaN NaN NaN NaN NaN
1   tt0000002   NaN GreaterEuropean, WestEuropean, French   NaN NaN NaN NaN NaN NaN NaN NaN
2   tt0000003   NaN GreaterEuropean, WestEuropean, French   GreaterEuropean, WestEuropean, French   GreaterEuropean, WestEuropean, French   NaN NaN NaN NaN NaN NaN
3   tt0000004   NaN GreaterEuropean, WestEuropean, French   NaN NaN NaN NaN NaN NaN NaN NaN
4   tt0000005   NaN GreaterEuropean, British    GreaterEuropean, British    NaN NaN NaN NaN NaN NaN NaN

I used the replace and map functions to get here.


I want to create a dataframe from the one above so that the result looks like this:

tconst  GreaterEuropean   WestEuropean   French  GreaterEuropean   British    Arab    British   ............
tt0000001   2   1   0   4   1   0   2 .....
tt0000002   0   2   4   0   1   3   0 .....

The columns GreaterEuropean, British, WestEuropean, Italian, French, ... represent the number of actors of each ethnicity in the particular movie specified by tconst.

That would be a count matrix: for a movie such as tt00001 there might be 5 Arab, 2 British, 1 WestEuropean actors and so on, so that each row shows how many actors in that movie belong to each ethnicity. Link to data - https://drive.google.com/open?id=1oNfbTpmLA0imPieRxGfU_cBYVfWN3tZq

  • What have you tried so far? Please provide a minimal reproducible example. As an example, provide some code to create your dataframe (3 rows may suffice) and show what you want as output. Commented Jan 31, 2018 at 18:25
  • @jp_data_analysis I also added a link to the data, and I have been through various steps to reach this point. I am stuck on which pandas functionality I should use. Commented Jan 31, 2018 at 18:31
  • I'm afraid a picture doesn't help. Have you read: minimal reproducible example? Commented Jan 31, 2018 at 18:33
  • @jp_data_analysis I also added the data source in the link, which is basically 500 rows of the dataset; the original has over 6 million rows. I also added the resulting dataframe. Commented Jan 31, 2018 at 18:37

2 Answers

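import numpy as np
import pandas as pd

# Reshape from wide to long: one row per (movie, actor column), dropping empty slots.
# df.columns[2:] assumes the actor columns to keep start at position 2.
df_melted = pd.melt(df, id_vars = 'tconst', 
                    value_vars = df.columns[2:].tolist(), 
                    var_name = 'actor', 
                    value_name = 'ethnicities').dropna()

# One indicator column per ethnicity label, summed over the whole dataset.
print(df_melted.ethnicities.str.get_dummies(sep = ',').sum())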

Output:

 British               169
 EastAsian               9
 EastEuropean           17
 French                 73
 Germanic                9
 GreaterEastAsian       13
 Hispanic                9
 IndianSubContinent      2
 Italian                 7
 Japanese                4
 Jewish                 25
 Nordic                  7
 WestEuropean          105
Asian                   15
GreaterEuropean        316
dtype: int64
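
The leading spaces on some of the labels above come from the separator: the values are joined with ', ' but the split is on ','. A minimal sketch on a toy Series (made-up rows in the question's format, not the real data) shows the difference:

```python
import pandas as pd

# Toy values in the same "A, B, C" format as the question's Actor columns.
s = pd.Series(["GreaterEuropean, WestEuropean, French",
               "GreaterEuropean, British"])

# Splitting on "," keeps the space after each comma as part of the label.
with_spaces = s.str.get_dummies(sep=",").sum()

# Splitting on ", " gives clean labels.
clean = s.str.get_dummies(sep=", ").sum()

print(with_spaces.index.tolist())
print(clean.index.tolist())
```

This assumes every value uses exactly a comma followed by one space; if the spacing varies, stripping the labels after splitting is safer.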

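This is close to what you wanted, but not exact: these are totals over the whole dataset rather than counts per movie, and some labels carry a leading space because the values are separated by ', ' while the split is on ','. Getting the per-movie matrix without typing out the lists of columns or values is more complicated.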

From: https://stackoverflow.com/a/48120674/6672746

def change_column_order(df, col_name, index):
    cols = df.columns.tolist()
    cols.remove(col_name)
    cols.insert(index, col_name)
    return df[cols]

def split_df(dataframe, col_name, sep):
    orig_col_index = dataframe.columns.tolist().index(col_name)
    orig_index_name = dataframe.index.name
    orig_columns = dataframe.columns
    dataframe = dataframe.reset_index()  # we need a natural 0-based index for proper merge
    index_col_name = (set(dataframe.columns) - set(orig_columns)).pop()
    df_split = pd.DataFrame(
        pd.DataFrame(dataframe[col_name].str.split(sep).tolist())
        .stack().reset_index(level=1, drop=1), columns=[col_name])
    df = dataframe.drop(col_name, axis=1)
    df = pd.merge(df, df_split, left_index=True, right_index=True, how='inner')
    df = df.set_index(index_col_name)
    df.index.name = orig_index_name
    # merge adds the column to the last place, so we need to move it back
    return change_column_order(df, col_name, orig_col_index)

Using those excellent functions:

df_final = split_df(df_melted, 'ethnicities', ',')
df_final.set_index(['tconst', 'actor'], inplace = True)
df_final.pivot_table(index = ['tconst'], 
                     columns = 'ethnicities', 
                     aggfunc = pd.Series.count).fillna(0).astype('int')

Output:

ethnicities     British     EastAsian   EastEuropean    French  Germanic    GreaterEastAsian    Hispanic    IndianSubContinent  Italian     Japanese    Jewish  Nordic  WestEuropean    Asian   GreaterEuropean
tconst                                                          
tt0000001   1   0   0   1   0   0   0   0   0   0   0   0   1   0   2
tt0000002   0   0   0   1   0   0   0   0   0   0   0   0   1   0   1
tt0000003   0   0   0   3   0   0   0   0   0   0   0   0   3   0   3
tt0000004   0   0   0   1   0   0   0   0   0   0   0   0   1   0   1
tt0000005   2   0   0   0   0   0   0   0   0   0   0   0   0   0   2
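
For what it's worth, a shorter route to the same kind of per-movie matrix might be melt, split, explode and a grouped count. This is only a sketch under assumptions: it needs a pandas version that has DataFrame.explode (0.25 or later), the toy frame below is made up in the question's layout, and the names long_df, tidy and counts are mine.

```python
import numpy as np
import pandas as pd

# Made-up rows in the question's layout (tconst plus Actor columns).
df = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000005"],
    "Actor1": [np.nan, np.nan, np.nan],
    "Actor2": ["GreaterEuropean, WestEuropean, French",
               "GreaterEuropean, WestEuropean, French",
               "GreaterEuropean, British"],
    "Actor3": ["GreaterEuropean, British", np.nan, "GreaterEuropean, British"],
})

# Long format: one row per (movie, actor column), empty slots dropped.
long_df = (
    df.melt(id_vars="tconst", var_name="actor", value_name="ethnicities")
      .dropna(subset=["ethnicities"])
)

# One row per (movie, actor, ethnicity label); strip the space after each comma.
tidy = (
    long_df.assign(ethnicity=long_df["ethnicities"].str.split(","))
           .explode("ethnicity")
           .reset_index(drop=True)
)
tidy["ethnicity"] = tidy["ethnicity"].str.strip()

# Count labels per movie and spread them into columns, filling gaps with 0.
counts = tidy.groupby(["tconst", "ethnicity"]).size().unstack(fill_value=0)
print(counts)
```

Stripping after the split means the labels come out without leading spaces, and no column names or ethnicity values have to be typed out.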

1 Comment

Although I had already done it, Evan, I am really thankful that you tried it and shared this. Thanks.

Pandas has it all.

title_principals["all"] = (
    title_principals["Actor1"].astype(str) + ','
    + title_principals["Actor2"].astype(str) + ','
    + title_principals["Actor3"].astype(str) + ','
    + title_principals["Actor4"].astype(str) + ','
    + title_principals["Actor5"].astype(str) + ','
    + title_principals["Actor6"].astype(str) + ','
    + title_principals["Actor7"].astype(str) + ','
    + title_principals["Actor8"].astype(str) + ','
    + title_principals["Actor9"].astype(str) + ','
    + title_principals["Actor10"].astype(str)
)

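Then, from that string, count the occurrences in each row and drop the other variables.

# str.count gives one count per row; str.contains(...).sum() would assign the same dataset-wide total to every row.
title_principals["GreaterEuropean"] = title_principals["all"].str.count(r'GreaterEuropean')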
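
The same idea can probably be written without typing out each column name and with one count per movie. The following is a rough sketch on made-up rows: actor_cols, combined, labels and counts are names I am introducing, and the word-boundary pattern is my own addition so that a label such as Asian does not also match inside EastAsian.

```python
import re
import numpy as np
import pandas as pd

# Made-up rows in the question's layout.
title_principals = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "Actor1": [np.nan, "GreaterEastAsian, EastAsian"],
    "Actor2": ["GreaterEuropean, WestEuropean, French", "Asian"],
    "Actor3": ["GreaterEuropean, British", np.nan],
})

# Pick up every ActorN column by pattern instead of listing them.
actor_cols = title_principals.filter(regex=r"^Actor\d+$").columns

# Join the non-missing actor values of each row into one string.
combined = title_principals[actor_cols].apply(
    lambda row: ",".join(row.dropna()), axis=1
)

# Collect the distinct labels that actually occur in the data.
labels = sorted({lab.strip() for s in combined for lab in s.split(",") if lab.strip()})

# Per-row count of each label, matched as a whole word.
counts = pd.DataFrame(
    {lab: combined.str.count(r"\b" + re.escape(lab) + r"\b") for lab in labels}
)
counts.insert(0, "tconst", title_principals["tconst"])
print(counts)
```

Joining only the non-missing values also avoids the literal 'nan' strings that astype(str) would put into the combined column.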

1 Comment

I'm looking for a way to do this without typing all of the column names out... have not had a ton of luck so far.
