How can I add new columns using another dataframe (related to string columns) in Pandas

Question

The title is confusing, so let me explain. I have two dataframes like this:

The dataframe named df1 looks like this (the original has millions of rows):

id    text                              c1
1     Hello world how are you people    1
2     Hello people I am fine people     1
3     Good Morning people               -1
4     Good Evening                      -1

Dataframe named df2 looks like this:

Word      count         Points         Percentage
hello        2             2              100
world        1             1              100
how          1             1              100
are          1             1              100
you          1             1              100
people       3             1              33.33
I            1             1              100
am           1             1              100
fine         1             1              100
Good         2             -2            -100
Morning      1             -1            -100
Evening      1             -1            -100

Explanation of the df2 columns:

count is the total number of times the word appears in df1

Points is the score given to each word by some kind of algorithm

Percentage = Points / count * 100
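As a quick check of the Percentage formula, here is a minimal sketch that rebuilds df2 from the values in the table above and recomputes the column. The rounding to two decimals is an assumption, made only to match the 33.33 shown for "people".

```python
import pandas as pd

# Values copied from the df2 table in the question.
df2 = pd.DataFrame({
    "Word": ["hello", "world", "how", "are", "you", "people",
             "I", "am", "fine", "Good", "Morning", "Evening"],
    "count": [2, 1, 1, 1, 1, 3, 1, 1, 1, 2, 1, 1],
    "Points": [2, 1, 1, 1, 1, 1, 1, 1, 1, -2, -1, -1],
})

# Percentage = Points / count * 100 (rounded here only for display; an assumption)
df2["Percentage"] = (df2["Points"] / df2["count"] * 100).round(2)
print(df2)
```

For "people" this gives 1 / 3 * 100, which rounds to 33.33, and for "Good" it gives -2 / 2 * 100 = -100.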

Now I want to add 40 new columns to df1, based on the points and percentage. They will look like this:

perc_-90_2 perc_-80_2 perc_-70_2 perc_-60_2 perc_-50_2 perc_-40_2 perc_-20_2 perc_-10_2 perc_0_2 perc_10_2 perc_20_2 perc_30_2 perc_40_2 perc_50_2 perc_60_2 perc_70_2 perc_80_2 perc_90_2

perc_-90_1 perc_-80_1 perc_-70_1 perc_-60_1 perc_-50_1 perc_-40_1 perc_-20_1 perc_-10_1 perc_0_1 perc_10_1 perc_20_1 perc_30_1 perc_40_1 perc_50_1 perc_60_1 perc_70_1 perc_80_1 perc_90_1

Let me break it down. The column name contains 3 parts:

1.) perc is just a string and means nothing.

2.) A number in the range -90 to +90. For example, -90 means the percentage in df2 is -90. If a word has a percentage value in the range 81-90, then that row gets a value of 1 in the column named perc_-80_xx, where xx is the third part.

3.) The third part is the count. I want two types of count: 1 and 2. Continuing the example from point 2, if the word count is in the range 0 to 1, the value 1 goes in the perc_-80_1 column. If the word count is 2 or more, the value 1 goes in the perc_-80_2 column.

I hope this is not too confusing.
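The naming rule can be sketched as a small helper. This is only one reading of the description: `column_name` is a hypothetical function, and truncating the percentage toward zero to a multiple of 10 is an assumption (it maps 33.33 to 30 and -85 to -80).

```python
def column_name(percentage: float, count: int) -> str:
    """Build a 'perc_<bucket>_<count class>' name (hypothetical helper)."""
    # Assumption: truncate toward zero to a multiple of 10 (33.33 -> 30, -85 -> -80).
    bucket = int(percentage / 10) * 10
    # Count class: '1' for a count of 0 or 1, '2' for a count of 2 or more.
    count_class = "1" if count <= 1 else "2"
    return f"perc_{bucket}_{count_class}"


print(column_name(33.33, 3))  # perc_30_2
print(column_name(-85.0, 1))  # perc_-80_1
```

A row of df1 would then get a 1 in each column whose name is produced by one of its words, and 0 elsewhere.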

What is filled in the new columns, 0 and 1? Also, for 33.33 is the value perc_30_2?
– jezrael, Commented May 7, 2019 at 13:19

jezrael · Accepted Answer · 2019-05-07 15:27:07Z

Use:

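import numpy as np
import pandas as pd

# changed from the previous answer: keep id for matching
# df is not defined here; it is assumed to be the one-row-per-word
# DataFrame from that answer (columns id, Word, c1)
df2 = (df.drop_duplicates(['id','Word'])
         .groupby('Word', sort=False)
         .agg({'c1':['sum','size'], 'id':'first'})
         )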
df2.columns = df2.columns.map(''.join)
df2 = df2.reset_index()
df2 = df2.rename(columns={'c1sum':'Points','c1size':'Totalcount','idfirst':'id'})

df2['Percentage'] = df2['Points'] / df2['Totalcount'] * 100


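# percentage bucket: truncate to a multiple of 10 (33.33 -> 30)
s1 = df2['Percentage'].div(10).astype(int).mul(10).astype(str)
# count class: '1' for a count of 1, '2' otherwise
s2 = np.where(df2['Totalcount'] == 1, '1', '2')
# alternative if a count of 0 is possible:
#s2 = np.where(df2['Totalcount'].isin([0,1]), '1', '2')
# create the column name by joining the parts
df2['new'] = 'perc_' + s1 + '_' + s2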

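# create indicator DataFrame
# (.max(level=0) was removed in pandas 2.0; group by the index level instead)
df3 = (pd.get_dummies(df2[['id','new']].drop_duplicates().set_index('id'),
                      prefix='',
                      prefix_sep='',
                      dtype=int)
         .groupby(level=0).max())
print (df3)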

# reindex to add the missing columns
c = 'perc_' + pd.Series(np.arange(-100, 110, 10).astype(str)) + '_'
# Series.append was removed in pandas 2.0; use pd.concat instead
cols = pd.concat([c + '1', c + '2'])
# join to the original df1
df = df1.join(df3.reindex(columns=cols, fill_value=0), on='id')

print (df)
   id                            text  c1  perc_-100_1  perc_-90_1  \
0   1  Hello world how are you people   1            0           0   
1   2   Hello people I am fine people   1            0           0   
2   3             Good Morning people  -1            1           0   
3   4                    Good Evening  -1            1           0   

   perc_-80_1  perc_-70_1  perc_-60_1  perc_-50_1  perc_-40_1  ...  perc_10_2  \
0           0           0           0           0           0  ...          0   
1           0           0           0           0           0  ...          0   
2           0           0           0           0           0  ...          0   
3           0           0           0           0           0  ...          0   

   perc_20_2  perc_30_2  perc_40_2  perc_50_2  perc_60_2  perc_70_2  \
0          0          1          0          0          0          0   
1          0          0          0          0          0          0   
2          0          0          0          0          0          0   
3          0          0          0          0          0          0   

   perc_80_2  perc_90_2  perc_100_2  
0          0          0           1  
1          0          0           0  
2          0          0           0  
3          0          0           0  

[4 rows x 45 columns]
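The snippet above relies on a `df` from a previous answer that is not shown here. The following is a self-contained sketch of the same steps, under the assumption that `df` is df1 with `text` split on whitespace into one row per word. It uses named aggregation, `groupby(level=0).max()` and a plain list of column names in place of `max(level=0)` and `Series.append`, both of which were removed in pandas 2.0.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": ["Hello world how are you people",
             "Hello people I am fine people",
             "Good Morning people",
             "Good Evening"],
    "c1": [1, 1, -1, -1],
})

# Assumption: one row per (id, word), built by splitting text on whitespace.
df = df1.assign(Word=df1["text"].str.split()).explode("Word")

# Per word: sum of c1 (Points), number of distinct rows (Totalcount), first id.
df2 = (df.drop_duplicates(["id", "Word"])
         .groupby("Word", sort=False)
         .agg(Points=("c1", "sum"), Totalcount=("c1", "size"), id=("id", "first"))
         .reset_index())
df2["Percentage"] = df2["Points"] / df2["Totalcount"] * 100

# Column name: percentage truncated to a multiple of 10, plus the count class.
s1 = df2["Percentage"].div(10).astype(int).mul(10).astype(str)
s2 = pd.Series(np.where(df2["Totalcount"] <= 1, "1", "2"), index=df2.index)
df2["new"] = "perc_" + s1 + "_" + s2

# One indicator row per id: 1 if any word attributed to that id maps to the column.
df3 = pd.get_dummies(df2.set_index("id")["new"], dtype=int).groupby(level=0).max()

# Full set of expected columns, so that missing buckets are filled with 0.
buckets = [str(b) for b in range(-100, 110, 10)]
cols = [f"perc_{b}_{k}" for k in ("1", "2") for b in buckets]

out = df1.join(df3.reindex(columns=cols, fill_value=0), on="id")
print(out.shape)
```

As in the original snippet, each word is attributed only to the first id in which it appears (`'id': 'first'`), which is why the row with id 2 has 0 in perc_100_2 even though its text contains "Hello".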
