Confusing title, let me explain. I have 2 dataframes like this:

The dataframe named df1 looks like this (the original has millions of rows):

id    text                              c1
1     Hello world how are you people    1
2     Hello people I am fine people     1
3     Good Morning people               -1
4     Good Evening                      -1

Dataframe named df2 looks like this:

Word      count         Points         Percentage
hello        2             2              100
world        1             1              100
how          1             1              100
are          1             1              100
you          1             1              100
people       3             1              33.33
I            1             1              100
am           1             1              100
fine         1             1              100
Good         2             -2            -100
Morning      1             -1            -100
Evening      1             -1            -100

Explanation of the df2 columns:

count is the total number of times that word appeared in df1

points is the number of points given to each word by some kind of algorithm

percentage = points/count*100
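
As a quick check of the formula, here is a minimal sketch that recomputes the Percentage column for three rows copied from the df2 sample. The column names follow the sample table; nothing here says how Points itself is produced.

```python
import pandas as pd

# Three rows copied from the df2 sample above.
df2 = pd.DataFrame({
    "Word": ["hello", "people", "Good"],
    "count": [2, 3, 2],
    "Points": [2, 1, -2],
})

# percentage = points / count * 100
df2["Percentage"] = df2["Points"] / df2["count"] * 100
print(df2.round(2))
```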

Now I want to add 40 new columns to df1, based on the points and percentage. They will look like this:

perc_-90_2 perc_-80_2 perc_-70_2 perc_-60_2 perc_-50_2 perc_-40_2 perc_-20_2 perc_-10_2 perc_0_2 perc_10_2 perc_20_2 perc_30_2 perc_40_2 perc_50_2 perc_60_2 perc_70_2 perc_80_2 perc_90_2

perc_-90_1 perc_-80_1 perc_-70_1 perc_-60_1 perc_-50_1 perc_-40_1 perc_-20_1 perc_-10_1 perc_0_1 perc_10_1 perc_20_1 perc_30_1 perc_40_1 perc_50_1 perc_60_1 perc_70_1 perc_80_1 perc_90_1

Let me break it down. Each column name contains 3 parts:

1.) perc is just a string; it means nothing.

2.) A number in the range -90 to +90. For example, -90 means the percentage is -90 in df2. If a word has a percentage value in the range 81-90, then there will be a value of 1 in that row, in the column named perc_-80_xx. The xx is the third part.

3.) The third part is the count. Here I want two types of count: 1 and 2. Continuing the example from point 2, if the word count is in the range 0 to 1, then the value will be 1 in the perc_-80_1 column. If the word count is 2 or more, then the value will be 1 in the perc_-80_2 column.

I hope this is not too confusing.
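
The naming rule in the three points above can be sketched as a small function. This assumes the bucket is the percentage truncated toward zero to a multiple of 10, which is how 33.33 ends up in perc_30_2 according to the comments on this question. `column_name` is a hypothetical helper for illustration, not something from the question.

```python
import math


def column_name(percentage: float, count: int) -> str:
    """Hypothetical helper: map one word's (percentage, count) to a column name."""
    # Assumption: truncate toward zero to a multiple of 10 (33.33 -> 30, -85 -> -80).
    bucket = math.trunc(percentage / 10) * 10
    # Counts of 0 or 1 get the "_1" suffix; counts of 2 or more get "_2".
    suffix = "1" if count <= 1 else "2"
    return f"perc_{bucket}_{suffix}"


print(column_name(33.33, 3))
print(column_name(-85, 1))
```

If the buckets should round differently (for example, always toward the lower bound), only the `bucket` line would need to change.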

  • What is filled in the new columns? 0 and 1? Also, for 33.33, is the value in perc_30_2? Commented May 7, 2019 at 13:19
  • Yes, 0 and 1. Yes, the value 33.33 will be in perc_30_2. Commented May 7, 2019 at 14:29

1 Answer

Use:

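import numpy as np
import pandas as pd

# changed from the previous answer: id is added so the result can be matched to df1
# (df is assumed to be the one-row-per-word frame with columns id, Word and c1)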
df2 = (df.drop_duplicates(['id','Word'])
         .groupby('Word', sort=False)
         .agg({'c1':['sum','size'], 'id':'first'})
         )
df2.columns = df2.columns.map(''.join)
df2 = df2.reset_index()
df2 = df2.rename(columns={'c1sum':'Points','c1size':'Totalcount','idfirst':'id'})

df2['Percentage'] = df2['Points'] / df2['Totalcount'] * 100


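# truncate the percentage toward zero to a multiple of 10, e.g. 33.33 -> '30'
s1 = df2['Percentage'].div(10).astype(int).mul(10).astype(str)
# suffix '1' for words seen once, '2' otherwise
s2 = np.where(df2['Totalcount'] == 1, '1', '2')
#s2 = np.where(df2['Totalcount'].isin([0,1]), '1', '2')
# create the column name by joining the parts
df2['new'] = 'perc_' + s1 + '_' + s2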

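# create indicator DataFrame (0/1 per id and column name)
# groupby(level=0).max() replaces .max(level=0), which was removed in pandas 2.0
df3 = (pd.get_dummies(df2[['id','new']].drop_duplicates().set_index('id'),
                      prefix='',
                      prefix_sep='',
                      dtype=int)
         .groupby(level=0).max())
print (df3)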

# reindex to add the missing columns
c = 'perc_' + pd.Series(np.arange(-100, 110, 10).astype(str)) + '_'
# pd.concat replaces Series.append, which was removed in pandas 2.0
cols = pd.concat([c + '1', c + '2'], ignore_index=True)
# join to the original df1
df = df1.join(df3.reindex(columns=cols, fill_value=0), on='id')

print (df)
   id                            text  c1  perc_-100_1  perc_-90_1  \
0   1  Hello world how are you people   1            0           0   
1   2   Hello people I am fine people   1            0           0   
2   3             Good Morning people  -1            1           0   
3   4                    Good Evening  -1            1           0   

   perc_-80_1  perc_-70_1  perc_-60_1  perc_-50_1  perc_-40_1  ...  perc_10_2  \
0           0           0           0           0           0  ...          0   
1           0           0           0           0           0  ...          0   
2           0           0           0           0           0  ...          0   
3           0           0           0           0           0  ...          0   

   perc_20_2  perc_30_2  perc_40_2  perc_50_2  perc_60_2  perc_70_2  \
0          0          1          0          0          0          0   
1          0          0          0          0          0          0   
2          0          0          0          0          0          0   
3          0          0          0          0          0          0   

   perc_80_2  perc_90_2  perc_100_2  
0          0          0           1  
1          0          0           0  
2          0          0           0  
3          0          0           0  

[4 rows x 45 columns]
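
The code above starts from `df`, which comes from an earlier answer that is not reproduced here. One plausible way to build such a frame from df1, offered only as an assumption, is to split each text on whitespace and explode it into one row per word. The sketch below does that and then repeats the answer's aggregation (written with named aggregation) to confirm the per-word numbers.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": [
        "Hello world how are you people",
        "Hello people I am fine people",
        "Good Morning people",
        "Good Evening",
    ],
    "c1": [1, 1, -1, -1],
})

# Hypothetical reconstruction of `df`: one row per word, keeping id and c1.
df = (df1.assign(Word=df1["text"].str.split())
         .explode("Word")
         .drop(columns="text")
         .reset_index(drop=True))

# Same aggregation as in the answer: each word counts once per row (id).
check = (df.drop_duplicates(["id", "Word"])
           .groupby("Word", sort=False)
           .agg(Points=("c1", "sum"),
                Totalcount=("c1", "size"),
                id=("id", "first")))
print(check.loc[["Hello", "people", "Good"]])
```

Note that words are kept in their original case here, so "Hello" is not merged with a lowercase "hello"; add `.str.lower()` before splitting if case should be ignored.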