Combining columns of dataframe based on value in another column

Question

Input df (example)

Country     SubregionA      SubregionB
BRA         State of Acre   Brasiléia
BRA         State of Acre   Cruzeiro do Sul
USA         AL              Bibb County
USA         AL              Blount County
USA         AL              Bullock County

Output df

Country     SubregionA      SubregionB
BRA         State of Acre   State of Acre - Brasiléia
BRA         State of Acre   State of Acre - Cruzeiro do Sul
USA         AL              AL Bibb County
USA         AL              AL Blount County
USA         AL              AL Bullock County

The code snippet is fairly self-explanatory, but when executed it seems to run forever. What could be going wrong? (The dataframe 'data' is quite large, around 250K+ rows.)

for row in data.itertuples():
     region = data['Country']

     if region == 'ARG' :
          data['SubregionB'] = data[['SubregionA', 'SubregionB']].apply(lambda row: '-'.join(row.values.astype(str)), axis=1)
     elif region == 'BRA' :
          data['SubregionB'] = data[['SubregionA', 'SubregionB']].apply(lambda row: '-'.join(row.values.astype(str)), axis=1)
     elif region == 'USA':
          data['SubregionB'] = data[['SubregionA', 'SubregionB']].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
     else:
          pass

Explanation: I am trying to join columns SubregionA and SubregionB based on the value in the 'Country' column. The separators differ by country, so I wrote multiple if-else branches. This takes too long to execute; how can I make it faster?

Can you add a data sample, i.e. a minimal, complete, and verifiable example?
– jezrael, Commented Oct 30, 2020 at 10:01
@jezrael Actually, it's just two separators for now: '-' (hyphen) and ' ' (whitespace).
– RoshanShah22, Commented Oct 30, 2020 at 10:09
OK, is it possible to specify which regions use the separator '-' and which use ' '? Are some regions not processed?
– jezrael, Commented Oct 30, 2020 at 10:10
@jezrael Added an example, let me know if you need more info. Separators only need to be specified for some of the regions; no transformation should be done on the rest. Transformation is needed only for 'ARG', 'BRA', 'USA'.
– RoshanShah22, Commented Oct 30, 2020 at 10:17

jezrael · Accepted Answer · 2020-10-30 10:22:28Z

You can use numpy.select with Series.isin and join columns with +:

print (df)
  Country     SubregionA       SubregionB
0     BRA  State of Acre         Brasilia
1     BRA  State of Acre  Cruzeiro do Sul
2     USA             AL      Bibb County
3     USA             AL    Blount County
4     USA             AL   Bullock County
5     JAP            AAA             BBBB

import numpy as np

reg1 = ['ARG','BRA']  # countries joined with ' - '
reg2 = ['USA']        # countries joined with ' '

a = np.select([df['Country'].isin(reg1), df['Country'].isin(reg2)], 
              [df['SubregionA'] + ' - ' + df['SubregionB'],
               df['SubregionA'] + ' ' + df['SubregionB']],
              default=df['SubregionB'])

df['SubregionB'] = a
print (df)
  Country     SubregionA                       SubregionB
0     BRA  State of Acre         State of Acre - Brasilia
1     BRA  State of Acre  State of Acre - Cruzeiro do Sul
2     USA             AL                   AL Bibb County
3     USA             AL                 AL Blount County
4     USA             AL                AL Bullock County
5     JAP            AAA                             BBBB
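
Since the comments mention that there are only two separators "for now", a variant that keeps the separators in a lookup table may be easier to extend. The following is a minimal sketch, not part of the original answer: the `separators` dict and the `sep` and `mask` names are hypothetical, and the sample frame is rebuilt from the rows shown above. It maps each country to its separator with Series.map and rewrites only the rows whose country appears in the dict.

```python
import pandas as pd

# Sample data modeled on the rows shown in the question and answer above.
df = pd.DataFrame({
    "Country": ["BRA", "BRA", "USA", "USA", "JAP"],
    "SubregionA": ["State of Acre", "State of Acre", "AL", "AL", "AAA"],
    "SubregionB": ["Brasiléia", "Cruzeiro do Sul", "Bibb County",
                   "Blount County", "BBBB"],
})

# Hypothetical lookup table (an assumption): country code -> separator.
separators = {"ARG": " - ", "BRA": " - ", "USA": " "}

# Map every row's country to its separator; unlisted countries become NaN.
sep = df["Country"].map(separators)
mask = sep.notna()

# Rewrite only the rows with a known separator; all other rows are untouched.
df.loc[mask, "SubregionB"] = (
    df.loc[mask, "SubregionA"] + sep[mask] + df.loc[mask, "SubregionB"]
)

print(df)
```

Like the numpy.select version, this works on whole columns at once rather than looping over rows, so adding a country only means adding one entry to the dict.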
