Pandas: Concatenate multiple columns using another separator column and avoid extra separators for blank values

Question

I am trying to join multiple columns in pandas, where the separator for each row is defined in another column. The problem I am facing is how to avoid an extra separator next to cells that are blank.

My attempt and its output, which shows the problem, are given below:

import pandas as pd
df = pd.DataFrame({'col_1': ['', '1', '1', '2', '2', '3', '3', '4', '', '4', '5', '5', '5', '5', '5', '5'],
                   'col_2': ['A', '', 'C', 'A', '', 'C', 'D', 'D', 'A', 'A', 'B', 'E', 'F', 'G', 'H', 'I'],
                   'col_3': ['256', '546', '985', '573', '265', '731', '968', '592', '364', '', '953', '476', '835',
                             '', '572', '903'],
                   'col_4': ['.', '.', '.', '-', '_', '_', '-', '.', '.', '/', '/', '.', '_', '_', '-', '.']})

df['concatenated'] = df['col_1'] + df['col_4'] + df['col_2'] + df['col_4'] + df['col_3']
print(df)

The output I am getting is:

     col_1 col_2 col_3 col_4    concatenated
0            A   256     .       .A.256
1      1         546     .       1..546
2      1     C   985     .      1.C.985
3      2     A   573     -      2-A-573
4      2         265     _       2__265
5      3     C   731     _      3_C_731
6      3     D   968     -      3-D-968
7      4     D   592     .      4.D.592
8            A   364     .       .A.364
9      4     A           /         4/A/
10     5     B   953     /      5/B/953
11     5     E   476     .      5.E.476
12     5     F   835     _      5_F_835
13     5     G           _         5_G_
14     5     H   572     -      5-H-572
15     5     I   903     .      5.I.903

But the expected output is:

     col_1 col_2 col_3 col_4   concatenated
0            A   256     .      A.256
1      1         546     .      1.546
2      1     C   985     .      1.C.985
3      2     A   573     -      2-A-573
4      2         265     _      2_265
5      3     C   731     _      3_C_731
6      3     D   968     -      3-D-968
7      4     D   592     .      4.D.592
8            A   364     .      A.364
9      4     A           /      4/A
10     5     B   953     /      5/B/953
11     5     E   476     .      5.E.476
12     5     F   835     _      5_F_835
13     5     G           _      5_G
14     5     H   572     -      5-H-572
15     5     I   903     .      5.I.903

The actual data contains many more columns, but I need to join only selected columns.

Can anyone help me find a solution or point me in the right direction?

SeaBean · Accepted Answer · 2021-10-24 12:04:10Z

1

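You can use str.strip() to remove the extra separators at both ends and str.replace() to collapse repeated consecutive separators, as follows:

import re
sep = list(map(re.escape, df['col_4'].unique()))
sep_regex = '|'.join(sep)

# str.strip() takes a set of characters, not a regex, so pass the raw separators
sep_chars = ''.join(df['col_4'].unique())

# \1+ collapses any run of the same separator (two or more) into one
df['concatenated'] = (df['concatenated'].str.strip(sep_chars)
                                        .str.replace(fr'({sep_regex})\1+', r'\1', regex=True)
                     )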

Result:

print(df)

   col_1 col_2 col_3 col_4 concatenated
0            A   256     .        A.256
1      1         546     .        1.546
2      1     C   985     .      1.C.985
3      2     A   573     -      2-A-573
4      2         265     _        2_265
5      3     C   731     _      3_C_731
6      3     D   968     -      3-D-968
7      4     D   592     .      4.D.592
8            A   364     .        A.364
9      4     A           /          4/A
10     5     B   953     /      5/B/953
11     5     E   476     .      5.E.476
12     5     F   835     _      5_F_835
13     5     G           _          5_G
14     5     H   572     -      5-H-572
15     5     I   903     .      5.I.903

Explanation:

Here, we create a list of the unique symbols in col_4 and use re.escape to escape any of them that are regex meta-characters.

print(sep)

['\\.', '\\-', '_', '/']

To match these characters in str.replace(), we join the escaped separators with | (that is, "or") to form a regex of alternatives. Note that str.strip() treats its argument as a set of characters rather than as a regex.

print(sep_regex)

\.|\-|_|/

We used the regex back-reference \1 to detect repeated consecutive separators and replace them with a single occurrence of that separator.
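
A detail about the str.strip() step: Series.str.strip() treats its argument as a set of characters to remove, not as a regex. A minimal sketch with made-up values (the series `s` and `seps` are illustrative and not taken from the question) shows the difference between passing a regex-style string and passing the plain separator characters:

```python
import pandas as pd

# Illustrative values only; 'x|y|' is included to expose the difference.
s = pd.Series(['.A.256', '4/A/', 'x|y|'])

# A regex-style string is read character by character: '\', '.', '|' and '/'
# are all stripped, so the literal '|' at the end of 'x|y|' is removed too.
stripped = s.str.strip(r'\.|/')
print(stripped.tolist())

# Passing the raw separator characters strips only those characters.
seps = pd.Series(['.', '-', '_', '/'])
chars = ''.join(seps.unique())
plain = s.str.strip(chars)
print(plain.tolist())
```

With the separators in the question the two calls give the same result on the sample data; they differ only when a value starts or ends with a backslash or a pipe. Either way, stripping also removes a separator character that legitimately begins or ends a value.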

3 Comments

prem Over a year ago

SeaBean, thanks for your reply. I was testing your code with the actual data which has around 39k rows. Your code gets stuck on the strip and replace line. Is there any way I can share the actual csv file with you to find out the problem?

SeaBean Over a year ago

@prem Is there any error message you got ?

prem Over a year ago

My Bad, I mistakenly passed the wrong column in the separator list. Your code is working absolutely fine. Thanks a lot for your help.

Riley · Accepted Answer · 2021-10-24 11:58:46Z

0

Solution (assuming df does not yet have a concatenated column and the separator column is the last one):

df['concatenated'] = df.apply(lambda row: row.iloc[-1].join([x for x in row.iloc[:-1] if x != '']), axis=1)

This applies a function to each row: it takes the last element as the separator and calls str.join on all the other elements that are not equal to "".

1 Comment

prem Over a year ago

I have more columns in my actual data and I am joining only selective columns. This solution will not work for the selective columns. If somehow I could define the columns to concat and the column containing delimiter, that should work for me.
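
One way to handle the case raised in this comment is to name the columns to join and the separator column explicitly. The following is a minimal sketch, not part of the original answer: the helper name `join_selected` and the extra column `other` are hypothetical, and blanks are assumed to be empty strings as in the question.

```python
import pandas as pd

def join_selected(df, cols, sep_col):
    """Join the non-blank values of `cols` in each row, using that row's `sep_col` value."""
    return df.apply(
        lambda row: row[sep_col].join(v for v in row[cols] if v != ''),
        axis=1,
    )

df = pd.DataFrame({
    'col_1': ['', '1', '4'],
    'other': ['x', 'y', 'z'],   # hypothetical extra column that must be ignored
    'col_2': ['A', '', 'A'],
    'col_3': ['256', '546', ''],
    'col_4': ['.', '.', '/'],
})

df['concatenated'] = join_selected(df, ['col_1', 'col_2', 'col_3'], 'col_4')
print(df['concatenated'].tolist())  # ['A.256', '1.546', '4/A']
```

Because the columns are selected by label, their position in the frame no longer matters and the separator column does not have to be last.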

René · Accepted Answer · 2021-10-24 13:58:37Z

0

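This solution might work for you (it assumes the separator column is the last data column and is never blank):

df['concat'] = ''
for index, values in df.iterrows():
    # Drop blank cells (including the still-empty 'concat' cell);
    # the last remaining value is the separator from col_4.
    vals = [x for x in values if x != '']
    df.loc[index, 'concat'] = vals[-1].join(vals[:-1])
df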

Result:

   col_1 col_2 col_3 col_4   concat
0            A   256     .    A.256
1      1         546     .    1.546
2      1     C   985     .  1.C.985
3      2     A   573     -  2-A-573
4      2         265     _    2_265
5      3     C   731     _  3_C_731
6      3     D   968     -  3-D-968
7      4     D   592     .  4.D.592
8            A   364     .    A.364
9      4     A           /      4/A
10     5     B   953     /  5/B/953
11     5     E   476     .  5.E.476
12     5     F   835     _  5_F_835
13     5     G           _      5_G
14     5     H   572     -  5-H-572
15     5     I   903     .  5.I.903

1 Comment

prem Over a year ago

Rene, thanks for your reply. In my actual data there are several columns and I need to join only the selective columns so this solution will not work. If you can modify it for the required columns, it may help.
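
A loop of this kind can also be restricted to chosen columns. The sketch below is one possible adaptation rather than the answerer's code: it iterates over the separator column and the selected columns in parallel with zip instead of using iterrows. The names `cols` and `sep_col` and the extra `notes` column are hypothetical, and blanks are assumed to be empty strings.

```python
import pandas as pd

df = pd.DataFrame({
    'col_1': ['2', '4', '5'],
    'col_2': ['', 'A', 'G'],
    'notes': ['n1', 'n2', 'n3'],   # hypothetical extra column, left out of the join
    'col_3': ['265', '', ''],
    'col_4': ['_', '/', '_'],
})

cols = ['col_1', 'col_2', 'col_3']   # columns to join, in order
sep_col = 'col_4'                    # column holding each row's separator

# Each zipped tuple is (separator, value_1, value_2, ...); blanks are skipped.
df['concat'] = [
    sep.join(v for v in values if v != '')
    for sep, *values in zip(df[sep_col], *(df[c] for c in cols))
]
print(df['concat'].tolist())  # ['2_265', '4/A', '5_G']
```

Building the whole column in one list avoids the repeated df.loc assignments inside the loop, which tends to matter on larger frames.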
