I am trying to join multiple columns using pandas, where the separator for each row is stored in another column. The problem I am facing is how to avoid adding the separator for cells that are blank.

My attempt and its output, which shows the problem, are given below:

import pandas as pd
df = pd.DataFrame({'col_1': ['', '1', '1', '2', '2', '3', '3', '4', '', '4', '5', '5', '5', '5', '5', '5'],
                   'col_2': ['A', '', 'C', 'A', '', 'C', 'D', 'D', 'A', 'A', 'B', 'E', 'F', 'G', 'H', 'I'],
                   'col_3': ['256', '546', '985', '573', '265', '731', '968', '592', '364', '', '953', '476', '835',
                             '', '572', '903'],
                   'col_4': ['.', '.', '.', '-', '_', '_', '-', '.', '.', '/', '/', '.', '_', '_', '-', '.']})

df['concatenated'] = df['col_1'] + df['col_4'] + df['col_2'] + df['col_4'] + df['col_3']
print(df)

The output I am getting is:

     col_1 col_2 col_3 col_4    concatenated
0            A   256     .       .A.256
1      1         546     .       1..546
2      1     C   985     .      1.C.985
3      2     A   573     -      2-A-573
4      2         265     _       2__265
5      3     C   731     _      3_C_731
6      3     D   968     -      3-D-968
7      4     D   592     .      4.D.592
8            A   364     .       .A.364
9      4     A           /         4/A/
10     5     B   953     /      5/B/953
11     5     E   476     .      5.E.476
12     5     F   835     _      5_F_835
13     5     G           _         5_G_
14     5     H   572     -      5-H-572
15     5     I   903     .      5.I.903

But the expected output is:

     col_1 col_2 col_3 col_4   concatenated
0            A   256     .      A.256
1      1         546     .      1.546
2      1     C   985     .      1.C.985
3      2     A   573     -      2-A-573
4      2         265     _      2_265
5      3     C   731     _      3_C_731
6      3     D   968     -      3-D-968
7      4     D   592     .      4.D.592
8            A   364     .      A.364
9      4     A           /      4/A
10     5     B   953     /      5/B/953
11     5     E   476     .      5.E.476
12     5     F   835     _      5_F_835
13     5     G           _      5_G
14     5     H   572     -      5-H-572
15     5     I   903     .      5.I.903

The actual data contains many more columns, but I need to join only selected columns.

Can anyone help me find a solution or point me in the right direction?

3 Answers

You can use str.strip() to remove the extra separators at both ends and str.replace() to collapse repeated consecutive separators, as follows:

import re
sep = list(map(re.escape, df['col_4'].unique()))
sep_regex = '|'.join(sep)

# str.strip() takes a set of characters, not a regex, so pass the raw separators
sep_chars = ''.join(df['col_4'].unique())

df['concatenated'] = (df['concatenated'].str.strip(sep_chars)
                                        .str.replace(fr'({sep_regex})\1+', r'\1', regex=True)
                     )

Result:

print(df)

   col_1 col_2 col_3 col_4 concatenated
0            A   256     .        A.256
1      1         546     .        1.546
2      1     C   985     .      1.C.985
3      2     A   573     -      2-A-573
4      2         265     _        2_265
5      3     C   731     _      3_C_731
6      3     D   968     -      3-D-968
7      4     D   592     .      4.D.592
8            A   364     .        A.364
9      4     A           /          4/A
10     5     B   953     /      5/B/953
11     5     E   476     .      5.E.476
12     5     F   835     _      5_F_835
13     5     G           _          5_G
14     5     H   572     -      5-H-572
15     5     I   903     .      5.I.903

Explanation:

Here, we build a list of the unique symbols in col_4 and use re.escape to escape any of them that are regex metacharacters.

print(sep)

['\\.', '\\-', '_', '/']

To match any of these characters in str.replace(), we then join the escaped separators with | (regex "or") into a single pattern:

print(sep_regex)

\.|\-|_|/

We use the regex back-reference \1 to detect a separator that is immediately repeated and replace the repetition with a single occurrence of it.
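To make the back-reference step concrete, here is a small standalone sketch using plain re rather than pandas. The sample strings and the hard-coded separator list are made up for illustration, and the pattern uses \1+ (one or more repeats) on the assumption that runs longer than two separators can occur when more columns are joined.

```python
import re

# Hypothetical separators, mirroring the kinds of values found in col_4
sep_regex = '|'.join(map(re.escape, ['.', '-', '_', '/']))

# (sep)\1+ matches one separator followed by one or more copies of the SAME separator
pattern = re.compile(fr'({sep_regex})\1+')

samples = ['1..546', '2__265', '1.C.985', 'a-_b']
collapsed = [pattern.sub(r'\1', s) for s in samples]
print(collapsed)
```

Note that 'a-_b' is left unchanged: the back-reference only matches a repeat of the same separator, so two different separators next to each other are not collapsed.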

3 Comments

SeaBean, thanks for your reply. I was testing your code on the actual data, which has around 39k rows, and it gets stuck on the strip and replace line. Is there any way I can share the actual CSV file with you to find out the problem?
@prem Did you get any error message?
My bad, I mistakenly passed the wrong column to the separator list. Your code is working absolutely fine. Thanks a lot for your help.

A solution (assuming df does not yet have the concatenated column, so the separator is the last column):

df.apply(lambda row: row.iloc[-1].join([x for x in row.iloc[:-1] if x != '']), axis=1)

This applies a function to each row: it takes the last element as the separator and passes to str.join all the other elements that are not equal to ''. Using .iloc makes the positional access explicit, since row is a Series indexed by column name.
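If only some columns should be joined, the same row-wise idea can be adapted by naming the columns explicitly. The following is a minimal sketch, not a tested solution for the real data: the helper join_selected, its parameters value_cols and sep_col, and the extra column other are all hypothetical names introduced here for illustration.

```python
import pandas as pd

def join_selected(df, value_cols, sep_col):
    """Row by row, join the non-empty values of value_cols using the string in sep_col."""
    return df.apply(
        lambda row: row[sep_col].join(v for v in row[value_cols] if v != ''),
        axis=1,
    )

# Small made-up frame; 'other' stands in for a column that must not be joined
demo = pd.DataFrame({
    'col_1': ['', '1', '4'],
    'other': ['x', 'y', 'z'],
    'col_2': ['A', '', 'A'],
    'col_3': ['256', '546', ''],
    'col_4': ['.', '.', '/'],
})
demo['concatenated'] = join_selected(demo, ['col_1', 'col_2', 'col_3'], 'col_4')
print(demo['concatenated'].tolist())
```

Because blanks are filtered out before joining, no stray separators are produced and no regex clean-up is needed afterwards. The sketch assumes blanks are empty strings; NaN values would need to be filled or filtered first.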

1 Comment

I have more columns in my actual data and I am joining only selected columns, so this solution will not work as is. If I could somehow specify the columns to concatenate and the column containing the delimiter, that should work for me.

This solution might work for you:

df['concat'] = ''
for index, values in df.iterrows():
    # Drop blanks (this also drops the still-empty 'concat' cell);
    # the last remaining value is the separator from col_4
    vals = [x for x in values if x != '']
    df.loc[index, 'concat'] = vals[-1].join(vals[:-1])
df

Result:

   col_1 col_2 col_3 col_4   concat
0            A   256     .    A.256
1      1         546     .    1.546
2      1     C   985     .  1.C.985
3      2     A   573     -  2-A-573
4      2         265     _    2_265
5      3     C   731     _  3_C_731
6      3     D   968     -  3-D-968
7      4     D   592     .  4.D.592
8            A   364     .    A.364
9      4     A           /      4/A
10     5     B   953     /  5/B/953
11     5     E   476     .  5.E.476
12     5     F   835     _  5_F_835
13     5     G           _      5_G
14     5     H   572     -  5-H-572
15     5     I   903     .  5.I.903
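On a larger frame, writing to df.loc once per row can be slow. One possible alternative is to build the whole column in a single list comprehension over zipped columns, which also lets you choose which columns take part. This is only a sketch: value_cols and sep_col are hypothetical names, and the three rows are copied from the sample data above.

```python
import pandas as pd

demo = pd.DataFrame({
    'col_1': ['2', '5', '5'],
    'col_2': ['', 'G', 'H'],
    'col_3': ['265', '', '572'],
    'col_4': ['_', '_', '-'],
})

value_cols = ['col_1', 'col_2', 'col_3']  # columns to join (assumed names)
sep_col = 'col_4'                         # column holding the separator

# Each zipped tuple is (separator, value_1, value_2, ...)
rows = zip(demo[sep_col], *(demo[c] for c in value_cols))
demo['concat'] = [sep.join(v for v in vals if v != '') for sep, *vals in rows]
print(demo['concat'].tolist())
```

The column is assigned once at the end, so there is no need to pre-create it with empty strings.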

1 Comment

Rene, thanks for your reply. My actual data has several more columns and I need to join only selected columns, so this solution will not work as is. If you can modify it to use only the required columns, it may help.
