Conditional concatenation in dataframe

Question

I have three sets of codes (code 1, code 2 and code 3) containing alphanumeric values. Within each set the codes are separated by a delimiter (;), and the sets are related by position: for example, A123 in code 1 corresponds to A in code 2 and A445 in code 3, and so on. Code 3 contains some repeated codes.

My desired output is a concatenated "code 4", where code 1 and code 2 are concatenated position by position under either of these two conditions:

a) if the corresponding code in code 3 is not repeated, its position is used as-is

b) if the corresponding code in code 3 is repeated, only the position of its last occurrence in code 3 is used for concatenating code 1 and code 2 (for example, B678 R4: A445 appears twice in code 3, so its last occurrence, at the 4th position, is the one used)

Let me know if any logic can be used to get the output. Thanks in advance!

The Python script for the dataframe df11 is

import pandas as pd

df11 = pd.DataFrame({"code1": ["A123; A321; B478; B678; C567", "A321; A821; B448; B698; C577"], "code2": ["A; B5; N5; R4; H5", "A3; B; N; R7; H2"], "code3": ["A445; A323; A323; A445; A659", "A328; A328; A621; A442; A621"]}, index=[0, 1])

Desired output along with the input codes should look like this

Nk03 · Accepted Answer · 2021-06-05 10:31:47Z


STEPS:

Use applymap to split each value into a list.
Explode the dataframe so that each code gets its own row.
Strip off any extra spaces.
Drop the duplicates based on the code3 column, keeping the last occurrence.
Drop the code3 column and join code1 and code2 row-wise.
Finally, aggregate the rows back using groupby on the original index to get the desired output.
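To make the first three steps concrete, the sketch below prints the intermediate exploded frame. It is an illustration rather than part of the original answer: it assumes pandas 1.3 or later, where DataFrame.explode accepts a list of columns, and the name exploded is introduced here only for the example.

```python
import pandas as pd

df11 = pd.DataFrame(
    {
        "code1": ["A123; A321; B478; B678; C567", "A321; A821; B448; B698; C577"],
        "code2": ["A; B5; N5; R4; H5", "A3; B; N; R7; H2"],
        "code3": ["A445; A323; A323; A445; A659", "A328; A328; A621; A442; A621"],
    }
)

# Split every cell on ';', explode all three columns together,
# then strip the leftover spaces around each code.
exploded = (
    df11.apply(lambda s: s.str.split(";"))
    .explode(["code1", "code2", "code3"])  # assumes pandas >= 1.3
    .apply(lambda s: s.str.strip())
)
print(exploded)
```

Each original row becomes five rows that share the same index label, which is what lets the final groupby(level=0) put them back together.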

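df2 = (
    df11.assign(
        # applymap is deprecated since pandas 2.1; DataFrame.map is its replacement
        desired_output=df11.applymap(lambda x: x.split(';'))
        .apply(pd.Series.explode)
        .applymap(str.strip)
        .drop_duplicates(subset='code3', keep='last')
        .drop(columns='code3')    # positional axis (.drop('code3', 1)) fails in pandas 2.0+
        .apply(' '.join, axis=1)  # join code1 and code2 row-wise
        .groupby(level=0)
        .agg('; '.join)
    )
)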

UPDATED ANSWER:

df2 = (
    df11.assign(
        desired_output=
        df11.apply(lambda s: s.str.split('; ').explode().str.strip())
        .drop_duplicates(subset='code3', keep='last')
        .drop(columns='code3')    # positional axis (.drop('code3', 1)) fails in pandas 2.0+
        .apply(' '.join, axis=1)  # join code1 and code2 row-wise
        .groupby(level=0)
        .agg('; '.join)
    )
)

OUTPUT:

                          code1              code2  \
0  A123; A321; B478; B678; C567  A; B5; N5; R4; H5   
1  A321; A821; B448; B698; C577   A3; B; N; R7; H2   

                          code3             desired_output  
0  A445; A323; A323; A445; A659  B478 N5; B678 R4; C567 H5  
1  A328; A328; A621; A442; A621   A821 B; B698 R7; C577 H2
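One caveat worth noting: drop_duplicates(subset='code3') compares values across the whole exploded frame, not within each original row. In the sample data no code3 value appears in more than one row, so the result is correct, but if the same code3 value occurred in two rows, only the very last one would survive. The sketch below is one possible way to restrict the deduplication to each row. The input data and the names row, pair and desired are made up for illustration, and the multi-column explode assumes pandas 1.3 or later.

```python
import pandas as pd

# Hypothetical input: "A445" appears in both rows, so a frame-wide
# drop_duplicates on code3 alone would discard row 0's entry for it.
df = pd.DataFrame(
    {
        "code1": ["A123; A321; B478", "A321; A821; B448"],
        "code2": ["A; B5; N5", "A3; B; N"],
        "code3": ["A445; A323; A323", "A445; A445; A621"],
    }
)

exploded = (
    df.apply(lambda s: s.str.split(";"))
    .explode(["code1", "code2", "code3"])  # assumes pandas >= 1.3
    .apply(lambda s: s.str.strip())
    .rename_axis("row")  # name the index so it can serve as a key
    .reset_index()
)

desired = (
    # Deduplicate on (row, code3) so each original row is handled separately.
    exploded.drop_duplicates(subset=["row", "code3"], keep="last")
    .assign(pair=lambda d: d["code1"] + " " + d["code2"])
    .groupby("row")["pair"]
    .agg("; ".join)
)

df2 = df.assign(desired_output=desired)
print(df2["desired_output"].tolist())
# ['A123 A; B478 N5', 'A821 B; B448 N']
```

Keeping the original row label as an explicit column makes the grouping key visible at every step, at the cost of one extra reset_index.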

edited Jun 5, 2021 at 10:31

answered Jun 5, 2021 at 5:48


4 Comments

blackraven Over a year ago

wow! I would need many more years of practice to achieve such expert manipulation!

Shubham Sharma Over a year ago

@Nk03 Good answer, although you can substantially reduce the number of apply/applymap steps. For example, you could split the strings in a vectorized manner using df.apply(lambda s: s.str.split('; ').explode())

Nk03 Over a year ago

@ShubhamSharma I was not aware of the fact that we can use explode inside apply. Thanks !! :)

Arun Menon Over a year ago

@Nk03 I have an issue with the drop_duplicates command while solving a similar problem. I have created a new thread and linked it (see the Linked section).

blackraven · Accepted Answer · 2021-06-05 06:34:55Z

I have done a few manipulations:

(1) Use a regular expression to extract the items into a list, and reverse the list order.

(2) Find the index of each unique item in 'code3'. Because the lists are reversed, the first match found is the last occurrence in the original order.

(3) Concatenate the corresponding values in 'code1' and 'code2' based on those indexes.

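import re

import pandas as pd

df = pd.DataFrame({"code1": ["A123; A321; B478; B678; C567", "A321; A821; B448; B698; C577"], "code2": ["A; B5; N5; R4; H5", "A3; B; N; R7; H2"], "code3": ["A445; A323; A323; A445; A659", "A328; A328; A621; A442; A621"]}, index=[0, 1])
# Split each cell into a list of codes and reverse it, so that list.index()
# below finds what was the last occurrence in the original order.
for col in df.columns:
    df[col] = df[col].apply(lambda x: re.findall(r'\w+', x)[::-1])

# Position (in the reversed list) of each distinct code3 value.
# Note: set() iteration order is arbitrary, so the order within idx can vary.
df['idx'] = df['code3'].apply(lambda x: [x.index(e) for e in set(x)])
df['code4'] = df.apply(lambda row: [row['code1'][i] + ' ' + row['code2'][i] for i in row['idx']], axis=1)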

Output df

    code1                           code2               code3                           idx         code4
0   [C567, B678, B478, A321, A123]  [H5, R4, N5, B5, A] [A659, A445, A323, A323, A445]  [0, 2, 1]   [C567 H5, B478 N5, B678 R4]
1   [C577, B698, B448, A821, A321]  [H2, R7, N, B, A3]  [A621, A442, A621, A328, A328]  [0, 3, 1]   [C577 H2, A821 B, B698 R7]
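If a deterministic, left-to-right order and a delimited string (rather than a list) are wanted, the same last-occurrence idea can be sketched in plain Python. The function name concat_last_occurrence and its sep parameter are hypothetical, introduced only for this example. It relies on the fact that later dictionary assignments overwrite earlier ones.

```python
def concat_last_occurrence(code1: str, code2: str, code3: str, sep: str = ";") -> str:
    """Join code1/code2 pairs at the last position of each distinct code3 value."""
    c1 = [part.strip() for part in code1.split(sep)]
    c2 = [part.strip() for part in code2.split(sep)]
    c3 = [part.strip() for part in code3.split(sep)]

    # Later positions overwrite earlier ones, so each key ends up
    # mapped to the index of its last occurrence.
    last_pos = {value: i for i, value in enumerate(c3)}

    # Sort the surviving positions to restore the original left-to-right order.
    return "; ".join(f"{c1[i]} {c2[i]}" for i in sorted(last_pos.values()))


print(concat_last_occurrence(
    "A123; A321; B478; B678; C567",
    "A; B5; N5; R4; H5",
    "A445; A323; A323; A445; A659",
))
# B478 N5; B678 R4; C567 H5
```

On the original, unreversed dataframe this could be applied row by row, for instance with df.apply(lambda r: concat_last_occurrence(r['code1'], r['code2'], r['code3']), axis=1).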
