2

I'm new to Python so I'm sorry if terminology is not correct; I've searched for similar posts but didn't find anything helpful for my case. I have a dataframe like this:

    Column1     Column2
0   0001        [('A','B'),('C','D'),('E','F')]
1   0001        [('A','B'),('C','D'),('E','F')]
2   0001        [('A','B'),('C','D'),('E','F')]
3   0002        [('G','H'),('I','J')]
4   0002        [('G','H'),('I','J')]

Each row is replicated n times based on the number of tuples contained in the list of Column2. What I'd like to do is to add a new column containing only one tuple per row:

Column1     Column2                             Column2_new
0   0001        [('A','B'),('C','D'),('E','F')]     'A' 'B'
1   0001        [('A','B'),('C','D'),('E','F')]     'C' 'D'
2   0001        [('A','B'),('C','D'),('E','F')]     'E' 'F'
3   0002        [('G','H'),('I','J')]               'G' 'H'
4   0002        [('G','H'),('I','J')]               'I' 'J'

Can you please help me with this?

Thanks in advance for any suggestion

1

2 Answers 2

2

We can do df.lookup after groupby+cumcount

idx = df.groupby('Column1').cumcount()
df['new']= pd.DataFrame(df['Column2'].tolist()).lookup(df.index,idx)

print(df)
   Column1                   Column2     new
0        1  [(A, B), (C, D), (E, F)]  (A, B)
1        1  [(A, B), (C, D), (E, F)]  (C, D)
2        1  [(A, B), (C, D), (E, F)]  (E, F)
3        2          [(G, H), (I, J)]  (G, H)
4        2          [(G, H), (I, J)]  (I, J)
Sign up to request clarification or add additional context in comments.

Comments

1
data = {'Column1':["0001"]*3 + ["0002"]*2,
        'Column2':[[('A','B'),('C','D'),('E','F')]]*3 + [[('G','H'),('I','J')]]*2
       }
df = pd.DataFrame(data)

print(df)


    Column1       Column2
0   0001    [(A, B), (C, D), (E, F)]
1   0001    [(A, B), (C, D), (E, F)]
2   0001    [(A, B), (C, D), (E, F)]
3   0002    [(G, H), (I, J)]
4   0002    [(G, H), (I, J)]

M = df.drop_duplicates('Column1')
print(M)

    Column1     Column2
0   0001    [(A, B), (C, D), (E, F)]
3   0002    [(G, H), (I, J)]

pd.concat([df,M.Column2.explode().reset_index(drop=True).rename('new')],axis=1)    

  Column1       Column2                   new
0   0001    [(A, B), (C, D), (E, F)]    (A, B)
1   0001    [(A, B), (C, D), (E, F)]    (C, D)
2   0001    [(A, B), (C, D), (E, F)]    (E, F)
3   0002    [(G, H), (I, J)]            (G, H)
4   0002    [(G, H), (I, J)]            (I, J)

Alternatively, you could use the itertools functions - product and chain to get ur data, and concat back to the original dataframe:

from itertools import product,chain
res = chain.from_iterable(product([first],last)
                          for first, last 
                          in zip(M.Column1, M.Column2))
out = pd.DataFrame(res,columns=['Column1','new'])
pd.concat((df,out.new),axis=1)

1 Comment

Thank you very much!!! :-) .explode() is exactly what I needed, it worked perfectly.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.