I would like to merge two columns of unequal length into one pandas DataFrame.

I've tried many approaches with merge, concat and join, but none of them works.

keyList = ["Clone", "Chain", "Fragment", "R0", "R1", "R2"]
dataDict = {key: [] for key in keyList}
# Example for different list length
plist1 = ["ABCD", "DJFZ", "DHRZ"]
plist2 = ["ABCD", "DJFZ", "DHRZ", "JGJZ"]

filelist = ["E2_VH_Fab_R0.fasta", "E2_VH_scFV_R0.fasta", "E2_VH_Fab_R1.fasta", "E2_VH_scFV_R1.fasta","E2_VH_Fab_R2.fasta" ]

# Subsets are:
# E1 || E2 with VH || VL with Fab || scFV with R0 || R1 || R2 

for datafile in filelist:
    # Get List with emits from class function
    peptidelist = clseq.processEmits()
    # Split filename into its 4 parameters: clone, chain, fragment, round
    sclone, schain, sfragment, sround = datafile.split('.')[0].split('_')

    # Iterate through peptide list and add the subsets into the dict
    for peptide in peptidelist:
        dataDict.setdefault("Clone", []).append(sclone)
        dataDict.setdefault("Chain", []).append(schain)
        dataDict.setdefault("Fragment", []).append(sfragment)
        # Set other Rounds as "NaN" to equal the length
        if "R0" in sround:
            dataDict.setdefault("R0", []).append(peptide)
            dataDict.setdefault("R1", []).append("NaN")
            dataDict.setdefault("R2", []).append("NaN")
        elif "R1" in sround:
            dataDict.setdefault("R0", []).append("NaN")
            dataDict.setdefault("R1", []).append(peptide)
            dataDict.setdefault("R2", []).append("NaN")
        elif "R2" in sround:
            dataDict.setdefault("R0", []).append("NaN")
            dataDict.setdefault("R1", []).append("NaN")
            dataDict.setdefault("R2", []).append(peptide)
        else:
            dataDict.setdefault("R0", []).append("NaN")
            dataDict.setdefault("R1", []).append("NaN")
            dataDict.setdefault("R2", []).append("NaN")

    dtframe.merge(pd.DataFrame(dataDict), on=['Clone', 'Chain',  'Fragment'], how='inner')

The problem is that I have lists of different lengths which I would like to merge into one DataFrame, padding the shorter ones with NaN.

This:

0    E2    VH      Fab  r0  nan
1    E2    VH      Fab  r0  nan
2    E2    VH      Fab  r0  nan
3    E2    VH      Fab  r0  nan
4    E2    VH      Fab  r0  nan
5    E2    VH      Fab  r0  nan

and this:

0    E2    VH      Fab  nan  r1
1    E2    VH      Fab  nan  r1
2    E2    VH      Fab  nan  r1
3    E2    VH      Fab  nan  r1
4    E2    VH      Fab  nan  r1
5    E2    VH      Fab  nan  r1
6    E2    VH      Fab  nan  r1
7    E2    VH      Fab  nan  r1

Should result in this:

0    E2    VH      Fab  r0   r1
1    E2    VH      Fab  r0   r1
2    E2    VH      Fab  r0   r1
3    E2    VH      Fab  r0   r1
4    E2    VH      Fab  r0   r1
5    E2    VH      Fab  r0   r1
6    E2    VH      Fab  nan  r1
7    E2    VH      Fab  nan  r1

Note that all of my data fields are strings.

  • Do you want pd.concat([df1, df2[~df2.index.isin(df1.index)]])? Commented Oct 11, 2019 at 21:18
  • Idk, a lot of this seems like it could be avoided if you better handled the data in the if-elif clauses Commented Oct 11, 2019 at 21:22
  • @Erfan This doesn't work, it gives me an empty table Commented Oct 11, 2019 at 22:06

1 Answer

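This is a job for combine_first. Set the index to the three columns you want to merge on, plus an additional level built with groupby(...).cumcount() that numbers the rows within each group. That extra level keeps the index unique so the rows line up one-to-one, which matters for real data with many different groups.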

df1['idx'] = df1.groupby(['Clone', 'Chain', 'Fragment']).cumcount()
df2['idx'] = df2.groupby(['Clone', 'Chain', 'Fragment']).cumcount()

df1 = df1.set_index(['Clone', 'Chain', 'Fragment', 'idx'])
df2 = df2.set_index(['Clone', 'Chain', 'Fragment', 'idx'])

df1.combine_first(df2).reset_index()
#  Clone Chain Fragment  idx   R0  R1
#0    E2    VH      Fab    0   r0  r1
#1    E2    VH      Fab    1   r0  r1
#2    E2    VH      Fab    2   r0  r1
#3    E2    VH      Fab    3   r0  r1
#4    E2    VH      Fab    4   r0  r1
#5    E2    VH      Fab    5   r0  r1
#6    E2    VH      Fab    6  NaN  r1
#7    E2    VH      Fab    7  NaN  r1

df1

  Clone Chain Fragment  R0  R1
0    E2    VH      Fab  r0 NaN
1    E2    VH      Fab  r0 NaN
2    E2    VH      Fab  r0 NaN
3    E2    VH      Fab  r0 NaN
4    E2    VH      Fab  r0 NaN
5    E2    VH      Fab  r0 NaN

df2

  Clone Chain Fragment  R0  R1
0    E2    VH      Fab NaN  r1
1    E2    VH      Fab NaN  r1
2    E2    VH      Fab NaN  r1
3    E2    VH      Fab NaN  r1
4    E2    VH      Fab NaN  r1
5    E2    VH      Fab NaN  r1
6    E2    VH      Fab NaN  r1
7    E2    VH      Fab NaN  r1
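
To try the approach end to end, here is a minimal, self-contained sketch. The frame construction and the helper name with_position are assumptions made for illustration; they simply rebuild the 6-row df1 and 8-row df2 shown above.

```python
import numpy as np
import pandas as pd

keys = ["Clone", "Chain", "Fragment"]

# Hypothetical sample frames mirroring df1 (6 rows) and df2 (8 rows) above.
df1 = pd.DataFrame({"Clone": "E2", "Chain": "VH", "Fragment": "Fab",
                    "R0": ["r0"] * 6, "R1": np.nan})
df2 = pd.DataFrame({"Clone": "E2", "Chain": "VH", "Fragment": "Fab",
                    "R0": np.nan, "R1": ["r1"] * 8})


def with_position(df):
    """Index df by the key columns plus a per-group row counter."""
    # cumcount numbers the rows inside each key group: 0, 1, 2, ...
    idx = df.groupby(keys).cumcount()
    return df.assign(idx=idx).set_index(keys + ["idx"])


# combine_first keeps values from the first frame and fills its missing
# entries (and missing rows) from the second one.
merged = with_position(df1).combine_first(with_position(df2)).reset_index()
print(merged)
```

Because the helper returns a new frame instead of overwriting df1 and df2, the snippet can be re-run without first restoring the original columns.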

4 Comments

Thank you, your code example looks good, but it doesn't work. I got: KeyError: "None of ['Clone', 'Chain', 'Fragment'] are in the columns"
Error is: KeyError: "None of ['Clone', 'Chain', 'Fragment'] are in the columns"
What are your column names?
I have to create an empty DataFrame up front (before the for loop) to push the values into: keyList = ["Clone", "Chain", "Fragment", "R0", "R1", "R2"]; pd.DataFrame(columns=keyList)
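
For the loop setup described in this thread, one possible alternative to starting from an empty DataFrame is to build one small frame per file and fold them together with combine_first. This is only a sketch under assumptions: the helper frame_for_file and the files dict are hypothetical, and the peptide lists stand in for whatever clseq.processEmits() returns.

```python
from functools import reduce

import pandas as pd

KEYS = ["Clone", "Chain", "Fragment"]
ROUNDS = ["R0", "R1", "R2"]


def frame_for_file(filename, peptides):
    """Build a single-round frame from a name like 'E2_VH_Fab_R0.fasta'."""
    clone, chain, fragment, rnd = filename.split(".")[0].split("_")
    df = pd.DataFrame({"Clone": clone, "Chain": chain,
                       "Fragment": fragment, rnd: peptides})
    # Number the rows within each key group so the index is unique.
    df["idx"] = df.groupby(KEYS).cumcount()
    return df.set_index(KEYS + ["idx"])


# Hypothetical peptide lists standing in for clseq.processEmits().
files = {
    "E2_VH_Fab_R0.fasta": ["ABCD", "DJFZ", "DHRZ"],
    "E2_VH_Fab_R1.fasta": ["ABCD", "DJFZ", "DHRZ", "JGJZ"],
}

frames = [frame_for_file(name, peps) for name, peps in files.items()]

# Fold all frames into one; rows and columns missing on the left are
# taken from the right, everything else stays NaN.
combined = reduce(lambda left, right: left.combine_first(right), frames)

# Make sure every round column exists, even if no file supplied it.
combined = combined.reindex(columns=ROUNDS).reset_index()
print(combined)
```

Each file contributes only its own round column, so there is no need to pad the other rounds with the string "NaN" by hand; the missing cells come out as real NaN values.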
