Merge dataframes with existing values

Question

I'd like to merge two columns of unequal length in a pandas DataFrame.

I've tried many approaches with merge, concat and join, but none of them works.

keyList = ["Clone", "Chain", "Fragment", "R0", "R1", "R2"]
dataDict = {key: [] for key in keyList}
# Example for different list length
plist1 = ["ABCD", "DJFZ", "DHRZ"]
plist2 = ["ABCD", "DJFZ", "DHRZ", "JGJZ"]

filelist = ["E2_VH_Fab_R0.fasta", "E2_VH_scFV_R0.fasta", "E2_VH_Fab_R1.fasta", "E2_VH_scFV_R1.fasta","E2_VH_Fab_R2.fasta" ]

# Subsets are:
# E1 || E2 with VH || VL with Fab || scFV with R0 || R1 || R2 

for datafile in filelist:
    # Get the list of emits from the class function
    peptidelist = clseq.processEmits()
    # Split the filename into its four parts: clone, chain, fragment, round
    sclone, schain, sfragment, sround = datafile.split('.')[0].split('_')

    # Iterate through peptide list and add the subsets into the dict
    for peptide in peptidelist:
        dataDict.setdefault("Clone", []).append(sclone)
        dataDict.setdefault("Chain", []).append(schain)
        dataDict.setdefault("Fragment", []).append(sfragment)
        # Set other Rounds as "NaN" to equal the length
        if "R0" in sround:
            dataDict.setdefault("R0", []).append(peptide)
            dataDict.setdefault("R1", []).append("NaN")
            dataDict.setdefault("R2", []).append("NaN")
        elif "R1" in sround:
            dataDict.setdefault("R0", []).append("NaN")
            dataDict.setdefault("R1", []).append(peptide)
            dataDict.setdefault("R2", []).append("NaN")
        elif "R2" in sround:
            dataDict.setdefault("R0", []).append("NaN")
            dataDict.setdefault("R1", []).append("NaN")
            dataDict.setdefault("R2", []).append(peptide)
        else:
            dataDict.setdefault("R0", []).append("NaN")
            dataDict.setdefault("R1", []).append("NaN")
            dataDict.setdefault("R2", []).append("NaN")

    dtframe.merge(pd.DataFrame(dataDict), on=['Clone', 'Chain',  'Fragment'], how='inner')

The problem is that I have lists of different lengths that I'd like to merge into one dataframe, padding the rest with NaN.

This:

0    E2    VH      Fab  r0  nan
1    E2    VH      Fab  r0  nan
2    E2    VH      Fab  r0  nan
3    E2    VH      Fab  r0  nan
4    E2    VH      Fab  r0  nan
5    E2    VH      Fab  r0  nan

and this:

0    E2    VH      Fab  nan  r1
1    E2    VH      Fab  nan  r1
2    E2    VH      Fab  nan  r1
3    E2    VH      Fab  nan  r1
4    E2    VH      Fab  nan  r1
5    E2    VH      Fab  nan  r1
6    E2    VH      Fab  nan  r1
7    E2    VH      Fab  nan  r1

Should result in this:

0     E2    VH      Fab  r0  r1
1     E2    VH      Fab  r0  r1
2     E2    VH      Fab  r0  r1
3     E2    VH      Fab  r0  r1
4     E2    VH      Fab  r0  r1
5     E2    VH      Fab  r0  r1
6     E2    VH      Fab  nan  r1
7     E2    VH      Fab  nan  r1

Beware that all of my data fields are strings.

Do you want pd.concat([df1, df2[~df2.index.isin(df1.index)]])?
– Erfan, Commented Oct 11, 2019 at 21:18
I don't know, a lot of this seems like it could be avoided if you handled the data better in the if-elif clauses.
– ALollz, Commented Oct 11, 2019 at 21:22
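
One possible reading of ALollz's comment, as a rough sketch rather than anything the commenter spelled out: instead of padding the other rounds with "NaN" strings, record one row per peptide together with its round and its position in the file, then reshape once at the end. The `emits` dict below is a hypothetical stand-in for `clseq.processEmits()`, and the `idx`, `Round` and `Peptide` column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for clseq.processEmits(): peptides found per file.
emits = {
    "E2_VH_Fab_R0.fasta": ["ABCD", "DJFZ", "DHRZ"],
    "E2_VH_Fab_R1.fasta": ["ABCD", "DJFZ", "DHRZ", "JGJZ"],
}

records = []
for datafile, peptides in emits.items():
    # "E2_VH_Fab_R0.fasta" -> clone, chain, fragment, round
    clone, chain, fragment, rnd = datafile.split(".")[0].split("_")
    for idx, peptide in enumerate(peptides):
        records.append({
            "Clone": clone, "Chain": chain, "Fragment": fragment,
            "idx": idx,       # position of the peptide within its file
            "Round": rnd,     # R0, R1 or R2
            "Peptide": peptide,
        })

long_df = pd.DataFrame(records)

# One column per round; positions missing in a round become real NaN.
wide = (
    long_df.set_index(["Clone", "Chain", "Fragment", "idx", "Round"])["Peptide"]
    .unstack("Round")
    .reset_index()
)
wide.columns.name = None
print(wide)
```

Because the shorter file has no row at the last position, the reshape leaves a genuine missing value there, so no placeholder strings and no merge are needed.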

ALollz · Accepted Answer · 2019-10-11 21:24:50Z

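This is a job for combine_first. Set the index to the three columns you want to merge on, plus an additional cumcount level that numbers the rows within each group. That extra level keeps the rows aligned position by position, which matters for real data with many different groups.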

df1['idx'] = df1.groupby(['Clone', 'Chain', 'Fragment']).cumcount()
df2['idx'] = df2.groupby(['Clone', 'Chain', 'Fragment']).cumcount()

df1 = df1.set_index(['Clone', 'Chain', 'Fragment', 'idx'])
df2 = df2.set_index(['Clone', 'Chain', 'Fragment', 'idx'])

df1.combine_first(df2).reset_index()
#  Clone Chain Fragment  idx   R0  R1
#0    E2    VH      Fab    0   r0  r1
#1    E2    VH      Fab    1   r0  r1
#2    E2    VH      Fab    2   r0  r1
#3    E2    VH      Fab    3   r0  r1
#4    E2    VH      Fab    4   r0  r1
#5    E2    VH      Fab    5   r0  r1
#6    E2    VH      Fab    6  NaN  r1
#7    E2    VH      Fab    7  NaN  r1

df1

  Clone Chain Fragment  R0  R1
0    E2    VH      Fab  r0 NaN
1    E2    VH      Fab  r0 NaN
2    E2    VH      Fab  r0 NaN
3    E2    VH      Fab  r0 NaN
4    E2    VH      Fab  r0 NaN
5    E2    VH      Fab  r0 NaN

df2

  Clone Chain Fragment  R0  R1
0    E2    VH      Fab NaN  r1
1    E2    VH      Fab NaN  r1
2    E2    VH      Fab NaN  r1
3    E2    VH      Fab NaN  r1
4    E2    VH      Fab NaN  r1
5    E2    VH      Fab NaN  r1
6    E2    VH      Fab NaN  r1
7    E2    VH      Fab NaN  r1
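
For anyone who wants to run the above end to end, here is a minimal self-contained sketch that rebuilds the two example frames and applies the same cumcount plus combine_first steps. The helpers `make_frame` and `with_position_index` are hypothetical names introduced only for this illustration.

```python
import numpy as np
import pandas as pd

keys = ["Clone", "Chain", "Fragment"]

def make_frame(n_rows, r0, r1):
    # Hypothetical helper: rebuilds the example frames shown above.
    return pd.DataFrame({
        "Clone": ["E2"] * n_rows,
        "Chain": ["VH"] * n_rows,
        "Fragment": ["Fab"] * n_rows,
        "R0": [r0] * n_rows,
        "R1": [r1] * n_rows,
    })

df1 = make_frame(6, "r0", np.nan)   # six rows, only R0 filled
df2 = make_frame(8, np.nan, "r1")   # eight rows, only R1 filled

def with_position_index(df):
    # Number the rows within each (Clone, Chain, Fragment) group and
    # make that counter part of the index so rows align by position.
    out = df.copy()
    out["idx"] = out.groupby(keys).cumcount()
    return out.set_index(keys + ["idx"])

merged = (
    with_position_index(df1)
    .combine_first(with_position_index(df2))
    .reset_index()
)
print(merged)
```

combine_first keeps the values of the calling frame and fills its missing entries from the other frame, so the two extra rows that exist only in df2 keep a missing R0.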

4 Comments

chfirex Over a year ago

Thank you, your code example looks good, but it doesn't work. I got: KeyError: "None of ['Clone', 'Chain', 'Fragment'] are in the columns"

chfirex Over a year ago

Error is: KeyError: "None of ['Clone', 'Chain', 'Fragment'] are in the columns"

ALollz Over a year ago

What are your column names?

chfirex Over a year ago

I have to create an empty DataFrame initially (before the for loop) to push the values into: keyList = ["Clone", "Chain", "Fragment", "R0", "R1", "R2"]; pd.DataFrame(columns=keyList)
