How to merge duplicate column and sum their value?
What I have
A 30
A 40
B 50
What I need
A 70
B 50
DF for this example
d = {'address': ["A", "A", "B"], 'balances': [30, 40, 50]}
df = pd.DataFrame(data=d)
df
In another case when you have a dataset with several duplicated columns and you wouldn't want to select them separately use:
df.groupby(by=df.columns, axis=1).sum()
df.T.groupby(by=df.columns).sum().transpose() for pandas newer versions. df.groupby(axis=1) is deprecated since pandas v2.1.0. For axis=1, do frame.T.groupby(...) instead.
$\endgroup$
You may use
df2 = df.groupby(['address']).sum()
or
df2 = df.groupby(['address']).agg('sum')
If there are columns other than balances that you want to peak only the first or max value, or do mean instead of sum, you can go as follows:
d = {'address': ["A", "A", "B"], 'balances': [30, 40, 50], 'sessions': [2, 3, 4]}
df = pd.DataFrame(data=d)
df2 = df.groupby(['address']).agg({'balances': 'sum', 'sessions': 'mean'})
That outputs
balances sessions
address
A 70 2.5
B 50 4.0
You may add as_index = False to groupby arguments to have:
address balances sessions
0 A 70 2.5
1 B 50 4.0
adresses the sessions value is the same and you just want to keep it? {'address': ["A", "A", "B"], 'balances': [30, 40, 50], 'sessions': ["V","V","K"]} to {'address': ["A", "B"], 'balances': [70, 50], 'sessions': ["V","K"]}
$\endgroup$
'sessions': 'first' which keeps one of two (or more) "V"s
$\endgroup$