Python Pandas merge and update dataframe

Question

I am currently using Python and Pandas to build a stock price "database". I managed to find some code to download the stock prices.

df1 is my existing database. Each time I download share prices, the result looks like df2 or df3. I then need to combine df1, df2 and df3 so that the result looks like df4.

Each stock has its own column. Each date has its own row.

df1: Existing database

+----------+-------+----------+--------+
|   Date   | Apple | Facebook | Google |
+----------+-------+----------+--------+
| 1/1/2018 |   161 |       58 |   1000 |
| 2/1/2018 |   170 |       80 |        |
| 3/1/2018 |   190 |       84 |    100 |
+----------+-------+----------+--------+

df2: New data (2/1/2018 and 4/1/2018) and updated data (3/1/2018) for Google.

+----------+--------+
|   Date   | Google |
+----------+--------+
| 2/1/2018 |    500 |
| 3/1/2018 |    300 |
| 4/1/2018 |    200 |
+----------+--------+

df3: New data for Amazon

+----------+--------+
|   Date   | Amazon |
+----------+--------+
| 1/1/2018 |   1000 |
| 2/1/2018 |   1500 |
| 3/1/2018 |   2000 |
| 4/1/2018 |   3000 |
+----------+--------+

df4 Final output: Basically, it merges and updates all the data into the database. (df1 + df2 + df3) --> this will be the updated database of df1

+----------+-------+----------+--------+--------+
|   Date   | Apple | Facebook | Google | Amazon |
+----------+-------+----------+--------+--------+
| 1/1/2018 |   161 |       58 |   1000 |   1000 |
| 2/1/2018 |   170 |       80 |    500 |   1500 |
| 3/1/2018 |   190 |       84 |    300 |   2000 |
| 4/1/2018 |       |          |    200 |   3000 |
+----------+-------+----------+--------+--------+

I do not know how to combine df1 and df3.

And I do not know how to combine df1 and df2 (add new row: 4/1/2018) while at the same time updating the data (2/1/2018 -> Original data: NaN; amended data: 500 | 3/1/2018 -> Original data: 100; amended data: 300) and leaving the existing intact data (1/1/2018).

Can anyone help me to get df4? =)

Thank you.

EDIT: Based on Sociopath's suggestion, I amended the code to:

dataframes = [df2, df3]
df4 = df1

for i in dataframes:
    # Merge the dataframe
    df4 = df4.merge(i, how='outer', on='date')

    # Get the stock name
    stock_name = i.columns[1]

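    # If the merge produced "_x"/"_y" columns, the stock already existed:
    # prefer the new value ("_y") and fall back to the old one ("_x")
    if stock_name+"_x" in df4.columns:
        x = stock_name+"_x"
        y = stock_name+"_y"
        df4[stock_name] = df4[y].fillna(df4[x])
        # axis must be passed by keyword; the positional form was removed in pandas 2.0
        df4.drop([x, y], axis=1, inplace=True)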
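
The loop above only reconciles the first non-date column of each downloaded frame. Below is a minimal sketch of a more general helper, assuming every frame shares a `date` key column. The function name `merge_update` and the `_old`/`_new` suffixes are hypothetical choices for illustration, not something pandas provides.

```python
import pandas as pd

def merge_update(base, new, key="date"):
    """Outer-merge `new` into `base`; for shared columns prefer values from `new`."""
    # Explicit suffixes make it obvious which side each duplicated column came from.
    merged = base.merge(new, how="outer", on=key, suffixes=("_old", "_new"))
    shared = [c for c in new.columns if c != key and c in base.columns]
    for col in shared:
        # Take the new value where present, otherwise keep the old one.
        merged[col] = merged[col + "_new"].fillna(merged[col + "_old"])
        merged = merged.drop(columns=[col + "_old", col + "_new"])
    return merged

base = pd.DataFrame({"date": ["1/1/2018", "2/1/2018", "3/1/2018"],
                     "Apple": [161, 170, 190],
                     "Google": [1000, None, 100]})
new = pd.DataFrame({"date": ["2/1/2018", "3/1/2018", "4/1/2018"],
                    "Google": [500, 300, 200]})

updated = merge_update(base, new)
print(updated)
```

Because the helper returns a new frame, it can be applied in a loop over any number of downloaded frames without naming a stock explicitly.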

Sociopath · Accepted Answer · 2019-01-07 13:45:50Z


You need merge:

import pandas as pd

# df1: new Google prices, df2: new Amazon prices, df3: the existing database
df1 = pd.DataFrame({'date':['2/1/2018','3/1/2018','4/1/2018'], 'Google':[500,300,200]})
df2 = pd.DataFrame({'date':['1/1/2018','2/1/2018','3/1/2018','4/1/2018'], 'Amazon':[1000,1500,2000,3000]})
df3 = pd.DataFrame({'date':['1/1/2018','2/1/2018','3/1/2018'], 'Apple':[161,171,181], 'Google':[1000,None,100], 'Facebook':[58,75,65]})

If the column is not yet present in the current database, simply use merge as below:

df_new = df3.merge(df2, how='outer',on=['date'])

If the column is already present in the database, merge and then use fillna to update the values as below:

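# Both frames have a 'Google' column, so merge yields 'Google_x' (old) and 'Google_y' (new)
df_new = df_new.merge(df1, how='outer', on='date')
#print(df_new)
# Prefer the new value and fall back to the old one where the new one is missing
df_new['Google'] = df_new['Google_y'].fillna(df_new['Google_x'])
df_new.drop(['Google_x','Google_y'], axis=1, inplace=True)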

Output:

    date       Apple    Facebook    Amazon  Google
0   1/1/2018    161.0   58.0        1000    1000.0
1   2/1/2018    171.0   75.0        1500    500.0
2   3/1/2018    181.0   65.0        2000    300.0
3   4/1/2018    NaN     NaN         3000    200.0
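
The merge-then-fillna steps above can also be expressed with `DataFrame.combine_first`, which takes the caller's values where they exist and falls back to the other frame otherwise. This is a minimal sketch rather than the answer's method. It assumes the date column is moved into the index, and the variable names `existing`, `google_new` and `amazon_new` are illustrative only.

```python
import pandas as pd

# Existing database, indexed by date (Google is missing on 2/1/2018).
existing = pd.DataFrame({
    "Date": ["1/1/2018", "2/1/2018", "3/1/2018"],
    "Apple": [161, 170, 190],
    "Facebook": [58, 80, 84],
    "Google": [1000, None, 100],
}).set_index("Date")

# Newly downloaded data: updated Google prices and a brand-new Amazon column.
google_new = pd.DataFrame({
    "Date": ["2/1/2018", "3/1/2018", "4/1/2018"],
    "Google": [500, 300, 200],
}).set_index("Date")
amazon_new = pd.DataFrame({
    "Date": ["1/1/2018", "2/1/2018", "3/1/2018", "4/1/2018"],
    "Amazon": [1000, 1500, 2000, 3000],
}).set_index("Date")

result = existing
for new in (google_new, amazon_new):
    # New values win; gaps in `new` are filled from the accumulated result.
    # The row and column labels of both frames are unioned automatically.
    result = new.combine_first(result)

print(result)
```

With this approach a missing value in the new data never overwrites an existing price, and no `_x`/`_y` column names need to be handled.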


5 Comments

KaiWei Over a year ago

ok cool. thanks =). I only got 1 more question. The example is Google. However, it might not be Google. Is there any way to avoid hardcoded 'Google_y' and 'Google_x' in the code?

Sociopath Over a year ago

Yes, there is! I'm assuming you will have multiple dataframes. Get their column names (i.e. the stock names), loop over them with a for loop and append the suffixes _x and _y. That way you don't have to hard-code the column names.

KaiWei Over a year ago

Cool! Thanks a lot!!! I have amended your suggested generic solution. See my edit. =)

BoboDarph Over a year ago

@KaiWei might want to accept the answer too then :D

mike01010 Over a year ago

I think you may be able to do this with loc, in fewer lines. Something like: df1.loc[df2.index, df2.columns] = df2. Not sure which would be more performant/faster, though.
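
A minimal sketch of the loc idea from the comment above, assuming both frames are indexed by date. The `reindex` step is an addition not mentioned in the comment: assigning through `.loc` with a list of row labels that do not exist yet raises a `KeyError`, so the frame is enlarged first.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["1/1/2018", "2/1/2018", "3/1/2018"],
    "Apple": [161, 170, 190],
    "Google": [1000, None, 100],
}).set_index("Date")
df2 = pd.DataFrame({
    "Date": ["2/1/2018", "3/1/2018", "4/1/2018"],
    "Google": [500, 300, 200],
}).set_index("Date")

# Enlarge df1 so that every row and column label of df2 exists in it.
df1 = df1.reindex(index=df1.index.union(df2.index),
                  columns=df1.columns.union(df2.columns, sort=False))

# Overwrite the overlapping block with the values from df2.
df1.loc[df2.index, df2.columns] = df2
print(df1)
```

Unlike the fillna-based approach, this overwrites existing prices even where df2 holds NaN, so it fits best when the downloaded data has no gaps.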
