I am using Python and pandas to build a stock price "database". I found some code to download the stock prices.

df1 is my existing database. Each time I download share prices, the result looks like df2 or df3. I then need to combine df1, df2 and df3 so that the result looks like df4.

Each stock has its own column. Each date has its own row.

df1: Existing database

+----------+-------+----------+--------+
|   Date   | Apple | Facebook | Google |
+----------+-------+----------+--------+
| 1/1/2018 |   161 |       58 |   1000 |
| 2/1/2018 |   170 |       80 |        |
| 3/1/2018 |   190 |       84 |    100 |
+----------+-------+----------+--------+

df2: New data (2/1/2018 and 4/1/2018) and updated data (3/1/2018) for Google.

+----------+--------+
|   Date   | Google |
+----------+--------+
| 2/1/2018 |    500 |
| 3/1/2018 |    300 |
| 4/1/2018 |    200 |
+----------+--------+

df3: New data for Amazon

+----------+--------+
|   Date   | Amazon |
+----------+--------+
| 1/1/2018 |   1000 |
| 2/1/2018 |   1500 |
| 3/1/2018 |   2000 |
| 4/1/2018 |   3000 |
+----------+--------+

df4: Final output. It merges and updates all the data into the database (df1 + df2 + df3); this becomes the updated df1.

+----------+-------+----------+--------+--------+
|   Date   | Apple | Facebook | Google | Amazon |
+----------+-------+----------+--------+--------+
| 1/1/2018 |   161 |       58 |   1000 |   1000 |
| 2/1/2018 |   170 |       80 |    500 |   1500 |
| 3/1/2018 |   190 |       84 |    300 |   2000 |
| 4/1/2018 |       |          |    200 |   3000 |
+----------+-------+----------+--------+--------+

I do not know how to combine df1 and df3 (adding a new column).

I also do not know how to combine df1 and df2. This needs to add a new row (4/1/2018), update existing data (2/1/2018: original NaN, amended 500; 3/1/2018: original 100, amended 300), and leave the untouched data (1/1/2018) intact.

Can anyone help me to get df4? =)

Thank you.

EDIT: Based on Sociopath's suggestion, I amended the code to:

dataframes = [df2, df3]
df4 = df1

for i in dataframes:
    # Merge the dataframe
    df4 = df4.merge(i, how='outer', on='date')

    # Get the stock name
    stock_name = i.columns[1]

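    # If the stock already existed, the merge produced "_x"/"_y" columns; combine them
    if stock_name+"_x" in df4.columns:
        x = stock_name+"_x"
        y = stock_name+"_y"
        # Prefer the new value (_y) and fall back to the existing one (_x)
        df4[stock_name] = df4[y].fillna(df4[x])
        # Positional axis arguments to drop() were removed in pandas 2.0
        df4.drop(columns=[x, y], inplace=True)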

1 Answer

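You need merge. Note that the sample frames below are named differently from the question: here df3 is the existing database.

import pandas as pd

# df1: new and updated Google prices
df1 = pd.DataFrame({'date':['2/1/2018','3/1/2018','4/1/2018'], 'Google':[500,300,200]})
# df2: prices for a stock not yet in the database
df2 = pd.DataFrame({'date':['1/1/2018','2/1/2018','3/1/2018','4/1/2018'], 'Amazon':[1000,1500,2000,3000]})
# df3: the existing database
df3 = pd.DataFrame({'date':['1/1/2018','2/1/2018','3/1/2018'], 'Apple':[161,171,181], 'Google':[1000,None,100], 'Facebook':[58,75,65]})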

If the column is not yet present in the database, a plain outer merge is enough:

df_new = df3.merge(df2, how='outer',on=['date'])

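If the column is already present in the database, the merge creates Google_x (existing) and Google_y (new). Use fillna to take the new values and fall back to the existing ones:

df_new = df_new.merge(df1, how='outer', on='date')
#print(df_new)
df_new['Google'] = df_new['Google_y'].fillna(df_new['Google_x'])
# Positional axis arguments to drop() were removed in pandas 2.0
df_new.drop(columns=['Google_x','Google_y'], inplace=True)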

Output:

    date       Apple    Facebook    Amazon  Google
0   1/1/2018    161.0   58.0        1000    1000.0
1   2/1/2018    171.0   75.0        1500    500.0
2   3/1/2018    181.0   65.0        2000    300.0
3   4/1/2018    NaN     NaN         3000    200.0
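
As an alternative to merge plus fillna, the same result can probably be reached with DataFrame.combine_first once 'date' is the index. This is only a minimal sketch: the helper name upsert_prices and the variable names db, google and amazon are made up for illustration, and it assumes every frame has a 'date' column with unique values.

```python
import pandas as pd

# Existing database and two newly downloaded frames (same values as the sample data above)
db = pd.DataFrame({'date': ['1/1/2018', '2/1/2018', '3/1/2018'],
                   'Apple': [161, 171, 181],
                   'Google': [1000, None, 100],
                   'Facebook': [58, 75, 65]})
google = pd.DataFrame({'date': ['2/1/2018', '3/1/2018', '4/1/2018'],
                       'Google': [500, 300, 200]})
amazon = pd.DataFrame({'date': ['1/1/2018', '2/1/2018', '3/1/2018', '4/1/2018'],
                       'Amazon': [1000, 1500, 2000, 3000]})

def upsert_prices(db, new, key='date'):
    """Hypothetical helper: values in `new` win, gaps are filled from `db`."""
    # combine_first keeps the caller's non-null values and takes the rest
    # from the argument, over the union of rows and columns.
    merged = new.set_index(key).combine_first(db.set_index(key))
    return merged.reset_index()

result = db
for frame in (google, amazon):
    result = upsert_prices(result, frame)

print(result)
```

No "_x"/"_y" suffix handling is needed here, so the same call covers both new columns and updated columns. The column order of the result may differ from the original database, and a NaN in the new data does not overwrite an existing value.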

EDIT

A more generic solution for the second part, so that the stock name is not hardcoded:

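# New frames to merge in; each is assumed to hold 'date' plus one stock column
dataframes = [df2, df3, df4]

for i in dataframes:
    stock_name = i.columns.difference(['date'])[0]
    df_new = df_new.merge(i, how='outer', on='date')
    x = stock_name+"_x"
    y = stock_name+"_y"

    # The suffixed columns exist only if the stock was already in df_new
    if x in df_new.columns:
        # Prefer the new value (_y) and fall back to the existing one (_x)
        df_new[stock_name] = df_new[y].fillna(df_new[x])
        df_new.drop(columns=[x, y], inplace=True)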

5 Comments

OK, cool. Thanks =). I have just one more question. The example is Google, but it might not always be Google. Is there any way to avoid hardcoding 'Google_y' and 'Google_x' in the code?
Yes, there is! I'm assuming you will have multiple dataframes. Get their column names (i.e. the stock names), loop over them, and add the suffixes _x and _y. That way you don't have to hardcode the column name.
Cool! Thanks a lot!!! I have amended your suggested generic solution. See my edit. =)
@KaiWei might want to accept the answer too then :D
I think you may be able to do this with loc, in fewer lines. Something like: df1.loc[df2.index, df2.columns] = df2. Not sure which would be more performant/faster though.
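
The loc idea from the last comment can be sketched roughly as follows. It assumes 'date' has been set as the index of both frames, and that the database is first enlarged with reindex so every label in the new data already exists before the assignment. The function name upsert_with_loc and the variable names are made up for illustration.

```python
import pandas as pd

def upsert_with_loc(db, new):
    """Hypothetical helper: enlarge db to cover new's labels, then assign new's values."""
    rows = db.index.union(new.index)
    cols = db.columns.union(new.columns, sort=False)
    # reindex returns a new frame; labels missing from db come back as NaN
    out = db.reindex(index=rows, columns=cols)
    # The assignment aligns `new` on index and columns
    out.loc[new.index, new.columns] = new
    return out

db = pd.DataFrame({'date': ['1/1/2018', '2/1/2018', '3/1/2018'],
                   'Apple': [161, 170, 190],
                   'Facebook': [58, 80, 84],
                   'Google': [1000, None, 100]}).set_index('date')
google = pd.DataFrame({'date': ['2/1/2018', '3/1/2018', '4/1/2018'],
                       'Google': [500, 300, 200]}).set_index('date')
amazon = pd.DataFrame({'date': ['1/1/2018', '2/1/2018', '3/1/2018', '4/1/2018'],
                       'Amazon': [1000, 1500, 2000, 3000]}).set_index('date')

out = upsert_with_loc(upsert_with_loc(db, google), amazon)
print(out)
```

Unlike the fillna approach, a NaN in the new data would overwrite an existing value here, so this only fits when the downloaded frames are complete.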
