1

I have a table that contains the columns Date, Identifier, and Price e.g.

| Identifier |   Date   | Price |
|------------|----------|-------|
|    693477AA|1990/10/31|    100|
|    353477ZB|1991/08/31|    101|
|    123457ZB|1992/08/31|    105|

I'm using pandas read_sql function to grab the data from a SQL Server database. Either using SQL or pandas DataFrame functionality I need to get the data into the following pandas DataFrame format.

           693477AA     353477ZB     123457ZB
Date           
1988-1-1        NaN          NaN        99.41
1988-1-2     100.54          NaN        98.11
1988-1-3      99.45          NaN          NaN 

So that there is a (possibly Null) price entry for every DISTINCT Date in the table. For the set of Identifiers satisfying a condition.

Right now I have it working with a for loop,

data = []
identifiers = "SELECT DISTINCT Identifier FROM TABLE WHERE [Condition]"
for id in identifiers:
    data.append("SELECT Date, Price FROM TABLE WHERE Identifier=[id] ORDER BY DATE")
pandas.concat(data, axis=1)

However this only works for very strict [Condition] since the table is quite large (>3M rows).

How can I implement SQL, DataFrame manipulation, or a combination of both to achieve the desired format?

Thanks.

1
  • Why use a loop to query the db? Commented Aug 3, 2017 at 21:07

2 Answers 2

6

we can use pivot() function:

In [144]: df.pivot(index='Date', columns='Identifier', values='Price').rename_axis(None, 1)
Out[144]:
            123457ZB  353477ZB  693477AA
Date
1990/10/31       NaN       NaN     100.0
1991/08/31       NaN     101.0       NaN
1992/08/31     105.0       NaN       NaN

or DataFrame.unstack():

In [149]: df.set_index(['Date','Identifier'])['Price'].unstack('Identifier')
Out[149]:
Identifier  123457ZB  353477ZB  693477AA
Date
1990/10/31       NaN       NaN     100.0
1991/08/31       NaN     101.0       NaN
1992/08/31     105.0       NaN       NaN

or crosstab():

In [154]: pd.crosstab(index=df['Date'], columns=df['Identifier'], 
                      values=df['Price'], aggfunc='first') \
            .rename_axis(None, 1)
Out[154]:
            123457ZB  353477ZB  693477AA
Date
1990/10/31       NaN       NaN     100.0
1991/08/31       NaN     101.0       NaN
1992/08/31     105.0       NaN       NaN

or pivot_table():

In [156]: df.pivot_table(index='Date', columns='Identifier', values='Price', fill_value=0).rename_axis(None, 1)
Out[156]:
            123457ZB  353477ZB  693477AA
Date
1990/10/31         0         0       100
1991/08/31         0       101         0
1992/08/31       105         0         0

PS if you prefer to "pivot" data on the SQL Server side - please check this question

Sign up to request clarification or add additional context in comments.

Comments

1

Or you can using groupby.

df.groupby(['Identifier','Date'])['Price'].sum().unstack('Identifier')

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.