Setting dataframe values in range of dates specified in two columns

Question

I have two tables:

A three-level index of stock symbols cik, buy dates t0, and sell dates t1.
A blank position DataFrame with range of dates in the index column and stock symbols across the columns.

I need to iterate through the first index and set the values in the position matrix to 1 wherever the date falls in the range [t0, t1] for that symbol. The rest should be left at zero.

Sell Index sell_idx

MultiIndex([('AAPL', '2020-03-12', '2020-03-13'),
            ( 'IBM', '2020-03-13', '2020-03-16')],
           )

Position Matrix pos

            FB  AAPL  IBM
2020-03-12   0     0    0
2020-03-13   0     0    0
2020-03-16   0     0    0

Expected output

            FB  AAPL  IBM
2020-03-12   0     1    0
2020-03-13   0     1    1
2020-03-16   0     0    1

I have done this successfully iteratively and frankly it's not even that slow:

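import pandas as pd

# One entry per trade: (symbol, buy date t0, sell date t1)
idx = pd.MultiIndex.from_tuples(
    [
        ('AAPL', pd.Timestamp('2020-03-12'), pd.Timestamp('2020-03-13')),
        ('IBM', pd.Timestamp('2020-03-13'), pd.Timestamp('2020-03-16')),
    ]
)

# Position matrix: dates down the index, symbols across the columns, all zeros
pos = pd.DataFrame(
    0,
    columns=['FB', 'AAPL', 'IBM'],
    index=[
        pd.Timestamp('2020-03-12'),
        pd.Timestamp('2020-03-13'),
        pd.Timestamp('2020-03-16'),
    ],
)

# Label-based slicing on a DatetimeIndex includes both endpoints
for i in idx:
    pos.loc[i[1]:i[2], i[0]] = 1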

I would like to vectorize this code. How would I use advanced pandas slicing/indexing to do this without apply or for?

What is your expected output for the sample input? Not sure what the column names in pos are supposed to represent.
– not_speshal, Commented Feb 14, 2022 at 19:05

If sell_idx consists of (1800, '2020-03-12', '2020-03-13') then expected output for pos would be:

2020-03-12   0     1    0
2020-03-13   0     1    0
2020-03-16   0     0    0

– Ilya Voytov, Commented Feb 14, 2022 at 19:09

I think apply would arguably be less readable than what I already have, as you can't use assignment in a lambda function. So my guess is it won't improve performance or readability?
– Ilya Voytov, Commented Feb 14, 2022 at 19:21

Kindly provide the MultiIndex as code (pd.MultiIndex.from_tuples...), as a dictionary, or as something else reproducible.
– sammywemmy, Commented Feb 14, 2022 at 19:55

sammywemmy · Accepted Answer · 2022-02-14 22:58:44Z


Build an interval index; luckily your data does not have overlaps:

intervals = pd.IntervalIndex.from_arrays(idx.get_level_values(1),
                                         idx.get_level_values(-1),
                                         closed="both")

Get matches:

arr = intervals.get_indexer(pos.index)
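To make the matching step concrete, here is a small sketch of what get_indexer returns. The dates below are hypothetical and deliberately non-overlapping; they are not the question's data. As far as I know, get_indexer refuses an IntervalIndex whose intervals overlap, and two closed="both" intervals that share an endpoint count as overlapping.

```python
import pandas as pd

# Hypothetical, non-overlapping holding periods (illustration only)
starts = pd.to_datetime(["2020-03-12", "2020-03-16"])
ends = pd.to_datetime(["2020-03-13", "2020-03-17"])
intervals = pd.IntervalIndex.from_arrays(starts, ends, closed="both")

# Dates to look up; the last one falls outside every interval
dates = pd.to_datetime(["2020-03-12", "2020-03-13", "2020-03-16", "2020-03-18"])

# For each date: the position of the interval containing it, or -1 if none does
arr = intervals.get_indexer(dates)
print(arr)
```

A -1 entry deserves attention before it is used for positional indexing, since -1 selects the last element rather than signalling "no match".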

Create new dataframe:

index = [pos.index, idx.get_level_values(0)[arr]]
mapping = pd.Series([1] * len(arr), index = index).unstack(fill_value = 0)

Get the columns of pos, if any, that do not exist in mapping:

difference = pos.columns.difference(mapping.columns)

Join to pos to get the final output:

pos.filter(difference).join(mapping, how="left")

            FB  AAPL  IBM
2020-03-12   0     1    0
2020-03-13   0     1    0
2020-03-16   0     0    1

This should scale well as the data size increases. Note, however, that it relies on the intervals not overlapping, and on there being no duplicates in the data (which is also what allows unstack to work).
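When holding periods do overlap, as they do in the question's sample data (AAPL and IBM are both held on 2020-03-13), one loop-free alternative is a NumPy broadcasting comparison. This is only a sketch and swaps in a different technique from the interval index above. The names held, col, and result are illustrative, and it assumes every symbol in idx appears in pos.columns.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [
        ("AAPL", pd.Timestamp("2020-03-12"), pd.Timestamp("2020-03-13")),
        ("IBM", pd.Timestamp("2020-03-13"), pd.Timestamp("2020-03-16")),
    ]
)
pos = pd.DataFrame(
    0,
    columns=["FB", "AAPL", "IBM"],
    index=pd.to_datetime(["2020-03-12", "2020-03-13", "2020-03-16"]),
)

dates = pos.index.to_numpy()[:, None]    # shape (n_dates, 1)
t0 = idx.get_level_values(1).to_numpy()  # shape (n_trades,)
t1 = idx.get_level_values(2).to_numpy()

# held[d, k] is True when date d lies in the closed range [t0[k], t1[k]]
held = (dates >= t0) & (dates <= t1)     # shape (n_dates, n_trades)

# Column position of each trade's symbol (assumes all symbols exist in pos)
col = pos.columns.get_indexer(idx.get_level_values(0))

# Set a 1 at (date row, symbol column) for every date/trade pair that matched
rows, trades = np.nonzero(held)
values = pos.to_numpy(copy=True)
values[rows, col[trades]] = 1

result = pd.DataFrame(values, index=pos.index, columns=pos.columns)
print(result)
```

The trade-off is memory: the boolean matrix has one cell per (date, trade) pair, so for very long histories with many trades the original loop may still be the more practical choice.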


3 Comments

Ilya Voytov Over a year ago

Unfortunately the data certainly has overlaps (we may hold more than one stock in the portfolio)... thank you for a thoughtful answer in any case.

sammywemmy Over a year ago

Kindly share an example of overlaps in the date boundaries

Ilya Voytov Over a year ago

done - there is overlap now
