I have two tables:

  1. A three-level index of stock symbols (cik), buy dates (t0), and sell dates (t1).
  2. A blank position DataFrame with a range of dates in the index and stock symbols across the columns.

I need to iterate through the first index and set the values in the position matrix to 1 wherever the date falls in the range [t0, t1] for that symbol. The rest should be left at zero.

Sell index sell_idx (named idx in the code below)

MultiIndex([('AAPL', '2020-03-12', '2020-03-13'),
            ( 'IBM', '2020-03-13', '2020-03-16')],
           )

Position Matrix pos

            FB  AAPL  IBM
2020-03-12   0     0    0
2020-03-13   0     0    0
2020-03-16   0     0    0

Expected output

            FB  AAPL  IBM
2020-03-12   0     1    0
2020-03-13   0     1    1
2020-03-16   0     0    1

I have done this successfully with a loop, and frankly it's not even that slow:

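import pandas as pd

# One (symbol, buy date t0, sell date t1) tuple per trade
idx = pd.MultiIndex.from_tuples(
    [
        ('AAPL', pd.Timestamp('2020-03-12'), pd.Timestamp('2020-03-13')),
        ('IBM', pd.Timestamp('2020-03-13'), pd.Timestamp('2020-03-16')),
    ]
)

# Position matrix: dates down the rows, symbols across the columns, all zeros
pos = pd.DataFrame(
    0,
    columns=['FB', 'AAPL', 'IBM'],
    index=[
        pd.Timestamp('2020-03-12'),
        pd.Timestamp('2020-03-13'),
        pd.Timestamp('2020-03-16'),
    ],
)

# Label slicing with .loc includes both endpoints, so this marks [t0, t1]
for i in idx:
    pos.loc[i[1]:i[2], i[0]] = 1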

I would like to vectorize this code. How would I use advanced pandas slicing/indexing to do this without apply or for?

Comments

  • What is your expected output for the sample input? Not sure what the column names in pos are supposed to represent. Commented Feb 14, 2022 at 19:05
  • If sell_idx consists of (1800, '2020-03-12', '2020-03-13') then the expected output for pos would be: 2020-03-12 → 0 1 0; 2020-03-13 → 0 1 0; 2020-03-16 → 0 0 0. Commented Feb 14, 2022 at 19:09
  • I think apply would arguably be less readable than what I already have, as you can't use assignment in a lambda function. So my guess is it won't improve performance or readability? Commented Feb 14, 2022 at 19:21
  • Kindly provide the MultiIndex as code: pd.MultiIndex.from_tuples..., or a dictionary, or something reproducible. Commented Feb 14, 2022 at 19:55
  • OK, updated the post with fully reproducible code. Thanks! Commented Feb 14, 2022 at 20:31

1 Answer

Build an interval index; this works because your data does not have overlapping date ranges:

intervals = pd.IntervalIndex.from_arrays(idx.get_level_values(1), 
                                         idx.get_level_values(-1), 
                                         closed="both" )

Get matches:

arr = intervals.get_indexer(pos.index)
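Since the get_indexer step depends on the intervals being non-overlapping, it may be worth checking that up front. The snippet below is a minimal sketch, not part of the original answer: it rebuilds the intervals from the question's sample dates and reads the IntervalIndex.is_overlapping property. The names t0, t1 and has_overlap are introduced here for illustration only. Note that with closed="both", two intervals that merely share an endpoint count as overlapping.

```python
import pandas as pd

# Buy and sell dates taken from the question's sample index
t0 = pd.DatetimeIndex(["2020-03-12", "2020-03-13"])
t1 = pd.DatetimeIndex(["2020-03-13", "2020-03-16"])

intervals = pd.IntervalIndex.from_arrays(t0, t1, closed="both")

# True when any two intervals share a point (including a closed endpoint)
has_overlap = bool(intervals.is_overlapping)
print(has_overlap)
```

If this prints True, the get_indexer approach is not applicable to the data as it stands, and a different technique is needed.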

Create new dataframe:

index = [pos.index, idx.get_level_values(0)[arr]]
mapping = pd.Series([1] * len(arr), index = index).unstack(fill_value = 0)

Get the columns of pos, if any, that do not exist in mapping:

difference = pos.columns.difference(mapping.columns)

Join to pos to get the final output:

pos.filter(difference).join(mapping, how="left")

            FB  AAPL  IBM
2020-03-12   0     1    0
2020-03-13   0     1    0
2020-03-16   0     0    1

This should scale well as the data size increases. Note, however, that it relies on the intervals not overlapping and on there being no duplicates in the data (which is also what allows unstack to work).
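For data where the holding periods do overlap, one possible alternative is plain NumPy broadcasting. This is a swapped-in technique, not the interval-index method above, and only a sketch under assumptions: every symbol in idx is assumed to exist as a column of pos, and the names in_range, col_of_trade, values and result are made up for this example. It compares every date against every [t0, t1] pair at once, so it builds a boolean array of shape (number of dates, number of trades), which could get large for long histories with many trades.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples([
    ("AAPL", pd.Timestamp("2020-03-12"), pd.Timestamp("2020-03-13")),
    ("IBM", pd.Timestamp("2020-03-13"), pd.Timestamp("2020-03-16")),
])
dates = pd.DatetimeIndex(["2020-03-12", "2020-03-13", "2020-03-16"])
pos = pd.DataFrame(0, index=dates, columns=["FB", "AAPL", "IBM"])

t0 = idx.get_level_values(1).to_numpy()   # buy dates, one per trade
t1 = idx.get_level_values(2).to_numpy()   # sell dates, one per trade
d = pos.index.to_numpy()[:, None]         # dates as a column vector

# in_range[r, k] is True when date r lies inside trade k's closed range
in_range = (d >= t0) & (d <= t1)

# Column position in pos for each trade's symbol (-1 means "not found")
col_of_trade = pos.columns.get_indexer(idx.get_level_values(0))
if (col_of_trade == -1).any():
    raise KeyError("a symbol in idx is missing from pos.columns")

# Turn the True cells into (row, column) pairs and set them to 1
rows, trades = np.nonzero(in_range)
values = pos.to_numpy().copy()
values[rows, col_of_trade[trades]] = 1

result = pd.DataFrame(values, index=pos.index, columns=pos.columns)
print(result)
```

Because several trades may write to the same cell without conflict, overlapping or duplicated ranges are harmless here; the cell just ends up as 1.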

3 Comments

Unfortunately the data certainly has overlaps (we may hold more than one stock in the portfolio)... thank you for a thoughtful answer in any case.
Kindly share an example of overlaps in the date boundaries
done - there is overlap now
