I have a MultiIndex DataFrame with 3 index levels and 2 numerical columns.

A   1   2017-04-01  14.0    87.346878
        2017-06-01  4.0     87.347504
    2   2014-08-01  1.0     123.110001
        2015-01-01  4.0     209.612503
B   3   2014-07-01  1.0     68.540001
        2014-12-01  1.0     64.370003
    4   2015-01-01  3.0     75.000000

I want to replace the values in the first row of the third index level wherever a new second-level index begins, i.e. in every group's first row:

(A,1,2017-04-01)->0.0   0.0 
(A,2,2014-08-01)->0.0   0.0  
(B,3,2014-07-01)->0.0   0.0  
(B,4,2015-01-01)->0.0   0.0

The DataFrame is too big, and doing it slice by slice like df.xs(('A', 1)) ... df.xs(('A', 2)) gets time-consuming. Is there some way I can get a mask and replace the values at these positions with new ones?

4 Answers

Use DataFrame.reset_index on level=2, then use DataFrame.groupby on level=[0, 1] and aggregate level_2 with first; then create a multilevel index with pd.MultiIndex.from_arrays, and finally use this index to change the values in the DataFrame:

idx = df.reset_index(level=2).groupby(level=[0, 1])['level_2'].first()
idx = pd.MultiIndex.from_arrays(idx.reset_index().to_numpy().T)
df.loc[idx, :] = 0

Result:

# print(df)
               col1        col2
A 1 2017-04-01  0.0    0.000000
    2017-06-01  4.0   87.347504
  2 2014-08-01  0.0    0.000000
    2015-01-01  4.0  209.612503
B 3 2014-07-01  0.0    0.000000
    2014-12-01  1.0   64.370003
  4 2015-01-01  0.0    0.000000

We can extract a series of the second-level index with:

df.index.get_level_values(1)
# output: Int64Index([1, 1, 2, 2, 3, 3, 4], dtype='int64')

And check where it changes with:

idx = df.index.get_level_values(1)
np.where(idx != np.roll(idx, 1))[0]
# output: array([0, 2, 4, 6])

So we can simply pass the positions returned by the second statement to iloc to select the first row of every second-level group and modify its values:

idx = df.index.get_level_values(1)
df.iloc[np.where(idx != np.roll(idx, 1))[0]] = 0

output:

                  value1      value2
A 1 2017-04-01       0.0    0.000000
    2017-06-01       4.0   87.347504
  2 2014-08-01       0.0    0.000000
    2015-01-01       4.0  209.612503
B 3 2014-07-01       0.0    0.000000
    2014-12-01       1.0   64.370003
  4 2015-01-01       0.0    0.000000

1 Comment

This wouldn't work in the general case, e.g. when the 5th index entry is ('B', 2, '2014-07-01'): the second-level value doesn't change there, so the comparison misses the group boundary.
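To see this caveat concretely, here is a small hypothetical frame where the second-level value 2 spans two first-level groups. Comparing only level 1 against its shifted copy misses the start of ('B', 2); a groupby over both levels (one possible fix, sketched here with cumcount) catches it:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second-level value 2 appears under both 'A' and 'B',
# so the group ('B', 2) starts at position 3 without level 1 changing.
df = pd.DataFrame(
    {'col1': [14.0, 4.0, 1.0, 4.0, 3.0],
     'col2': [87.3, 87.3, 123.1, 209.6, 75.0]},
    index=pd.MultiIndex.from_tuples([
        ('A', 1, '2017-04-01'), ('A', 1, '2017-06-01'),
        ('A', 2, '2014-08-01'),
        ('B', 2, '2014-07-01'),  # new group, but level 1 is still 2
        ('B', 4, '2015-01-01'),
    ]))

lvl1 = df.index.get_level_values(1)
np.where(lvl1 != np.roll(lvl1, 1))[0]
# flags positions 0, 2 and 4 only; position 3, the start of ('B', 2), is missed

# A level-aware mask catches it: cumcount restarts within each (level 0, level 1) group
mask = df.groupby(level=[0, 1]).cumcount() == 0
df.loc[mask] = 0
```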

You can use the grouper indices in a simple iloc:

df.iloc[[a[0] for a in df.groupby(level=[0, 1]).indices.values()]] = 0

Example:

df = pd.DataFrame({'col1': [14., 4., 1., 4., 1., 1., 3.],
                   'col2': [ 87.346878, 87.347504, 123.110001, 209.612503, 68.540001, 64.370003, 75.]},
                   index = pd.MultiIndex.from_tuples(([('A', 1, '2017-04-01'), ('A', 1, '2017-06-01'),
                                                       ('A', 2, '2014-08-01'), ('A', 2, '2015-01-01'),
                                                       ('B', 3, '2014-07-01'), ('B', 3, '2014-12-01'),
                                                       ('B', 4, '2015-01-01')])))

Result:

                col1        col2
A 1 2017-04-01   0.0    0.000000
    2017-06-01   4.0   87.347504
  2 2014-08-01   0.0    0.000000
    2015-01-01   4.0  209.612503
B 3 2014-07-01   0.0    0.000000
    2014-12-01   1.0   64.370003
  4 2015-01-01   0.0    0.000000

Timings:

%%timeit
idx = df.reset_index(level=2).groupby(level=[0, 1])['level_2'].first()
idx = pd.MultiIndex.from_arrays(idx.reset_index().to_numpy().T)
df.loc[idx, :] = 0
#6.7 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.iloc[[a[0] for a in df.groupby(level=[0, 1]).indices.values()]] = 0
#897 µs ± 6.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So this is about 7 times faster than the accepted answer.


I think you can use something like this:

import pandas as pd
import numpy as np

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

df = pd.DataFrame(np.random.randn(8, 4), index=index)
df

You can create a list of the unique values in the first index level, then get each group's starting position and use those positions to replace the values in the column.

lst = df.index.unique(level='first')   # ['bar', 'baz', 'foo', 'qux']
ls = []
for i in lst:
    base = df.index.get_loc(i)         # slice covering this first-level group
    a = base.indices(len(df))[0]       # position of the group's first row
    ls.append(a)

for ii in ls:
    df.iloc[ii, 0] = 0                 # set column 0 without chained assignment

df

Hopefully this helps.

Cheers!
