
I have a data frame like this:

df1
col1    col2
 1        A
 2        A
 3        A
 4        B
 5        A
 6        A
 7        B
 8        A
 9        A
10        A
11        C
12        C
13        A
14        A
15        C
16        A
17        C

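In the data frame above, the total numbers of B and C values are always even. I want to fill all the values between a pair of B values with B, and all the values between a pair of C values with C. Values are paired in order of appearance, so two consecutive C values (rows 11 and 12) already form a pair with nothing between them.

The final data frame should look like this: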

df1
col1    col2
 1        A
 2        A
 3        A
 4        B
 5        B
 6        B
 7        B
 8        A
 9        A
10        A
11        C
12        C
13        A
14        A
15        C
16        C
17        C

I could do this with a for loop, but the execution time would be huge. I am looking for a pandas shortcut or a Pythonic way to do it.

  • Why does 16 become C but 13 and 14 do not? What are the rules exactly? Can you write a for loop that implements exactly what you need, then we can optimize that? Commented Nov 16, 2019 at 7:54
  • Because there are already two consecutive C values in rows 11 and 12. Commented Nov 16, 2019 at 7:56
  • Interesting, do you see how you never mentioned that requirement in the question? Can you provide a simple, maybe slow but correct, for loop that does it? Commented Nov 16, 2019 at 7:58
  • This was obvious, @Jhon Zwinck. Commented Nov 17, 2019 at 11:43
  • Please check my answer, @Kallol Samanta. Commented Nov 17, 2019 at 11:43

2 Answers


The idea is to keep only the isolated B or C values (those that are not part of a consecutive run) and replace everything else with missing values. Then forward fill the missing values, keeping a filled value only where it matches the back-filled one, and finally restore the original values everywhere else with Series.fillna:

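for v in ['B', 'C']:
    # rows equal to the current value
    m1 = df['col2'].eq(v)
    # rows that belong to a run of length > 1
    m2 = m1.ne(m1.shift()).cumsum().duplicated(keep=False)
    # keep only isolated occurrences of v, NaN elsewhere
    s = df['col2'].where(m1 & ~m2)
    ff = s.ffill()
    # keep filled values only where forward and backward fill agree
    df['col2'] = ff.where(ff == s.bfill()).fillna(df['col2'])
print(df)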
    col1 col2
0      1    A
1      2    A
2      3    A
3      4    B
4      5    B
5      6    B
6      7    B
7      8    A
8      9    A
9     10    A
10    11    C
11    12    C
12    13    A
13    14    A
14    15    C
15    16    C
16    17    C
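
To make the intermediate steps easier to follow, here is a minimal sketch of the same loop as a standalone script. The helper name `fill_between_isolated` and the intermediate variable names are made up for illustration, and the sample frame is rebuilt from the question; the pandas calls are the ones used above.

```python
import pandas as pd

# Sample frame from the question (col2 spelled out as one string for brevity).
df = pd.DataFrame({"col1": range(1, 18), "col2": list("AAABAABAAACCAACAC")})


def fill_between_isolated(col: pd.Series, value: str) -> pd.Series:
    """Hypothetical helper: one pass of the loop above for a single value."""
    is_value = col.eq(value)
    # Label each run of equal flags; a label seen more than once
    # belongs to a run of length > 1.
    run_id = is_value.ne(is_value.shift()).cumsum()
    in_long_run = run_id.duplicated(keep=False)
    # Keep only isolated occurrences of `value`; everything else becomes NaN.
    markers = col.where(is_value & ~in_long_run)
    forward = markers.ffill()
    backward = markers.bfill()
    # A row is filled only if it has an isolated marker both before and after it
    # (NaN never compares equal, so rows outside the markers stay NaN here).
    filled = forward.where(forward == backward)
    # Restore the original values wherever nothing was filled.
    return filled.fillna(col)


result = df["col2"]
for v in ["B", "C"]:
    result = fill_between_isolated(result, v)

print("".join(result))
```

One caveat, as far as I can tell: because the comparison only checks that an isolated marker exists on both sides, every row between the first and the last isolated marker of a value is filled. With two separate isolated pairs of B, the gap between the pairs would be filled as well, so this sketch assumes at most one isolated pair per value, as in the sample data.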

1 Comment

this is a bit cumbersome

You only need to select the rows where the cumulative count of matches (Series.cumsum) is odd, then overwrite them with Series.mask:

for v in ['B', 'C']:
    # an odd running count means a pair has been opened but not yet closed
    mask = (df['col2'].eq(v).cumsum() % 2) == 1
    df['col2'] = df['col2'].mask(mask, v)
print(df)

    col1 col2
0     1    A 
1     2    A 
2     3    A 
3     4    B 
4     5    B 
5     6    B 
6     7    B 
7     8    A 
8     9    A 
9    10    A 
10   11    C 
11   12    C 
12   13    A 
13   14    A 
14   15    C 
15   16    C 
16   17    C
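
For anyone who wants to run this end to end, here is a self-contained sketch of the parity idea. The helper name `fill_pairs` is hypothetical and the sample frame is rebuilt from the question; it relies on the question's statement that each value occurs an even number of times.

```python
import pandas as pd

# Sample frame from the question (col2 spelled out as one string for brevity).
df = pd.DataFrame({"col1": range(1, 18), "col2": list("AAABAABAAACCAACAC")})


def fill_pairs(col: pd.Series, value: str) -> pd.Series:
    """Hypothetical helper: fill from each odd-numbered `value` up to its partner."""
    # Running count of occurrences of `value`, up to and including each row.
    count = col.eq(value).cumsum()
    # Odd count: an opening marker has been seen and its partner has not.
    inside = count % 2 == 1
    # Overwrite the rows inside an open pair; the closing marker already holds `value`.
    return col.mask(inside, value)


result = df["col2"]
for v in ["B", "C"]:
    result = fill_pairs(result, v)

print("".join(result))  # AAABBBBAAACCAACCC
```

Because the count resets to even at every closing marker, several separate pairs of the same value are handled independently. If a value ever occurred an odd number of times, the count would stay odd after the last occurrence and every remaining row would be overwritten, so the even-count assumption matters.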
