Pandas DataFrame change a value based on column, index values comparison

Question

Suppose that you have a pandas DataFrame which has some kind of data in the body and numbers in the column and index names.

>>> import numpy as np
>>> import pandas as pd
>>> data = np.array([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']])
>>> columns = [2, 4, 8]
>>> index = [10, 4, 2]
>>> df = pd.DataFrame(data, columns=columns, index=index)
>>> df
    2  4  8
10  a  b  c
4   d  e  f
2   g  h  i

Now suppose we want to manipulate our DataFrame by comparing the index and column labels. Consider the following replacements, applied in sequence.

Where index is greater than column replace letter with 'k':

    2  4  8
10  k  k  k
4   k  e  f
2   g  h  i

Where index is equal to column replace letter with 'U':

    2  4  8
10  k  k  k
4   k  U  f
2   U  h  i

Where column is greater than index replace letter with 'Y':

    2  4  8
10  k  k  k
4   k  U  Y
2   U  Y  Y

To keep the question useful to all:

What is a fast way to do this replacement?
What is the simplest way to do this replacement?

Speed Results from minimal example

jezrael: 556 µs ± 66.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
user3471881: 329 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
thunderwood: 4.65 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Is this a duplicate? I searched google for pandas replace compare index column and the top results are:

Pandas - Compare two dataframes and replace values matching condition

Python pandas: replace values based on location not index value

Pandas DataFrame: replace all values in a column, based on condition

However, I don't feel any of these touch on a) whether this is possible or b) how to compare in such a way.

Maybe I'm stupid but your expected output doesn't match your conditions to me. For example: why don't you add a k to all rows where index is 10 and why is index 2 greater than column 2? And why do all rows for index 2 get the value Y?
– user3471881, Commented Nov 2, 2018 at 13:02
@user3471881 totally correct, think I got it fixed. They looked right when I looked it over... completely wrong though. Thanks +1
– akozi, Commented Nov 2, 2018 at 13:06

jezrael · Accepted Answer · 2018-11-02 14:40:44Z

9

I think you need numpy.select with broadcasting:

m1 = df.index.values[:, None] > df.columns.values
m2 = df.index.values[:, None] == df.columns.values


df = pd.DataFrame(np.select([m1, m2], ['k','U'], 'Y'), columns=df.columns, index=df.index)
print (df)
    2  4  8
10  k  k  k
4   k  U  Y
2   U  Y  Y
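
To make the broadcasting step concrete, here is a minimal sketch on the question's 3x3 frame. Indexing with `[:, None]` turns the index into a column vector of shape (3, 1), and comparing it against the column labels of shape (3,) yields a (3, 3) boolean mask. The names `idx`, `cols` and `result` are illustrative only and not part of the answer.

```python
import numpy as np
import pandas as pd

data = np.array([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']])
df = pd.DataFrame(data, columns=[2, 4, 8], index=[10, 4, 2])

idx = df.index.values[:, None]   # shape (3, 1): one index label per row
cols = df.columns.values         # shape (3,): one label per column

m1 = idx > cols                  # broadcasts to shape (3, 3)
m2 = idx == cols

# np.select takes 'k' where m1 holds, 'U' where m2 holds, else the default 'Y'
result = pd.DataFrame(np.select([m1, m2], ['k', 'U'], 'Y'),
                      columns=df.columns, index=df.index)
print(m1.shape)
print(result)
```

Because the masks are built from the labels alone, the index and columns do not need to have the same length.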

Performance:

import numpy as np
import pandas as pd

np.random.seed(1000)

N = 1000
a = np.random.randint(100, size=N)
b = np.random.randint(100, size=N)

df = pd.DataFrame(np.random.choice(list('abcdefgh'), size=(N, N)), columns=a, index=b)
#print (df)

def us(df):
    values = np.array(np.array([df.index]).transpose() - np.array([df.columns]), dtype='object')
    greater = values > 0
    less = values < 0
    same = values == 0

    values[greater] = 'k'
    values[less] = 'Y'
    values[same] = 'U'


    return pd.DataFrame(values, columns=df.columns, index=df.index)

def jez(df):

    m1 = df.index.values[:, None] > df.columns.values
    m2 = df.index.values[:, None] == df.columns.values
    return pd.DataFrame(np.select([m1, m2], ['k','U'], 'Y'), columns=df.columns, index=df.index)

In [236]: %timeit us(df)
107 ms ± 358 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [237]: %timeit jez(df)
64 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
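
The `%timeit` lines above are IPython magics. As a rough sketch of how a similar measurement could be taken in a plain Python script, the standard-library `timeit` module can be used. The smaller `N` and the choice to report the best of several repeats (rather than the mean and standard deviation that `%timeit` prints) are assumptions made here to keep the example short.

```python
import timeit

import numpy as np
import pandas as pd

def jez(df):
    m1 = df.index.values[:, None] > df.columns.values
    m2 = df.index.values[:, None] == df.columns.values
    return pd.DataFrame(np.select([m1, m2], ['k', 'U'], 'Y'),
                        columns=df.columns, index=df.index)

np.random.seed(1000)
N = 200  # assumption: smaller than the answer's N = 1000 to keep the run short
df = pd.DataFrame(np.random.choice(list('abcdefgh'), size=(N, N)),
                  columns=np.random.randint(100, size=N),
                  index=np.random.randint(100, size=N))

# Best of 5 repeats of 10 calls each, converted to seconds per call
per_call = min(timeit.repeat(lambda: jez(df), number=10, repeat=5)) / 10
print(f"jez: {per_call * 1e3:.3f} ms per call")
```

Absolute numbers depend on the machine and library versions, so they are only comparable between functions timed in the same session.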

edited Nov 2, 2018 at 14:40

answered Nov 2, 2018 at 13:03

jezrael


8 Comments

user3471881 Over a year ago

This crashes for me if columns and rows are of different length.

jezrael Over a year ago

@user3471881 - Thank you, added a solution that casts to numpy arrays with .values. It should also perform better on bigger DataFrames.

akozi Over a year ago

Would this method work if some of the original data was to be maintained? It seems like it is using the fact that all the points are changed here.

akozi Over a year ago

I'll give you the checkmark back for the same reason I gave it to user :)

akozi Over a year ago

When you use np.select all of the points not covered by m1 and m2 are filled by the value 'Y'. If the final filter replacing to 'Y' was not used could you show an example of how you would use your answer to also output the value 'f', 'h', and 'i'?
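
One possible way to address the question in the comment above, sketched here with `DataFrame.mask` rather than `np.select` (a substitution made for this example, not taken from the answer): `mask` replaces only the cells where the condition is True and leaves every other cell untouched, so dropping the 'Y' rule keeps 'f', 'h' and 'i'. The name `partial` is illustrative.

```python
import numpy as np
import pandas as pd

data = np.array([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']])
df = pd.DataFrame(data, columns=[2, 4, 8], index=[10, 4, 2])

m1 = df.index.values[:, None] > df.columns.values
m2 = df.index.values[:, None] == df.columns.values

# mask() replaces cells where the condition is True and keeps the rest
partial = df.mask(m1, 'k').mask(m2, 'U')
print(partial)
```

This reproduces the second intermediate table from the question, where only the 'k' and 'U' replacements have been applied.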


Thunderwood · Answer · 2018-11-02 13:09:07Z

2

Not sure about the fastest way to accomplish this, but an incredibly simple way would be to just iterate over the DataFrame like so:

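for i in df.index:
    for j in df.columns:
        if i > j:
            df.loc[i, j] = 'k'   # index label greater than column label
        elif j > i:
            df.loc[i, j] = 'Y'   # column label greater than index label
        else:
            df.loc[i, j] = 'U'   # labels are equal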

answered Nov 2, 2018 at 13:09

Thunderwood


4 Comments

user3471881 Over a year ago

This is ~8 times slower than using numpy as demonstrated by @jezrael.

Thunderwood Over a year ago

true, but the question asked for both the fastest, and the simplest way of doing things. This is so simple a total beginner that just picked up python this week could understand it.

user3471881 Over a year ago

I didn't mean it as a negative comment, just added it cuz OP asked.

Thunderwood Over a year ago

ah okay understood

user3471881 · Answer · 2018-11-02 15:11:26Z

1

1. Using np.arrays + np.select:

values = np.array(np.array([df.index]).transpose() - np.array([df.columns]))

greater = values > 0
same = values == 0

df = pd.DataFrame(np.select([greater, same], ['k', 'U'], 'Y'), columns=df.columns, index=df.index)

2. Using np.arrays + manual masking:

values = np.array(np.array([df.index]).transpose() - np.array([df.columns]), dtype='object')

greater = values > 0
less = values < 0
same = values == 0

values[greater] = 'k'
values[less] = 'Y'
values[same] = 'U'


df = pd.DataFrame(values, columns=df.columns, index=df.index)
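
As a small sanity check on the subtraction idea used in both variants, the sketch below prints the difference matrix for the question's frame and confirms that its sign carries the same information as comparing the labels directly. It assumes signed numeric labels: subtraction is not defined for string labels, and unsigned integer labels could wrap around.

```python
import numpy as np
import pandas as pd

data = np.array([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']])
df = pd.DataFrame(data, columns=[2, 4, 8], index=[10, 4, 2])

# Difference matrix: entry (r, c) is index[r] - columns[c]
values = np.array([df.index]).transpose() - np.array([df.columns])
print(values)

# Positive entries mean index > column, zero entries mean index == column
assert np.array_equal(values > 0, df.index.values[:, None] > df.columns.values)
assert np.array_equal(values == 0, df.index.values[:, None] == df.columns.values)
```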

edited Nov 2, 2018 at 15:11

answered Nov 2, 2018 at 13:59

user3471881


4 Comments

akozi Over a year ago

Thanks for the answer. What package are you using for speed? I was going to add different speeds to the OP, and I quite like the output yours gives.

akozi Over a year ago

I think I will give this the checkmark for now since it is faster and roughly the same complexity. The complexity is hard to quantify, but I don't think yours or Thunderwood's answer is any harder to read.

jezrael Over a year ago

@akozi - What is the size of your real DataFrame? What is the performance on your real data? Timings on a small sample can differ from those on a bigger df.

akozi Over a year ago

Would this work if not all of the points were changed from their original values? Personally, my data sets range around 500x500. Still small enough that the differences in speeds between the methods are not very noticeable.
