6

Suppose that you have a pandas DataFrame which has some kind of data in the body and numbers in the column and index names.

>>> data=np.array([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']])
>>> columns = [2, 4, 8]
>>> index = [10, 4, 2]
>>> df = pd.DataFrame(data, columns=columns, index=index)
>>> df
    2  4  8
10  a  b  c
4   d  e  f
2   g  h  i

Now suppose we want to manipulate are data frame in some kind of way based on comparing the index and columns. Consider the following.

Where index is greater than column replace letter with 'k':

    2  4  8
10  k  k  k
4   k  e  f
2   g  h  i

Where index is equal to column replace letter with 'U':

    2  4  8
10  k  k  k
4   k  U  f
2   U  h  i

Where column is greater than index replace letter with 'Y':

    2  4  8
10  k  k  k
4   k  U  Y
2   U  Y  Y

To keep the question useful to all:

  • What is a fast way to do this replacement?

  • What is the simplest way to do this replacement?

Speed Results from minimal example

  • jezrael: 556 µs ± 66.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

  • user3471881: 329 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

  • thunderwood: 4.65 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Is this a duplicate? I searched google for pandas replace compare index column and the top results are:

Pandas - Compare two dataframes and replace values matching condition

Python pandas: replace values based on location not index value

Pandas DataFrame: replace all values in a column, based on condition

However, I don't feel any of these touch on whether this a) possible or b) how to compare in such a way

3
  • 2
    Maybe I'm stupid but your expected output doesn't match your conditions to me. For example: why don't you add a k to all rows where index is 10 and why is index 2 greater than column 2? And why does all rows for index 2 get the value Y? Commented Nov 2, 2018 at 13:02
  • 2
    @user3471881 totally correct, think I got it fixed. They looked right when I looked it over... completely wrong though. Thanks +1 Commented Nov 2, 2018 at 13:06
  • 1
    I updated my code a lil bit so should be quicker now. Commented Nov 2, 2018 at 14:15

3 Answers 3

9

I think you need numpy.select with broadcasting:

m1 = df.index.values[:, None] > df.columns.values
m2 = df.index.values[:, None] == df.columns.values


df = pd.DataFrame(np.select([m1, m2], ['k','U'], 'Y'), columns=df.columns, index=df.index)
print (df)
    2  4  8
10  k  k  k
4   k  U  Y
2   U  Y  Y

Performance:

np.random.seed(1000)

N = 1000
a = np.random.randint(100, size=N)
b = np.random.randint(100, size=N)

df = pd.DataFrame(np.random.choice(list('abcdefgh'), size=(N, N)), columns=a, index=b)
#print (df)

def us(df):
    values = np.array(np.array([df.index]).transpose() - np.array([df.columns]), dtype='object')
    greater = values > 0
    less = values < 0
    same = values == 0

    values[greater] = 'k'
    values[less] = 'Y'
    values[same] = 'U'


    return pd.DataFrame(values, columns=df.columns, index=df.index)

def jez(df):

    m1 = df.index.values[:, None] > df.columns.values
    m2 = df.index.values[:, None] == df.columns.values
    return pd.DataFrame(np.select([m1, m2], ['k','U'], 'Y'), columns=df.columns, index=df.index)

In [236]: %timeit us(df)
107 ms ± 358 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [237]: %timeit jez(df)
64 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Sign up to request clarification or add additional context in comments.

8 Comments

This crashes for me if columns and rows are of different length.
@user3471881 - Thank you, added solution - casting to numpy arrays by .values. Also it should be better performance in bigger DataFrame.
Would this method work if some of the original data was to be maintained? It seems like it is using the fact that all the points are changed here.
I'll give you the checkmark back for the same reason I gave it to user :)
When you use np.select all of the points not covered by m1 and m2 are filled by the value 'Y'. If the final filter replacing to 'Y' was not used could you show an example of how you would use your answer to also output the value 'f', 'h', and 'i'?
|
2

Not sure about the fastest way to accomplish this but an incredibly simple way would be to just iterate over the dataframe like such:

for i in df.index:
    for j in df.columns:
        if i>j:
            df.loc[i,j]='k'
        elif j>i:
            df.loc[i,j]='y'
        else:
            df.loc[i,j]='u'

4 Comments

This is ~8 times slower than using numpy as demonstrated by @jezrael.
true, but the question asked for both the fastest, and the simplest way of doing things. This is so simple a total beginner that just picked up python this week could understand it.
I didn't mean it as a negative comment, just added it cuz OP asked.
ah okay understood
1

1. Using np.arrays + np.select:

values = np.array(np.array([df.index]).transpose() - np.array([df.columns]))

greater = values > 0
same = values == 0

df = pd.DataFrame(np.select([greater, same], ['k', 'U'], 'Y'), columns=df.columns, index=df.index)

2. Using np.arrays and manual masking.

values = np.array(np.array([df.index]).transpose() - np.array([df.columns]), dtype='object')

greater = values > 0
less = values < 0
same = values == 0

values[greater] = 'k'
values[less] = 'Y'
values[same] = 'U'


df = pd.DataFrame(values, columns=df.columns, index=df.index)

4 Comments

Thanks for the answer. What package are you using for speed? I was going to add different speeds to the OP, and I quite like the output yours gives.
I think I will give this the checkmark for now since it is faster and roughly the same complexity. Hard to quantify the complexity I don't think yours or Thunderwood's answer is any harder to read.
@akozi - What is size of your real DataFrame? What is performance in your real data? Because testing in small data sample should be different like in bigger df.
Would this work if not all of the points were changed from their original. Personally, my data sets range around 500x500. Still small enough that the differences in speeds between the methods are not very noticeable.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.