1

If you have a Pandas dataframe like this, filtering this way works:

df = pd.DataFrame({'name1': ['apple','pear','applepie','APPLE'],
                'name2': ['APPLE','PEAR','apple','APPLE']
            })
df[df['name1'] != df['name2']] # works

But how do you filter rows, when you want to compare the upper values of the columns?

df[df['name1'].upper() != df['name2'].upper()]  # does not work

3 Answers 3

4

You need to use pandas.Series.str.upper() as df['name1'] is a series of strings and hence we use .str string accessor for vectorized string manipulation.

df[df['name1'].str.upper() != df['name2'].str.upper()]

Output:

       name1    name2
2   applepie    apple
Sign up to request clarification or add additional context in comments.

Comments

1

For ASCII-only, check above :)

Just as an observation, following this very good answer from @Veedrac, if you want to compare case-insensitive for lots of rows in many languages, you might want to normalize and casefold the values first

df.col.str.normalize('NFKD').transform(str.casefold)

Example

df=pd.DataFrame({'t':['a','b','A', 'ê', 'ê', 'Ê', 'ß', 'ss']})

df.t.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False

and

df.t.str.lower().duplicated()
0    False
1    False
2     True
3    False
4    False
5     True
6    False
7    False

But

df.t.str.normalize('NFKD').transform(str.casefold).duplicated(keep=False)

0     True
1    False
2     True
3     True
4     True
5     True
6     True
7     True

Comments

1

Often times it can be faster to use list comprehensions when dealing with strings in pandas.

pd.DataFrame(
    [[i, j] for i, j in zip(df.name1, df.name2) if i.upper() != j.upper()],
    columns=df.columns
)

      name1  name2
0  applepie  apple

Some timings:

In [159]: df = pd.concat([df]*10000)

In [160]: %%timeit
     ...:     pd.DataFrame(
     ...:         [[i, j] for i, j in zip(df.name1, df.name2) if i.upper() != j.upper()]
     ...: ,
     ...:         columns=df.columns
     ...:     )
     ...:
14.2 ms ± 68.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [161]: %timeit df[df['name1'].str.upper() != df['name2'].str.upper()]
35.6 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.