Python Pandas Dataframe drop columns if string contains special character

Question

I have a dataframe:

Product             Storage  Price
Azure     (2.4%     Server   £540
AWS                 Server   £640
GCP                 Server   £540

I would like to remove the column that contains the string '(2.4%'. However, I only want pandas to drop the column if a regex finds either a bracket or a percentage sign ('(' or '%') in a string in that column; in that case the whole column should be dropped.

Please can you help me find a way to use regex to search for special characters within a string and drop the column if that condition is met?

I've searched Stack Overflow and Google, and have tried the following so far:

df = df.drop([col for col in df.columns if df[col].eq('(%').any()], axis=1)

chars = '(%'
regex = f'[{"".join(map(re.escape, chars))}]'

df = df.loc[:, ~df.apply(lambda c: c.str.contains(regex).any())]

However, neither of these worked.

Any help would be greatly appreciated. :)

Thank You * Insert Smiley*

Please provide a Minimal, Reproducible Example stackoverflow.com/questions/20109391/…
– sayan dasgupta, Commented Sep 14, 2022 at 15:10

Mouad Slimane · Accepted Answer · 2022-09-14 15:24:22Z

1

You are using the eq function, which checks whether a value in the column is exactly equal to '(%'. Instead of eq, do this:

df.drop([col for col in df.columns if df[col].apply(lambda x:'(%' in str(x)).any()], axis=1,inplace=True)


2 Comments

user19795989 Over a year ago

Thank You for your response, this worked out of the box and was the neatest solution. :)

Eelco van Vliet Over a year ago

'(%' in '(2.4%' evaluates to False, so this won't drop the column
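To illustrate the comment above: the `in` operator tests for a contiguous substring, so '(%' is only found when the two characters sit next to each other. Below is a minimal sketch of a variant that checks each character separately. The sample frame, the column name "Perc", and the helper `has_special` are made up for the example and are not part of the original answer.

```python
import pandas as pd

# Hypothetical sample frame; "Perc" is an assumed column name.
df = pd.DataFrame({
    "Product": ["Azure", "AWS", "GCP"],
    "Perc": ["(2.4%", None, None],
    "Storage": ["Server", "Server", "Server"],
    "Price": ["£540", "£640", "£540"],
})

# Substring test: "(%" never appears as two adjacent characters in "(2.4%".
substring_hit = "(%" in "(2.4%"

# Character-wise test: true if "(" or "%" occurs anywhere in the value.
def has_special(value, chars="(%"):
    return any(ch in str(value) for ch in chars)

to_drop = [col for col in df.columns if df[col].apply(has_special).any()]
result = df.drop(columns=to_drop)

print(substring_hit)         # False
print(to_drop)               # ['Perc']
print(list(result.columns))  # ['Product', 'Storage', 'Price']
```

Converting with `str(value)` keeps missing entries from raising an error, at the cost of a Python-level loop over every cell.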

Eelco van Vliet · Accepted Answer · 2022-09-14 15:23:51Z

1

I would do something like this

import pandas as pd
from io import StringIO

text = """
Product,Perc,Storage,Price
Azure,(2.4%,Server,£540
AWS,,Server,£640
GCP,,Server,£540
"""
data = pd.read_csv(StringIO(text))
print(data)

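# Collect every column in which at least one value contains "(" or "%".
drop_columns = []
for col_name in data.columns:
    # The raw string avoids an invalid-escape warning; na=False treats NaN as "no match".
    has_special_characters = data[col_name].str.contains(r"[(%]", na=False)
    if has_special_characters.any():
        drop_columns.append(col_name)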

print(f"Dropping {drop_columns}")
data.drop(drop_columns, axis=1, inplace=True)
print(data)

Output of the script is:

  Product   Perc Storage Price
0   Azure  (2.4%  Server  £540
1     AWS    NaN  Server  £640
2     GCP    NaN  Server  £540
Dropping ['Perc']
  Product Storage Price
0   Azure  Server  £540
1     AWS  Server  £640
2     GCP  Server  £540



5 Comments

user19795989 Over a year ago

Thank you for your response and for taking the time to help me find a solution! :)

Eelco van Vliet Over a year ago

No problem. By the way, the answer by to_data is not correct. He should have used a regular expression; his expression evaluates to False. I tried it, and the column containing (2.4% is not dropped. Otherwise it is identical to my answer, except that he uses a one-liner, which generally leads to code that is harder to read.

Eelco van Vliet Over a year ago

And I don't want to be a spoilsport, but the answer by Scaro974 is also not correct if you have missing values in your data frame (as is the case when you read the data from a csv file using read_csv, which represents them as NaN). The missing values will raise a TypeError in re.search. Moreover, iterating over the elements of a column is generally much slower than the built-in string evaluation on the whole column, as I did. So I come to the conclusion that I have given the only correct answer :-)

user19795989 Over a year ago

Hi Eelco, the only problem with this solution is that I have a predefined list with many elements (2,000+), and when I create a DataFrame the column holding the 2.4% doesn't have a column header. So I can't predefine a name for that column, unless I first rename it by position in pandas to Perc.

Eelco van Vliet Over a year ago

I gave the name 'Perc' in the header of the 'csv file', but I don't use it in the script, because there it gets assigned to the variable 'col_name'. If I had used an empty field in the csv header, read_csv would have automatically assigned the name 'Unnamed: 1' to it. If you have an empty string as a column name, you can still use this script. You only have to be careful that you don't have multiple columns with the same name. It is better to assign names to all your columns.
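The point about unnamed columns can be sketched as follows, assuming the same data as in the answer but with the second header field left empty. The `apply`-based mask is a compact alternative to the loop, not the answer's own code, and `astype(str)` is an added precaution so that non-text columns do not break the `.str` accessor.

```python
import pandas as pd
from io import StringIO

# Same data as in the answer, but the second header field is left empty.
text = """Product,,Storage,Price
Azure,(2.4%,Server,£540
AWS,,Server,£640
GCP,,Server,£540
"""
data = pd.read_csv(StringIO(text))
auto_names = list(data.columns)

# One boolean per column; the column name is never hard-coded.
mask = data.apply(lambda col: col.astype(str).str.contains(r"[(%]").any())
cleaned = data.loc[:, ~mask]

print(auto_names)             # ['Product', 'Unnamed: 1', 'Storage', 'Price']
print(list(cleaned.columns))  # ['Product', 'Storage', 'Price']
```

Because the mask is computed for every column, the approach does not depend on knowing which position or name the offending column has.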

Scaro974 · Accepted Answer · 2022-09-14 15:26:06Z

1

You can try this (I guess the name of the column you want to drop is ""):

import re

change_col = False
for elem in df[""]:
    if re.search(r'[(%]', elem):
        change_col = True

if change_col:
    df = df.drop("", axis=1)


2 Comments

user19795989 Over a year ago

Thank you for your response and for taking the time to help me find a solution, this worked and was the second neatest solution! :)

Eelco van Vliet Over a year ago

The empty elements in your column will raise a TypeError, you cannot assume the empty values are empty strings (normally these are represented by nan values)
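A minimal sketch of how the loop from this answer could be guarded against the TypeError mentioned in the comment, assuming the column really is named "" and that missing entries are NaN. The sample frame is hypothetical.

```python
import re
import pandas as pd

# Hypothetical frame with an empty-string column name, as assumed in the answer.
df = pd.DataFrame({
    "Product": ["Azure", "AWS", "GCP"],
    "": ["(2.4%", float("nan"), float("nan")],
    "Price": ["£540", "£640", "£540"],
})

pattern = re.compile(r"[(%]")

# Only test real strings; NaN (a float) is skipped instead of raising TypeError.
change_col = any(
    isinstance(elem, str) and pattern.search(elem) is not None
    for elem in df[""]
)

if change_col:
    df = df.drop("", axis=1)

print(list(df.columns))  # ['Product', 'Price']
```

This keeps the element-wise style of the answer; the vectorised `.str.contains` approach avoids the Python-level loop.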
