Using Regex Operators in Python/Pandas to Count Data Entries Conditionally

Question

Using the pandas library in Python, I have a line in my code that looks like this:

BadData = len(df[df.A1.str.contains('A|T|C|G')==False])

What I'm trying to do here is count the number of entries in the A1 column of the dataframe df that do not contain any combination of the letters A, T, C, and G.

These expressions should be counted as BadData:

123
<%*&
foo

But these expressions should not:

A
ATCG
GATCATTA

My question: how could I use regex characters to include entries like "Apple" or "Golfing" in BadData?

I could chain together conditions like so:

BadData = len(df[(df.A1.str.contains('A|T|C|G')==False) & (df.A1.str.contains('0|1|2|3')==True)])

But here I face a difficulty: do I have to define every character that violates the condition? This seems clumsy, and I am sure there is a more elegant way.

sacuL · Accepted Answer · 2018-11-08 00:14:44Z


You can use:

df['A1'].str.contains('^[ACTG]+$')

The ^ and $ anchor the match to the start and end of the string, so the whole entry must consist of one or more characters from ACTG and nothing else.
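To see why the anchors matter, here is a minimal sketch using Python's standard re module. The sample strings come from the question; the names UNANCHORED, ANCHORED and samples are only illustrative. Since str.contains searches anywhere in the string, the unanchored pattern from the question accepts any entry that has at least one of the four letters:

```python
import re

UNANCHORED = re.compile(r'A|T|C|G')    # pattern from the question
ANCHORED = re.compile(r'^[ACTG]+$')    # pattern from the answer

samples = ['Apple', 'Golfing', 'ATCG', '123', 'foo']

# re.search looks for a match anywhere in the string, like str.contains
unanchored_hits = [s for s in samples if UNANCHORED.search(s)]
anchored_hits = [s for s in samples if ANCHORED.search(s)]

print(unanchored_hits)  # ['Apple', 'Golfing', 'ATCG']
print(anchored_hits)    # ['ATCG']
```

One caveat: in Python, $ also matches just before a trailing newline, so an entry like 'ACGT\n' would still pass. If that matters for your data, re.fullmatch with the pattern [ACTG]+ is a stricter option.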

To get the count, you can negate the boolean mask and sum it, which counts the False values:

bad_data = sum(~df['A1'].str.contains('^[ACTG]+$'))

Which is equivalent to:

bad_data = len(df[df.A1.str.contains('^[ACTG]+$')==False])

But IMO nicer to read.

For example:

>>> df
             A1
0         Apple
1       Golfing
2             A
3          ATTC
4          ACGT
5         AxTCG
6           foo
7             %
8  ACT Golf GTC
9           ACT


>>> df['A1'].str.contains('^[ACTG]+$')
0    False
1    False
2     True
3     True
4     True
5    False
6    False
7    False
8    False
9     True
Name: A1, dtype: bool

bad_data = sum(~df['A1'].str.contains('^[ACTG]+$'))
# 6
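For anyone who wants to run this end to end, the example above could be reproduced roughly like this. It is a sketch, not a definitive version: the is_valid name and the na=False argument are my additions, on the assumption that missing values should be counted as bad data.

```python
import pandas as pd

# Same sample column as in the example above
df = pd.DataFrame({'A1': ['Apple', 'Golfing', 'A', 'ATTC', 'ACGT',
                          'AxTCG', 'foo', '%', 'ACT Golf GTC', 'ACT']})

# na=False makes missing values count as non-matching, so the mask
# stays purely boolean and can be negated safely
is_valid = df['A1'].str.contains(r'^[ACTG]+$', na=False)

# Negate the mask and sum it to count the entries that are not pure ACTG
bad_data = int((~is_valid).sum())
print(bad_data)  # 6
```

Without na=False, a NaN in the column would yield NaN in the mask, and negating it with ~ may raise or give unexpected results. Recent pandas versions also offer Series.str.fullmatch, which should make the ^ and $ anchors unnecessary.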

edited Nov 8, 2018 at 0:14

answered Nov 8, 2018 at 0:08


1 Comment

Timothy Over a year ago

Elegant and straightforward solution. Thank you! :)
