I have a pandas DataFrame with a column called Zip Code. The column has the object dtype, and some rows are not in a proper zip code format. I would like to remove the rows whose zip code is not in the ##### (five-digit) format.

   Subscriber Type  Zip Code
0  Subscriber       94040
1  Customer         11231
2  Customer         11231
3  Customer         32
4  Customer         nil

What would be an easy way to do so? Is there a way to compare each record against the format, something like this? df.drop(df['Zip Code'] != #####)

  • Why don't you do df=df[df['Zip Code']!=#####]? Commented Jul 26, 2016 at 17:49

1 Answer

Try this; it keeps only the rows whose Zip Code consists of exactly five digits:

In [23]: df = df[df['Zip Code'].str.contains(r'^\d{5}$')]

In [24]: df
Out[24]:
  Subscriber Type Zip Code
0      Subscriber    94040
1        Customer    11231
2        Customer    11231

Explanation: str.contains returns a boolean Series, which is then used as a mask to select the matching rows:

In [22]: df['Zip Code'].str.contains(r'^\d{5}$')
Out[22]:
0     True
1     True
2     True
3    False
4    False
Name: Zip Code, dtype: bool
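
As a self-contained sketch of the same filter, the snippet below rebuilds the sample data and applies the mask. The extra row with a missing value and the `na=False` argument are additions for illustration and are not part of the original data; without `na=False`, a missing value yields a missing entry in the mask rather than False, which can make the boolean indexing fail.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical row with a missing zip code.
df = pd.DataFrame({
    "Subscriber Type": ["Subscriber", "Customer", "Customer",
                        "Customer", "Customer", "Customer"],
    "Zip Code": ["94040", "11231", "11231", "32", "nil", None],
})

# True only where the value is exactly five digits;
# na=False treats missing values as non-matching.
mask = df["Zip Code"].str.contains(r"^\d{5}$", na=False)

# Boolean indexing keeps the rows where the mask is True.
clean = df[mask]
print(clean)
```

If the column may also hold non-string values such as integers, converting it first (for example with `astype(str)`) is one option, though that is an assumption about the data rather than something shown in the question.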

PS: Thanks to @Alberto Garcia-Raboso for the refined regex!

3 Comments

Works perfectly, Thanks!
r'\d{5}' gives false positives (for example: 11231asdf, asdf11231, as11231df). You want a more stringent regex: r'^\d{5}$'
@AlbertoGarcia-Raboso, thank you! I've updated my answer
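
The false positives mentioned in the comments can be reproduced with a small sketch like the one below. The sample strings are taken from the comment; the variable names are made up for illustration. `Series.str.fullmatch` (available in pandas 1.1 and later, as far as I know) is shown as an alternative that needs no explicit anchors.

```python
import pandas as pd

# Strings from the comment, plus a valid and a too-short value.
s = pd.Series(["11231", "11231asdf", "asdf11231", "as11231df", "32"])

# Unanchored pattern: matches any value that contains five digits in a row.
loose = s.str.contains(r"\d{5}")

# Anchored pattern: the whole value must be exactly five digits.
strict = s.str.contains(r"^\d{5}$")

# fullmatch requires the entire string to match, so no anchors are needed.
full = s.str.fullmatch(r"\d{5}")

print(pd.DataFrame({"value": s, "loose": loose, "strict": strict, "full": full}))
```

Only the first value passes the anchored and fullmatch checks, while the unanchored pattern also accepts the three strings with extra characters.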
