Filter string data based on its string length

Question

I like to filter out data whose string length is not equal to 10.

If I try to filter out any row whose column A's or B's string length is not equal to 10, I tried this.

df=pd.read_csv('filex.csv')
df.A=df.A.apply(lambda x: x if len(x)== 10 else np.nan)
df.B=df.B.apply(lambda x: x if len(x)== 10 else np.nan)
df=df.dropna(subset=['A','B'], how='any')

This works slow, but is working.

However, it sometimes produce error when the data in A is not a string but a number (interpreted as a number when read_csv read the input file):

File "<stdin>", line 1, in <lambda>
TypeError: object of type 'float' has no len()

I believe there should be more efficient and elegant code instead of this.

Based on the answers and comments below, the simplest solution I found are:

df=df[df.A.apply(lambda x: len(str(x))==10]
df=df[df.B.apply(lambda x: len(str(x))==10]

or

df=df[(df.A.apply(lambda x: len(str(x))==10) & (df.B.apply(lambda x: len(str(x))==10)]

or

df=df[(df.A.astype(str).str.len()==10) & (df.B.astype(str).str.len()==10)]

unutbu · Accepted Answer · 2013-11-12 21:27:17Z

209

import pandas as pd

df = pd.read_csv('filex.csv')
df['A'] = df['A'].astype('str')
df['B'] = df['B'].astype('str')
mask = (df['A'].str.len() == 10) & (df['B'].str.len() == 10)
df = df.loc[mask]
print(df)

Applied to filex.csv:

A,B
123,abc
1234,abcd
1234567890,abcdefghij

the code above prints

            A           B
2  1234567890  abcdefghij

edited Nov 12, 2013 at 21:27

answered Nov 12, 2013 at 20:38

unutbu

886k197 gold badges1.9k silver badges1.7k bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

agfe2 Over a year ago

astype('str') is the key point. And for sure, before any column data type change. For check data types in the dataframe, print(df.dtypes) is helpful.

rachwa · Accepted Answer · 2022-05-04 21:06:17Z

39

A more Pythonic way of filtering out rows based on given conditions of other columns and their values:

Assuming a df of:

data = {
    "names": ["Alice", "Zac", "Anna", "O"],
    "cars": ["Civic", "BMW", "Mitsubishi", "Benz"],
    "age": ["1", "4", "2", "0"],
}

df=pd.DataFrame(data)
df:
  age        cars  names
0   1       Civic  Alice
1   4         BMW    Zac
2   2  Mitsubishi   Anna
3   0        Benz      O

Then:

df[
    df["names"].apply(lambda x: len(x) > 1)
    & df["cars"].apply(lambda x: "i" in x)
    & df["age"].apply(lambda x: int(x) < 2)
]

We will have :

  age   cars  names
0   1  Civic  Alice

In the conditions above we are looking first at the length of strings, then we check whether a letter "i" exists in the strings or not, finally, we check for the value of integers in the first column.

edited May 4, 2022 at 21:06

rachwa

2,3901 gold badge21 silver badges20 bronze badges

answered Jul 13, 2017 at 4:02

Mahdi Ghelichi

1,17014 silver badges23 bronze badges

Comments

Julian · Accepted Answer · 2020-05-05 16:57:06Z

22

I personally found this way to be the easiest:

df['column_name'] = df[df['column_name'].str.len()!=10]

edited May 5, 2020 at 16:57

Julian

37.3k24 gold badges138 silver badges196 bronze badges

answered May 5, 2020 at 16:36

spongebob

2312 silver badges4 bronze badges

1 Comment

Smittie Over a year ago

while this seems to be more convenient at first it is slower than using apply, especially on bigger data sets.

rachwa · Accepted Answer · 2022-05-03 21:02:58Z

9

You can also use query:

df.query('A.str.len() == 10 & B.str.len() == 10')

answered May 3, 2022 at 21:02

rachwa

2,3901 gold badge21 silver badges20 bronze badges

Comments

cottontail · Accepted Answer · 2022-07-05 17:45:25Z

For string operations such as this, vanilla Python using built-in methods (without lambda) is much faster than apply() or str.len().

Building a boolean mask by mapping len to each string inside a list comprehension is approx. 40-70% faster than apply() and str.len() respectively.

For multiple columns, zip() allows to evaluate values from different columns concurrently.

col_A_len = map(len, df['A'].astype(str))
col_B_len = map(len, df['B'].astype(str))
m = [a==3 and b==3 for a,b in zip(col_A_len, col_B_len)]
df1 = df[m]

For a single column, drop zip() and loop over the column and check if the length is equal to 3:

df2 = df[[a==3 for a in map(len, df['A'].astype(str))]]

This code can be written a little concisely using the Series.map() method (but a little slower than list comprehension due to pandas overhead):

df2 = df[df['A'].astype(str).map(len)==3]

przemo_li · Accepted Answer · 2013-11-12 18:54:05Z

1

If You have numbers in rows, then they will convert as floats.

Convert all the rows to strings after importing from cvs. For better performance split that lambdas into multiple threads.

answered Nov 12, 2013 at 18:54

przemo_li

4,0734 gold badges37 silver badges62 bronze badges

Comments

Vishal Suryavanshi · Accepted Answer · 2019-02-27 21:08:49Z

1

you can use df.apply(len) . it will give you the result

answered Feb 27, 2019 at 21:08

Vishal Suryavanshi

3771 gold badge4 silver badges16 bronze badges

Comments

Akoffice · Accepted Answer · 2020-06-17 03:22:33Z

1

Filter out values other than length of 10 from column A and B, here i pass lambda expression to map() function. map() function always applies in Series Object.

 df = df[df['A'].map(lambda x: len(str(x)) == 10)]
 df = df[df['B'].map(lambda x: len(str(x)) == 10)]

edited Jun 17, 2020 at 3:22

answered Jun 16, 2020 at 13:18

Akoffice

3912 silver badges6 bronze badges

Comments

Nathan Dos Santos Nunes · Accepted Answer · 2021-01-25 19:12:14Z

0

You could use applymap to filter all columns you want at once, followed by the .all() method to filter only the rows where both columns are True.

#The *mask* variable is a dataframe of booleans, giving you True or False for the selected condition
mask = df[['A','B']].applymap(lambda x: len(str(x)) == 10)

#Here you can just use the mask to filter your rows, using the method *.all()* to filter only rows that are all True, but you could also use the *.any()* method for other needs
df = df[mask.all(axis=1)]

edited Jan 25, 2021 at 19:12

answered Jan 25, 2021 at 18:05

Nathan Dos Santos Nunes

572 bronze badges

Collectives™ on Stack Overflow

Filter string data based on its string length

9 Answers 9

1 Comment

Comments

1 Comment

Comments

Comments

Comments

Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

9 Answers 9

1 Comment

Comments

1 Comment

Comments

Comments

Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related