I have a pandas DataFrame with a column of 3-character digit strings such as '001', '010' and '121'. I would like to strip the leading zeros, so that values such as '001' and '010' become just '1' and '10'.

How can I do this? I tried using the apply method (see below) but nothing changes.

df_ZIPCOUNTY_CA is the pandas DataFrame and 'county code' is the column that holds these digit strings.

df_ZIPCOUNTY_CA[df_ZIPCOUNTY_CA['county code'].str.startswith('0')]['county codes'] = df_ZIPCOUNTY_CA[df_ZIPCOUNTY_CA['county code'].str.startswith('0')]['county code'].apply(lambda x: x.split('0')[1])

3 Answers


Use str.replace to remove leading zeros:

df_ZIPCOUNTY_CA['county code']

#0    010
#1    001
#2    121
#Name: county code, dtype: object

df_ZIPCOUNTY_CA['county code'].str.replace('^0+', '', regex=True)

#0     10
#1      1
#2    121
#Name: county code, dtype: object

^0+ is a regular expression: ^ matches the beginning of the string, 0 matches a literal 0, and + is a quantifier meaning "one or more". Together, ^0+ matches the run of zeros at the start of the string.
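As a minimal sketch of the regex approach end to end, the snippet below builds a small toy frame (the data is made up to mirror the question) and assigns the result back to the column, since the .str methods return a new Series rather than modifying the DataFrame in place. It passes regex=True explicitly on the assumption that the installed pandas may treat the pattern as a literal string by default, as newer versions do.

```python
import pandas as pd

# Toy data mirroring the question; the real frame is assumed to look similar.
df = pd.DataFrame({'county code': ['010', '001', '121']})

# Strip the run of zeros at the start of each string and assign the result
# back to the column. regex=True makes '^0+' a regular expression rather
# than a literal substring.
df['county code'] = df['county code'].str.replace(r'^0+', '', regex=True)

print(df['county code'].tolist())
```

Assigning to the whole column avoids the chained indexing in the question (df[mask]['col'] = ...), which writes to a temporary copy and leaves the original frame unchanged.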

Here is a quick timing comparison of the two approaches (str.replace versus converting to int and back to str).

import pandas as pd

df_ZIPCOUNTY_CA = pd.DataFrame([['010'], ['001'], ['121']], columns=['county code'])

df_ZIPCOUNTY_CA = pd.concat([df_ZIPCOUNTY_CA] * 10000)

%timeit df_ZIPCOUNTY_CA['county code'].str.replace('^0+', '', regex=True)
# 10 loops, best of 3: 37.1 ms per loop

%timeit df_ZIPCOUNTY_CA['county code'].astype(int).astype(str)
# 10 loops, best of 3: 70.8 ms per loop

Or, as @Bill commented, you can simply use str.lstrip, which is the fastest approach here:

%timeit df_ZIPCOUNTY_CA['county code'].str.lstrip('0')
# 100 loops, best of 3: 8.9 ms per loop

# added the map str approach for comparison as well
%timeit df_ZIPCOUNTY_CA['county code'].astype(int).map(str)
# 100 loops, best of 3: 13.3 ms per loop
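One edge case worth checking before adopting str.lstrip('0'): a string made up entirely of zeros is stripped down to an empty string, whereas the int round trip would give '0'. The sketch below illustrates this with a made-up '000' entry (not part of the original data) and shows one possible way to restore a single '0'; whether that is the right fallback depends on what the codes mean.

```python
import pandas as pd

# '000' is a hypothetical value added only to show the edge case.
codes = pd.Series(['010', '001', '121', '000'])

# lstrip removes every leading '0', so '000' becomes ''.
stripped = codes.str.lstrip('0')

# Keep the stripped value where it is non-empty, otherwise fall back to '0'.
fixed = stripped.where(stripped != '', '0')

print(stripped.tolist())
print(fixed.tolist())
```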

7 Comments

thank you. Can you please let me know what '^0+' is and where I can learn more about it?
Explanation added. You can look into the regex info if you want to learn more about regex.
Would be simpler to use str.lstrip() no?
@Bill Yes, you are right. That could be the best approach here.
Found that the list comprehension [x.lstrip('0') for x in df.A] is the fastest so far.

You can convert your series to int and then to str.

df_ZIPCOUNTY_CA['county code'] = df_ZIPCOUNTY_CA['county code'].astype(int).astype(str)

Example

df = pd.DataFrame({'A': ['001', '010', '100']})

df['A'] = df['A'].astype(int).map(str)

print(df)

#      A
# 0    1
# 1   10
# 2  100
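The int round trip assumes every value is a valid integer string. As a hedged illustration, the sketch below uses a made-up non-numeric code ('06A', not from the original data) to show that astype(int) raises a ValueError in that case, while a string-based strip still works.

```python
import pandas as pd

# '06A' is a hypothetical non-numeric code used only for illustration.
codes = pd.Series(['010', '06A', '121'])

try:
    converted = codes.astype(int).astype(str)
except ValueError as err:
    # int() cannot parse '06A', so the whole conversion fails.
    converted = None
    print(f"astype(int) failed: {err}")

# A string-based strip does not care whether the value is numeric.
stripped = codes.str.lstrip('0')
print(stripped.tolist())
```

If the column is known to contain only digits, the int conversion is fine; otherwise the string methods are the safer choice.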

Performance benchmarking

df = pd.DataFrame({'A': ['001', '010', '100']})

df = pd.concat([df]*10000, ignore_index=True)

%timeit df['A'].astype(int).map(str)    # 21.6 ms
%timeit df['A'].str.replace('^0+', '', regex=True)  # 77.2 ms


By using to_numeric (note that this returns integers, not strings):

pd.to_numeric(df.A)
Out[66]: 
0      1
1     10
2    100
Name: A, dtype: int64

Or using Python's built-in str.lstrip in a list comprehension (not the pandas str.lstrip accessor):

[x.lstrip('0') for x in df.A]

Timing: somewhat surprisingly, the list comprehension is faster.

%timeit [x.lstrip('0') for x in df.A]
100 loops, best of 3: 5.26 ms per loop
%timeit df['A'].str.lstrip('0')
100 loops, best of 3: 10 ms per loop
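The %timeit magic only works inside IPython or Jupyter. As a rough sketch, the same comparison can be reproduced in a plain Python script with the standard timeit module; the helper function names below are made up for this example, and the numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'A': ['001', '010', '100']})
df = pd.concat([df] * 10000, ignore_index=True)


def with_comprehension():
    # Plain Python str.lstrip applied element by element.
    return [x.lstrip('0') for x in df['A']]


def with_pandas_str():
    # Vectorized-looking pandas string accessor.
    return df['A'].str.lstrip('0')


# Both approaches should produce the same values.
assert with_comprehension() == with_pandas_str().tolist()

runs = 20
for fn in (with_comprehension, with_pandas_str):
    seconds = timeit.timeit(fn, number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1000:.2f} ms per call")
```

Note that the list comprehension returns a plain list, so it still has to be assigned back to the column (for example, df['A'] = with_comprehension()) to change the DataFrame.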
