Formatting String Digits in Python Pandas

Question

I have a pandas DataFrame with a column of 3-character digit strings such as '001', '010' and '121'. I would like to strip the leading zeros, so that strings such as '001' and '010' become just '1' and '10'.

How can I do this? I tried using the apply method (see below) but nothing changes.

df_ZIPCOUNTY_CA is the pandas DataFrame and 'county code' is the column that holds these digit strings.

df_ZIPCOUNTY_CA[df_ZIPCOUNTY_CA['county code'].str.startswith('0')]['county codes'] = df_ZIPCOUNTY_CA[df_ZIPCOUNTY_CA['county code'].str.startswith('0')]['county code'].apply(lambda x: x.split('0')[1])

akuiper · Accepted Answer · 2018-04-08 03:31:47Z

3

Use str.replace with a regular expression to remove leading zeros:

df_ZIPCOUNTY_CA['county code']

#0    010
#1    001
#2    121
#Name: county code, dtype: object

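# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df_ZIPCOUNTY_CA['county code'].str.replace(r'^0+', '', regex=True)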

#0     10
#1      1
#2    121
#Name: county code, dtype: object

^0+ is a regular expression: ^ matches the beginning of the string, 0 matches a literal 0, and + is a quantifier meaning "one or more". Together, ^0+ matches the run of zeros at the start of the string.
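As a minimal sketch of the pattern described above (the sample values, including '100', are made up for illustration), the following shows that only zeros at the start of the string are removed, while interior and trailing zeros are left alone:

```python
import pandas as pd

# Illustrative data; '100' is included to show that trailing zeros survive.
codes = pd.Series(['010', '001', '121', '100'], name='county code')

# ^ anchors the match to the start, so only the leading run of zeros is removed.
stripped = codes.str.replace(r'^0+', '', regex=True)

print(stripped.tolist())  # ['10', '1', '121', '100']
```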

Here is a quick timing comparison of the two approaches (regex replace versus converting to int and back to str).

df_ZIPCOUNTY_CA = pd.DataFrame([['010'], ['001'], ['121']], columns=['county code'])

df_ZIPCOUNTY_CA = pd.concat([df_ZIPCOUNTY_CA] * 10000)

%timeit df_ZIPCOUNTY_CA['county code'].str.replace(r'^0+', '', regex=True)
# 10 loops, best of 3: 37.1 ms per loop

%timeit df_ZIPCOUNTY_CA['county code'].astype(int).astype(str)
# 10 loops, best of 3: 70.8 ms per loop
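The %timeit lines only work inside IPython or Jupyter. Outside of those, a rough equivalent can be put together with the standard-library timeit module. This is only a sketch: the dictionary of candidates and the choice of 5 runs are arbitrary, and the numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Same shape of data as in the timings above: 30,000 three-character codes.
df = pd.DataFrame({'county code': ['010', '001', '121'] * 10000})
col = df['county code']

# Hypothetical labels mapped to the expressions being compared.
candidates = {
    'str.replace': lambda: col.str.replace(r'^0+', '', regex=True),
    'astype(int).astype(str)': lambda: col.astype(int).astype(str),
    'str.lstrip': lambda: col.str.lstrip('0'),
}

results = {}
for name, func in candidates.items():
    # Average wall-clock seconds per call over 5 runs.
    results[name] = timeit.timeit(func, number=5) / 5
    print(f'{name}: {results[name] * 1000:.1f} ms per call')
```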

Or, as @Bill commented, you can simply use str.lstrip, which is the fastest approach here:

%timeit df_ZIPCOUNTY_CA['county code'].str.lstrip('0')
# 100 loops, best of 3: 8.9 ms per loop

# added the map str approach for comparison as well
%timeit df_ZIPCOUNTY_CA['county code'].astype(int).map(str)
# 100 loops, best of 3: 13.3 ms per loop
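Speed aside, the approaches are not fully interchangeable. A small sketch with made-up values shows the difference on an all-zero code: lstrip leaves an empty string, while the int round trip keeps a single '0'.

```python
import pandas as pd

# Illustrative values; '000' is the edge case of interest.
codes = pd.Series(['000', '007', '120'])

via_lstrip = codes.str.lstrip('0')
via_int = codes.astype(int).astype(str)

print(via_lstrip.tolist())  # ['', '7', '120']
print(via_int.tolist())     # ['0', '7', '120']
```

Which result is right depends on the data. If a code of '000' can occur and should stay readable, the int round trip is probably the safer choice; note, however, that astype(int) raises on values that are not valid integers.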

edited Apr 8, 2018 at 3:31

answered Apr 8, 2018 at 3:13

akuiper


7 Comments

paulnsn Over a year ago

thank you. Can you please let me know what '^0+' is and where I can learn more about it?

akuiper Over a year ago

Explanation added. You can look into the regex info if you want to learn more about regex.

Bill Over a year ago

Would be simpler to use str.lstrip() no?

akuiper Over a year ago

@Bill Yes, you are right. That could be the best approach here.

BENY Over a year ago

I found that the list comprehension [x.lstrip('0') for x in df.A] is the fastest so far.


jpp · Accepted Answer · 2018-04-08 03:26:19Z

2

You can convert your series to int and then to str.

df_ZIPCOUNTY_CA['county code'] = df_ZIPCOUNTY_CA['county code'].astype(int).astype(str)

Example

df = pd.DataFrame({'A': ['001', '010', '100']})

df['A'] = df['A'].astype(int).map(str)

print(df)

#      A
# 0    1
# 1   10
# 2  100
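The attempt in the question most likely had no effect because df[mask]['col'] = ... assigns into a temporary copy produced by the boolean selection, not into the original DataFrame. If only the rows starting with '0' should be touched, a sketch using .loc with a boolean mask (the variable name mask is just for illustration) would be:

```python
import pandas as pd

df = pd.DataFrame({'county code': ['001', '010', '121']})

# Boolean mask of the rows whose code starts with '0'.
mask = df['county code'].str.startswith('0')

# A single .loc call selects rows and column together, so the
# assignment writes into df itself rather than into a copy.
df.loc[mask, 'county code'] = df.loc[mask, 'county code'].str.lstrip('0')

print(df['county code'].tolist())  # ['1', '10', '121']
```

For this particular task the mask is not strictly needed, since stripping leading zeros leaves the other rows unchanged anyway.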

Performance benchmarking

df = pd.DataFrame({'A': ['001', '010', '100']})

df = pd.concat([df]*10000, ignore_index=True)

%timeit df['A'].astype(int).map(str)    # 21.6 ms
%timeit df['A'].str.replace(r'^0+', '', regex=True)  # 77.2 ms

edited Apr 8, 2018 at 3:26

answered Apr 8, 2018 at 3:11

jpp


BENY · Accepted Answer · 2018-04-08 03:46:51Z

1

Use pd.to_numeric (note that the result is an integer column, not strings):

pd.to_numeric(df.A)
Out[66]: 
0      1
1     10
2    100
Name: A, dtype: int64
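Since the question asks for strings, the numeric result would still need converting back. A minimal sketch, assuming the same small df as in the other examples (the 'n/a' value in the second part is invented to show errors='coerce'):

```python
import pandas as pd

df = pd.DataFrame({'A': ['001', '010', '100']})

# to_numeric parses the digit strings into integers...
as_numbers = pd.to_numeric(df['A'])

# ...and astype(str) turns them back into strings without leading zeros.
as_strings = as_numbers.astype(str)
print(as_strings.tolist())  # ['1', '10', '100']

# With errors='coerce', unparseable values become NaN instead of raising.
coerced = pd.to_numeric(pd.Series(['001', 'n/a']), errors='coerce')
print(coerced.isna().tolist())  # [False, True]
```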

Or use Python's built-in str.lstrip in a list comprehension (rather than the pandas .str.lstrip accessor):

[x.lstrip('0') for x in df.A]

Timing: surprisingly, the list comprehension is faster:

%timeit [x.lstrip('0') for x in df.A]
100 loops, best of 3: 5.26 ms per loop
%timeit df['A'].str.lstrip('0')
100 loops, best of 3: 10 ms per loop
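The comprehension returns a plain Python list, so it has to be assigned back to the column to take effect. A short sketch, again assuming the small example df used above:

```python
import pandas as pd

df = pd.DataFrame({'A': ['001', '010', '100']})

# The list has one entry per row, in order, so it can replace the column directly.
df['A'] = [x.lstrip('0') for x in df['A']]

print(df['A'].tolist())  # ['1', '10', '100']
```

One trade-off to keep in mind: the comprehension calls lstrip on every element, so a missing value (NaN) in the column would raise an AttributeError, whereas the pandas .str.lstrip accessor passes missing values through.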

edited Apr 8, 2018 at 3:46

answered Apr 8, 2018 at 3:40

BENY
