Python pandas functions work in shell not in script

Question

I have a pandas dataframe in which I'm trying to run some operations on a column of string values which includes some missing data being interpreted as float('nan'), equivalent to:

import pandas as pd
df = pd.DataFrame({'otherData':[1,2,3,4],'stringColumn':[float('nan'),'Random string one... ','another string..  ','a third string    ']})

DataFrame contents:

otherData    stringColumn
1            nan
2            'Random string one... '
3            'another string..  '
4            ' a third string    '

I want to clean the stringColumn data of the various trailing ellipses and whitespace, and impute empty strings, i.e. '', for nan values.

To do this, I'm using code equivalent to:

df['stringColumn'] = df['stringColumn'].fillna('')
df['stringColumn'] = df['stringColumn'].str.strip()
df['stringColumn'] = df['stringColumn'].str.strip('...')
df['stringColumn'] = df['stringColumn'].str.strip('..')

The problem I'm encountering is that when I run this code in the script I've written, it doesn't work. There are still nan values in my 'stringColumn' column, and there are still some, but not all, ellipses. There are no warning messages. However, when I run the exact same code in the python shell, it works, imputing '' for nan, and cleaning up as desired. I've tried running it in IDLE 3.5.0 and Spyder 3.2.4, with the same result.

cs95 · Accepted Answer · 2017-10-26 19:03:22Z

1

This works nicely for me on pandas v0.20.2, so you might want to try upgrading with

pip install --upgrade pandas

Call str.strip first, and you can do this in one str.replace call.

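# regex=True is required on pandas 2.0+, where str.replace defaults to literal matching;
# omit it on versions older than 0.23, which lack the keyword.
df.stringColumn = df.stringColumn.fillna('')\
        .str.strip().str.replace(r'((?<=^)\.+)|(\.+(?=$))', '', regex=True)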

0                     
1    Random string one
2       another string
3       a third string
Name: stringColumn, dtype: object

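If the missing values are not float NaN but the literal string 'nan', extend the regex. The extra alternative is anchored so that it only matches a value that is exactly 'nan', not 'nan' inside a longer string:

((?<=^)\.+)|(\.+(?=$))|^nan$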

Regex Details

(
(?<=^)    # lookbehind for start of string
\.+       # one or more '.'
)
|         # regex OR
(
\.+       # one or more '.'
(?=$)     # lookahead for end of string
)

The regex looks for leading or trailing dots (one or more) and removes them.
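To see what the pattern does outside pandas, here is a minimal sketch using the standard `re` module. The sample strings are made up for illustration and are assumed to be whitespace-stripped already, as in the answer's `str.strip()` step.

```python
import re

# Pattern from the answer: runs of dots at the very start or very end of the string.
PATTERN = re.compile(r'((?<=^)\.+)|(\.+(?=$))')

# Hypothetical sample values, already stripped of whitespace.
samples = ['Random string one...', '..another string..', 'a. third string', '']

# Dots in the middle of a string are left alone; only the ends are cleaned.
cleaned = [PATTERN.sub('', s) for s in samples]
print(cleaned)
```

Note that the dot inside 'a. third string' survives, because it is neither at the start nor at the end of the string.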

edited Oct 26, 2017 at 19:03

answered Oct 26, 2017 at 18:55

cs95


10 Comments

Cole Robertson Over a year ago

To clarify, you ran this in a script, not the shell, and it worked?

cs95 Over a year ago

@ColeRobertson Uhm, yes.

Cole Robertson Over a year ago

Hmmmm. After the update, this works in a dummy script for me as well, but not in the real data. I'm sorry not to make the problem replicable, but if I knew how to do that, I'd likely know how to fix it as well... Any further suggestions?

cs95 Over a year ago

@ColeRobertson See my edit about changing the regex for "nan" strings.

Cole Robertson Over a year ago

It's not a string, though. isinstance(datapoint,float) yields True.


tdube · Accepted Answer · 2017-10-26 19:30:00Z

0

Your code works for me as well with pandas==0.20.1.

You can also do this as a one-liner without regexes. The strip() method supports a chars argument of characters to remove from both ends of the string.

df['stringColumn'] = df['stringColumn'].fillna('').str.strip('. ')

Docstring for strip():

S.strip([chars]) -> str

Return a copy of the string S with leading and trailing
whitespace removed.
If chars is given and not None, remove characters in chars instead.
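A small sketch of that behaviour with plain Python strings may help: `chars` is treated as a set of characters, not as a literal prefix or suffix. The last sample value is made up for illustration. pandas' `.str.strip` applies the same logic to each element of the column.

```python
# Sample values modelled on the question's column, plus one hypothetical mixed case.
samples = ['Random string one... ', 'another string..  ', ' a third string    ', '. . mixed . .']

# Any run of '.' and ' ' characters, in any order, is removed from both ends.
cleaned = [s.strip('. ') for s in samples]
print(cleaned)

# Because chars is a set, strip('...') behaves exactly like strip('.').
same = 'abc..'.strip('...') == 'abc..'.strip('.')
print(same)
```

This is also why the question's separate `strip('...')` and `strip('..')` calls are redundant: each one removes every leading and trailing dot regardless of how many dots are listed.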

answered Oct 26, 2017 at 19:30

tdube


1 Comment

Cole Robertson Over a year ago

Yeah, it works for me as well. I think the problem was something to do with idiosyncrasies in the real data, which obviously I can't share, and how pandas was handling those.
