I have a pandas dataframe in which I'm trying to run some operations on a column of string values which includes some missing data being interpreted as float('nan'), equivalent to:

import pandas as pd

df = pd.DataFrame({'otherData': [1, 2, 3, 4],
                   'stringColumn': [float('nan'), 'Random string one... ', 'another string..  ', 'a third string    ']})


DataFrame contents:

otherData    stringColumn
1            nan
2            'Random string one... '
3            'another string..  '
4            'a third string    '

I want to clean the stringColumn data of the various trailing ellipses and whitespace, and impute empty strings, i.e. '', for nan values.

To do this, I'm using code equivalent to:

df['stringColumn'] = df['stringColumn'].fillna('')
df['stringColumn'] = df['stringColumn'].str.strip()
df['stringColumn'] = df['stringColumn'].str.strip('...')
df['stringColumn'] = df['stringColumn'].str.strip('..')

The problem is that when I run this code in the script I've written, it doesn't work: there are still nan values in my 'stringColumn' column, and some, but not all, of the ellipses remain. There are no warning messages. However, when I run the exact same code in the Python shell, it works, imputing '' for nan and cleaning up as desired. I've tried running it in IDLE 3.5.0 and Spyder 3.2.4, with the same result.

2 Answers

This works nicely for me on pandas v0.20.2, so you might want to try upgrading with

pip install --upgrade pandas

Call str.strip first, and you can do this in one str.replace call.

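# regex=True is needed on pandas 2.0+, where str.replace defaults to literal matching
# (the keyword exists since pandas 0.23; drop it on older versions such as 0.20.x)
df.stringColumn = df.stringColumn.fillna('')\
        .str.strip().str.replace(r'((?<=^)\.+)|(\.+(?=$))', '', regex=True)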

0                     
1    Random string one
2       another string
3       a third string
Name: stringColumn, dtype: object

If the nan values are not real NaN floats but the string 'nan', just extend your regex:

((?<=^)\.+)|(\.+(?=$))|nan
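One caveat with the extended pattern, shown here as a small sketch in plain Python `re` rather than pandas: an unanchored `nan` alternative also matches inside longer words. The names `loose` and `anchored` are made up for this illustration, and anchoring with `^nan$` is a suggested variation, not part of the original answer.

```python
import re

# Pattern from the answer, with the unanchored 'nan' alternative.
loose = re.compile(r'((?<=^)\.+)|(\.+(?=$))|nan')

# Variation (an assumption, not from the answer): only match a value
# that is exactly the string 'nan'.
anchored = re.compile(r'((?<=^)\.+)|(\.+(?=$))|^nan$')

print(loose.sub('', 'banana...'))     # 'nan' inside the word is removed too
print(anchored.sub('', 'banana...'))  # only the trailing dots are removed
print(anchored.sub('', 'nan'))        # a bare 'nan' becomes the empty string
```

If the real column can contain words with "nan" in them, the anchored form is probably the safer choice.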

Regex Details

(
(?<=^)    # lookbehind for start of string
\.+       # one or more '.'
)
|         # regex OR
(
\.+       # one or more '.'
(?=$)     # lookahead for end of string
)

The regex looks for leading or trailing dots (one or more) and removes them.
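To see the pattern in isolation, here is a minimal sketch using the standard `re` module instead of pandas. The helper name `strip_edge_dots` and the extra sample with interior dots are invented for the illustration; the pattern itself is the one from the answer.

```python
import re

# Same pattern as above: a run of dots at the start, or a run of dots at the end.
EDGE_DOTS = re.compile(r'((?<=^)\.+)|(\.+(?=$))')

def strip_edge_dots(text):
    """Strip surrounding whitespace first, then leading/trailing runs of dots."""
    return EDGE_DOTS.sub('', text.strip())

samples = ['Random string one... ', 'another string..  ', 'a third string    ', '']
cleaned = [strip_edge_dots(s) for s in samples]
print(cleaned)

# Dots in the middle of the string are left alone.
print(strip_edge_dots('..a.b.. '))
```

Whitespace has to be stripped before the substitution, otherwise a trailing space sits between the dots and the end of the string and the lookahead for `$` never succeeds.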

10 Comments

To clarify, you ran this in a script, not the shell, and it worked?
@ColeRobertson Uhm, yes.
Hmmmm. After the update, this works in a dummy script for me as well, but not in the real data. I'm sorry not to make the problem replicable, but if I knew how to do that, I'd likely know how to fix it as well... Any further suggestions?
@ColeRobertson See my edit about changing the regex for "nan" strings.
It's not a string, though. isinstance(datapoint,float) yields True.

Your code works for me as well with pandas==0.20.1.

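You can also do this as a one-liner without regexes. The strip() method supports a chars argument of characters to remove from both ends of the string. chars is treated as a set of individual characters, not as a substring, so strip('...') and strip('..') behave exactly like strip('.').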

df['stringColumn'] = df['stringColumn'].fillna('').str.strip('. ')

Docstring for strip():

S.strip([chars]) -> str

Return a copy of the string S with leading and trailing
whitespace removed.
If chars is given and not None, remove characters in chars instead.
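A quick sketch of what the docstring means in practice, using plain Python strings (the pandas `.str.strip` accessor applies the same rule per element). The sample list, including the made-up `'mixed . . .'` value, is only for illustration.

```python
samples = ['Random string one... ', 'another string..  ', 'a third string    ', 'mixed . . .']

# '. ' means: strip any mix of dots and spaces from both ends.
cleaned = [s.strip('. ') for s in samples]
print(cleaned)

# chars is a set of characters, so repeating a character changes nothing.
same = 'x...'.strip('...') == 'x...'.strip('.')
print(same)

# Stripping only dots stops at the first character outside the set,
# here the trailing space, so nothing is removed.
untouched = 'one... '.strip('.')
print(repr(untouched))
```

This is also why the order of the calls in the question matters: whitespace has to go first (or be included in chars) before the dots at the end become reachable.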

1 Comment

Yeah, it works for me as well. I think the problem was something to do with idiosyncrasies in the real data, which obviously I can't share, and how pandas was handling those.
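For anyone hitting similar symptoms, a way to look for such idiosyncrasies is to print the Unicode names of the characters at the end of a value that refuses to clean up. This is only a sketch: the sample value below is hypothetical, and the single-character ellipsis (U+2026) and no-break space (U+00A0) are guesses at what real data might contain, not something confirmed in this thread.

```python
import unicodedata

def describe_edges(text, width=3):
    """Return the Unicode names of the last `width` characters of text."""
    return [unicodedata.name(ch, 'UNNAMED') for ch in text[-width:]]

# Hypothetical value: looks like 'Random string one... ' when printed,
# but ends in U+2026 and U+00A0 rather than three dots and a space.
sample = 'Random string one\u2026\u00a0'

print(describe_edges(sample, 2))

# strip('. ') leaves this value untouched, because neither trailing
# character is an ASCII dot or an ASCII space.
unchanged = sample.strip('. ') == sample
print(unchanged)

# Bare strip() removes the no-break space; adding U+2026 to chars
# then removes the ellipsis character as well.
fixed = sample.strip().strip('.\u2026')
print(fixed)
```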