2

I have a pandas column of strings. I want to break up the individual components of the string into their own columns using regex:

In [35]:

import re
import pandas as pd

In [36]:

data = {'raw': ['Baker 1 2009-11-17       1223.0',
                'Baker 1 2010-06-24       1122.7',
                'Baker 2 2009-07-24       2819.0',
                'Baker 2 2010-08-25       2971.6',
                'Baker 1 2011-01-05       1410.0',
                'Baker 2 2010-09-04       4671.6']}
df = pd.DataFrame(data, columns = ['raw'])
df

Out[36]:
     raw
0    Baker 1 2009-11-17 1223.0
1    Baker 1 2010-06-24 1122.7
2    Baker 2 2009-07-24 2819.0
3    Baker 2 2010-08-25 2971.6
4    Baker 1 2011-01-05 1410.0
5    Baker 2 2010-09-04 4671.6

This is what I want it to look like:

Out[41]:
     name    value   date          score
0    Baker   1       2010-06-24    1223.0
1    Baker   1       2009-07-24    1122.7
2    Baker   2       2009-07-24    2819.0
3    Baker   2       2010-08-25    2971.6
4    Baker   1       2011-01-05    1410.0
5    Baker   2       2010-09-04    4671.6

I've tried data.str.contains() but I can't seem to make it work. Any help would be appreciated.

2 Answers 2

3

Based on this answer - Pandas DataFrame - how do I split a column

In [122]: pd.DataFrame(df['raw'].str.split().tolist(), columns=['name','value','date','score'])
Out[122]: 
    name value        date   score
0  Baker     1  2009-11-17  1223.0
1  Baker     1  2010-06-24  1122.7
2  Baker     2  2009-07-24  2819.0
3  Baker     2  2010-08-25  2971.6
4  Baker     1  2011-01-05  1410.0
5  Baker     2  2010-09-04  4671.6

[6 rows x 4 columns]
Sign up to request clarification or add additional context in comments.

1 Comment

All of the columns here are object dtypes, which I doubt is what OP wanted. You can call convert_objects(convert_numeric=True) on the result to convert numbers, but that still won't convert date to datetime64[ns].
1

Is it a requirement that you use regular expressions? Using regexs here is too complicated, since you have structured data that are easily parsed by read_csv. That said, there are a couple of ways you can do this, in addition to @chrisb's answer:

StringIO + read_csv:

In [45]: data
Out[45]:
{'raw': ['Baker 1 2009-11-17       1223.0',
  'Baker 1 2010-06-24       1122.7',
  'Baker 2 2009-07-24       2819.0',
  'Baker 2 2010-08-25       2971.6',
  'Baker 1 2011-01-05       1410.0',
  'Baker 2 2010-09-04       4671.6']}

In [46]: text = '\n'.join(data['raw'])

In [47]: print(text)
Baker 1 2009-11-17       1223.0
Baker 1 2010-06-24       1122.7
Baker 2 2009-07-24       2819.0
Baker 2 2010-08-25       2971.6
Baker 1 2011-01-05       1410.0
Baker 2 2010-09-04       4671.6

In [48]: from StringIO import StringIO

In [49]: df = pd.read_csv(StringIO(text), sep=r'\s+', parse_dates=[2], names=['name', 'value', 'date', 'score'])

In [50]: df
Out[50]:
    name  value       date   score
0  Baker      1 2009-11-17  1223.0
1  Baker      1 2010-06-24  1122.7
2  Baker      2 2009-07-24  2819.0
3  Baker      2 2010-08-25  2971.6
4  Baker      1 2011-01-05  1410.0
5  Baker      2 2010-09-04  4671.6

In [51]: df.dtypes
Out[51]:
name             object
value             int64
date     datetime64[ns]
score           float64
dtype: object

This allows you provide names and infers the dtypes. I would go for this above the others.

Series.str.extract():

Note: You probably shouldn't use the score regex to match arbitrary floats (e.g., it doesn't match negative numbers): look into tokenize.Floatnumber

In [29]: df
Out[29]:
                               raw
0  Baker 1 2009-11-17       1223.0
1  Baker 1 2010-06-24       1122.7
2  Baker 2 2009-07-24       2819.0
3  Baker 2 2010-08-25       2971.6
4  Baker 1 2011-01-05       1410.0
5  Baker 2 2010-09-04       4671.6

In [30]: raw = df.raw.str.extract(r'(?P<name>[a-zA-Z]+)\s+(?P<value>\d+)\s+(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<score>\d*\.\d*)')

In [31]: raw
Out[31]:
    name value        date   score
0  Baker     1  2009-11-17  1223.0
1  Baker     1  2010-06-24  1122.7
2  Baker     2  2009-07-24  2819.0
3  Baker     2  2010-08-25  2971.6
4  Baker     1  2011-01-05  1410.0
5  Baker     2  2010-09-04  4671.6

In [32]: raw.dtypes
Out[32]:
name     object
value    object
date     object
score    object
dtype: object

In [33]: r = raw.convert_objects(convert_numeric=True)

In [34]: r
Out[34]:
    name  value        date   score
0  Baker      1  2009-11-17  1223.0
1  Baker      1  2010-06-24  1122.7
2  Baker      2  2009-07-24  2819.0
3  Baker      2  2010-08-25  2971.6
4  Baker      1  2011-01-05  1410.0
5  Baker      2  2010-09-04  4671.6

In [35]: r.dtypes
Out[35]:
name      object
value      int64
date      object
score    float64
dtype: object

Note: This didn't convert the date column. Use pandas.to_datetime for that.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.