sub string python pandas

Question

I have a pandas DataFrame with a string column in it. The frame has over 2 million rows, so looping to extract the elements I need is a poor choice. My current code looks like the following:

for i in range(len(table["series_id"])):
    table["state_code"] = table["series_id"][i][2:4]
    table["area_code"] = table["series_id"][i][5:9]
    table["supersector_code"] = table["series_id"][i][11:12]

where "series_id" is the string containing the multiple information fields I want to extract. An example data element:

columns:

 [series_id, year, month, value, footnotes]

The data:

[['SMS01000000000000001' '2006' 'M01' 1966.5 '']
 ['SMS01000000000000001' '2006' 'M02' 1970.4 '']
 ['SMS01000000000000001' '2006' 'M03' 1976.6 '']

However, series_id is the column of interest that I am struggling with. I have looked at the str functions for Python, and specifically for pandas.

http://pandas.pydata.org/pandas-docs/stable/basics.html#testing-for-strings-that-match-or-contain-a-pattern

That page has a section describing each of the string functions; get and slice are specifically the ones I would like to use. Ideally, I could envision a solution like so:

table["state_code"] = table["series_id"].str.get(1:3)

or

table["state_code"] = table["series_id"].str.slice(1:3)

or

table["state_code"] = table["series_id"].str.slice([1:3])

When I try the calls above, I get an invalid syntax error for the ":", and I cannot seem to figure out the proper way to perform the vectorized operation of taking a substring of a pandas DataFrame column.

Thank you

I think what you want is table["state_code"] = table["series_id"].str[1:3]
– EdChum, Commented Mar 3, 2014 at 21:54
Note: that's a really bad way to iterate over the rows; use either iterrows or apply. Using range like that creates a huge Python list (in Python 2); xrange is slightly better.
– Andy Hayden, Commented Mar 3, 2014 at 21:57
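
For illustration, a minimal sketch of the vectorized slicing suggested in the first comment. It assumes the character offsets from the loop in the question (2:4, 5:9, 11:12) are the ones actually wanted; the two sample IDs are taken from the question and the answer, and the tiny frame stands in for the real 2-million-row table.

```python
import pandas as pd

# Small stand-in for the real table (assumption: series_id holds fixed-width strings).
table = pd.DataFrame({
    "series_id": ["SMS01000000000000001", "SMU78000009092000001"],
    "year": ["2006", "2006"],
})

# .str[start:stop] slices every string in the column at once, no Python loop needed.
table["state_code"] = table["series_id"].str[2:4]

# .str.slice(start, stop) is the equivalent method form; note that it takes
# separate arguments rather than the "1:3" syntax, which is only valid inside [].
table["area_code"] = table["series_id"].str.slice(5, 9)

table["supersector_code"] = table["series_id"].str[11:12]

print(table)
```

The invalid syntax error in the question comes from writing a slice literal inside a function call; slice notation is only accepted inside square brackets, which is why .str[...] works while .str.slice(1:3) does not.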

Andy Hayden · Accepted Answer · 2014-03-03 21:51:55Z

4

I think I would use str.extract with some regex (which you can tweak for your needs):

In [11]: s = pd.Series(["SMU78000009092000001"])

In [12]: s.str.extract(r'^.{2}(?P<state_code>.{3}).{1}(?P<area_code>\d{4}).{2}(?P<supersector_code>.{2})')
Out[12]: 
  state_code area_code supersector_code
0        U78      0000               92

This reads as: start (^) with any two characters (which are ignored); the next three characters (any) are state_code; then any one character (ignored); then four digits, which are area_code; then any two characters (ignored); and the next two characters (any) are supersector_code.
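
As a rough sketch of how this could be applied to the whole table rather than a single Series: the frame below is a hypothetical two-row stand-in, and joining the extracted columns back with join is one possible choice, not something prescribed above. The group widths are the ones from the regex and may need tweaking.

```python
import pandas as pd

# Hypothetical stand-in for the real table.
table = pd.DataFrame({
    "series_id": ["SMU78000009092000001", "SMS01000000000000001"],
})

# Raw string so that \d is passed to the regex engine unchanged.
pattern = (
    r"^.{2}(?P<state_code>.{3})"      # skip 2 characters, capture 3
    r".{1}(?P<area_code>\d{4})"       # skip 1 character, capture 4 digits
    r".{2}(?P<supersector_code>.{2})" # skip 2 characters, capture 2
)

# With named groups, extract returns a DataFrame whose columns are the group names.
codes = table["series_id"].str.extract(pattern, expand=True)

# Attach the new columns to the original rows (indexes line up one-to-one).
table = table.join(codes)

print(table)
```

Rows that do not match the pattern come back as NaN in all three columns, which makes malformed IDs easy to spot afterwards.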


3 Comments

user3376660 Over a year ago

Just curious, is the 'Out[12]' returning a Data frame?

Andy Hayden Over a year ago

@user3376660 yep, that's a DataFrame, with the group names you're extracting as column names :)

Andy Hayden Over a year ago

@user3376660 you'll probably need to tweak the numbers a bit to suit your needs!
