I have a pandas DataFrame with a string column. The frame has over 2 million rows, so looping to extract the elements I need is a poor choice. My current code looks like this:

for i in range(len(table["series_id"])):
    table["state_code"] = table["series_id"][i][2:4]
    table["area_code"] = table["series_id"][i][5:9]
    table["supersector_code"] = table["series_id"][i][11:12]

where "series_id" is a string containing several information fields. Here is an example of the data.

columns:

 [series_id, year, month, value, footnotes]

The data:

[['SMS01000000000000001' '2006' 'M01' 1966.5 '']
 ['SMS01000000000000001' '2006' 'M02' 1970.4 '']
 ['SMS01000000000000001' '2006' 'M03' 1976.6 '']

However, series_id is the column of interest that I am struggling with. I have looked at the str functions in Python, and specifically in pandas.

http://pandas.pydata.org/pandas-docs/stable/basics.html#testing-for-strings-that-match-or-contain-a-pattern

has a section describing each of the string functions; get and slice are the ones I would like to use. Ideally, I envision a solution like one of these:

table["state_code"] = table["series_id"].str.get(1:3)

or

table["state_code"] = table["series_id"].str.slice(1:3)

or

table["state_code"] = table["series_id"].str.slice([1:3])

When I try any of the calls above, I get an invalid syntax error for the ":".

Alas, I cannot figure out the proper way to perform the vectorized operation of taking a substring of a pandas DataFrame column.

Thank you

  • I think what you want is table["state_code"] = table["series_id"].str[1:3] Commented Mar 3, 2014 at 21:54
  • Note: that's a really bad way to iterate over the rows; either use iterrows or apply. Using range like that creates a huge Python list (in Python 2); xrange is slightly better. Commented Mar 3, 2014 at 21:57
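
The slicing suggested in the first comment might look like the sketch below. The slice bounds are copied from the loop in the question (the comment itself uses 1:3), so they are an assumption about where each field sits and will likely need adjusting. The two-row frame is only an illustration built from series IDs that appear on this page.

```python
import pandas as pd

# Illustrative frame; only the series_id values come from this page.
table = pd.DataFrame(
    {"series_id": ["SMS01000000000000001", "SMU78000009092000001"]}
)

# .str[start:stop] slices every string in the column at once, no Python loop.
# Bounds are taken from the question's loop and may need tweaking.
table["state_code"] = table["series_id"].str[2:4]
table["area_code"] = table["series_id"].str[5:9]

# .str.slice(start, stop) is the method form of the same operation;
# it takes two arguments rather than the "1:3" syntax tried in the question.
table["supersector_code"] = table["series_id"].str.slice(11, 12)

print(table)
```

The syntax error in the question comes from writing 1:3 inside a function call; slice notation is only valid inside square brackets, which is why .str[...] works where .str.get(1:3) does not.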

1 Answer

I think I would use str.extract with some regex (which you can tweak for your needs):

In [11]: s = pd.Series(["SMU78000009092000001"])

In [12]: s.str.extract(r'^.{2}(?P<state_code>.{3}).{1}(?P<area_code>\d{4}).{2}(?P<supersector_code>.{2})')
Out[12]: 
  state_code area_code supersector_code
0        U78      0000               92

This reads as: start (^) with any two characters (which are ignored); the next three characters (any) are state_code; then one character (ignored); then four digits, which are area_code; ...
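
To apply the same idea to the whole frame rather than a single Series, one option is to extract the groups and join them back as new columns. This is a minimal sketch: the regex is the one shown above (its field widths may not match the real series_id layout), and the two-row frame is only an illustration built from series IDs that appear on this page.

```python
import pandas as pd

# Illustrative frame; only the series_id values come from this page.
table = pd.DataFrame(
    {"series_id": ["SMU78000009092000001", "SMS01000000000000001"]}
)

# Same pattern as above, split across lines for readability.
# Raw strings keep "\d" from being treated as a string escape.
pattern = (
    r"^.{2}(?P<state_code>.{3})"         # skip 2 chars, capture 3
    r".{1}(?P<area_code>\d{4})"          # skip 1 char, capture 4 digits
    r".{2}(?P<supersector_code>.{2})"    # skip 2 chars, capture 2
)

# With named groups, str.extract returns a DataFrame whose columns are
# the group names; rows that do not match the pattern come back as NaN.
codes = table["series_id"].str.extract(pattern)

# Attach the extracted columns to the original frame by index.
table = table.join(codes)

print(table)
```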

3 Comments

Just curious, is the 'Out[12]' returning a Data frame?
@user3376660 yep, that's a DataFrame, with the group names you're extracting as column names :)
@user3376660 you'll probably need to tweak the numbers a bit to suit your needs!
