2

I am trying to clean a column called 'historical_rank' in a pandas dataframe. It contains string data. Here is a sample of the content:

       historical_rank
...    ...
122    1908
123    O'   
124 
125    1911  
126    1912  
127    1913 * * * 2010 * * *  
128
129    1914  
130    1915
131  
132
133    1918  
134    (First served 1989 to 1999)
...    ...

The data I want to retain are the four-digit numbers in rows 122, 125, 126, 127, 129, 130, and 133. Elsewhere in the series that number (the historical rank) may be one, two, or three digits. It always begins the string, and there is always a space after it. I want to use regex to keep the desired pattern -- r'\d{1,4}(?=\s)' -- and remove everything else throughout the series. What is the correct code to achieve this? Thank you.

5
  • Have a look into the pd.Series.str.extract function. You can assign said date value into a new DataFrame column, then drop the existing column. Docs here. Commented May 25, 2020 at 20:37
  • How can a year be a single digit value? Maybe you want to extract the two or four digits at the start of the string that are followed by whitespace? Commented May 25, 2020 at 20:48
  • @Ed_in_NY, could you please include scenarios other than a four-digit start in your sample data, if there are any? Commented May 25, 2020 at 20:49
  • If you only want to keep the dates in rows 122, 125, 126, 127, 129, 130, and 133, could you not use r'(1908)|(1911)|(1912)|(1913)|(1915)|(1918)' combined with @S3DEV's suggestion? Commented May 25, 2020 at 20:55
  • The historical ranking isn't a year, although it may look like that in the sample I provided. The data are all people who have served as US senators. There have been 1984 senators to date. The senators of the First Congress, for instance, all had historical rankings of one or two digits. Commented May 25, 2020 at 20:59

5 Answers

3

As an alternative, you could use str.replace and use a pattern with a capturing group to keep what you want, and match what you want to remove.

  • ^ Start of string
  • ( Capture group 1 (Keep)
    • \d{1,4} Match 1-4 digits
  • ) Close group
  • \s Match a whitespace char
  • | Or
  • .+ Match any char 1+ times

In the replacement, use group 1: r'\1'

^(\d{1,4})\s|.+

Regex demo

For example

df.historical_rank = df.historical_rank.str.replace(r"^(\d{1,4})\s|.+", r'\1', regex=True)
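
As a rough check of this replacement, here is a minimal sketch run against a few made-up values modelled on the question's sample (the DataFrame contents are assumptions, not the asker's real data). `regex=True` is passed explicitly because newer pandas releases treat the pattern as a literal string unless told otherwise.

```python
import pandas as pd

# Hypothetical sample mirroring the kinds of rows shown in the question.
df = pd.DataFrame({
    "historical_rank": [
        "1908 ",
        "O'   ",
        "",
        "1913 * * * 2010 * * *  ",
        "(First served 1989 to 1999)",
        "7 ",
    ]
})

# First alternative: leading 1-4 digits plus one whitespace char, kept via group 1.
# Second alternative: anything else, replaced by the (empty) group 1.
df["historical_rank"] = df["historical_rank"].str.replace(
    r"^(\d{1,4})\s|.+", r"\1", regex=True
)

print(df["historical_rank"].tolist())
# ['1908', '', '', '1913', '', '7']
```

Note that a cell holding only digits with no trailing whitespace would fall through to the `.+` branch and be emptied, so this relies on the question's statement that a space always follows the rank.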


2

IIUC

df['historical_rank_new'] = df['historical_rank'].str.extract(r'(^[\d]{1,4})')
df
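
To make the behaviour of this call a little more concrete, here is a small sketch on invented data (the sample values are assumptions). It shows that `str.extract` returns a DataFrame by default, with an unnamed group landing in column `0`, and that rows without a match come back as missing values.

```python
import pandas as pd

# Hypothetical sample values; not the asker's real data.
df = pd.DataFrame({
    "historical_rank": [
        "1908 ",
        "O'   ",
        "",
        "1913 * * * 2010 * * *  ",
        "42 ",
        "(First served 1989 to 1999)",
    ]
})

# With the default expand=True, the result is a DataFrame with one
# column per capture group; an unnamed group is labelled 0.
extracted = df["historical_rank"].str.extract(r"(^\d{1,4})")
print(type(extracted).__name__)  # DataFrame

df["historical_rank_new"] = extracted[0]

# Rows that do not start with a digit are missing (NaN) in the new column.
print(df["historical_rank_new"].dropna().tolist())
# ['1908', '1913', '42']
```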

1 Comment

With slight modification to the regex -- (^[\d]{1,4}) -- this solution has provided the best result. Thanks.

1

You should think of using your regex in a str.extract method keeping in mind that this method requires the regex to contain at least one capturing group.

If you plan to match one, two, three or four digits at the start of the string that are followed by at least one whitespace (just judging by your \d{1,4}(?=\s) pattern) you should try

df['historical_rank_clean'] = df['historical_rank'].str.extract(r'^(\d{1,4})\s', expand=False).fillna('')

Note the (...) in the pattern: the parentheses form a capturing group, and its contents will be used to fill the cells in the new historical_rank_clean column. .fillna('') will populate the entries that have no match with an empty string.
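
A minimal sketch of this approach on invented values (the sample data is an assumption). It also illustrates two edge cases: digits with no whitespace after them, and a run of more than four digits, both of which are left empty by this pattern.

```python
import pandas as pd

# Hypothetical sample values chosen to exercise the edge cases.
df = pd.DataFrame({
    "historical_rank": ["1908 ", "1915", "O'   ", "3 ", "12345 "]
})

# expand=False returns a Series for a single capture group;
# fillna('') turns non-matching rows into empty strings.
df["historical_rank_clean"] = (
    df["historical_rank"]
    .str.extract(r"^(\d{1,4})\s", expand=False)
    .fillna("")
)

print(df["historical_rank_clean"].tolist())
# ['1908', '', '', '3', '']
```

"1915" is dropped here only because the hypothetical cell has nothing after the digits; "12345 " is dropped because no prefix of one to four digits is directly followed by whitespace.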

Some other regex ideas:

  • r'^(\d{2}(?:\d{2})?)\b' - extract two- or four-digit chunks at the start of the string that are followed by a word boundary
  • r'^((?:20|19)?\d{2})\b' - similar to above, but only allowing years starting with 19 or 20 if these are four-digit years.

See the regex demo
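
The two alternative patterns can be compared with a short sketch using only the standard `re` module. The sample strings and the helper `first_group` are made up for illustration. Note that neither pattern matches one- or three-digit values, which matters if the leading number is a rank rather than a year.

```python
import re

two_or_four = re.compile(r"^(\d{2}(?:\d{2})?)\b")
year_like = re.compile(r"^((?:20|19)?\d{2})\b")

# Hypothetical inputs: a year-like value, a two-digit value, a four-digit
# value not starting with 19/20, a three-digit value and a one-digit value.
samples = ["1913 * * * 2010 * * *", "08 ", "1813 ", "123 ", "7 "]


def first_group(pattern, text):
    """Return capture group 1 if the pattern matches, else None."""
    match = pattern.search(text)
    return match.group(1) if match else None


results = {
    text: (first_group(two_or_four, text), first_group(year_like, text))
    for text in samples
}

for text, (a, b) in results.items():
    print(repr(text), a, b)
# '1913 * * * 2010 * * *' 1913 1913
# '08 ' 08 08
# '1813 ' 1813 None
# '123 ' None None
# '7 ' None None
```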


0

To improve @wwnde answer, you could use:

df['historical_rank_new'] = df['historical_rank'].str.extract(r'(^\d{1,4}$)')

Here it is on repl.it
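
One thing worth checking with this variant is how the `$` anchor interacts with trailing characters. The sketch below uses invented values (an assumption, not the asker's data) and suggests that only cells consisting entirely of one to four digits are matched; a trailing space, which the question says is always present, prevents a match.

```python
import pandas as pd

# Hypothetical values: bare digits, digits plus a trailing space,
# digits followed by other text, and a short bare number.
s = pd.Series(["1908", "1908 ", "1913 * * * 2010 * * *  ", "42"])

anchored = s.str.extract(r"(^\d{1,4}$)", expand=False)

# True where the pattern matched, False where the result is missing.
print(anchored.notna().tolist())
# [True, False, False, True]
```

If the real data does carry trailing whitespace, stripping it first with `.str.strip()` would be one way to make this pattern usable, though that would still skip rows such as "1913 * * * 2010 * * *".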


0

I think this would be simpler and more efficient:

df['historical_rank_new'] = df['historical_rank'].str.extract(r'(\d{4})')
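
A quick sketch of what an unanchored four-digit pattern picks up, on invented values (the data is an assumption). Because there is no `^`, the first run of four digits anywhere in the string is returned, and ranks shorter than four digits are not matched at all.

```python
import pandas as pd

# Hypothetical values modelled on the question's sample.
s = pd.Series([
    "1908 ",
    "42 ",
    "(First served 1989 to 1999)",
    "1913 * * * 2010 * * *  ",
])

unanchored = s.str.extract(r"(\d{4})", expand=False)

print(unanchored.notna().tolist())
# [True, False, True, True]
print(unanchored.dropna().tolist())
# ['1908', '1989', '1913']
```

So this works for the four-digit rows in the sample, but it would also pull "1989" out of the parenthetical note and would miss one- to three-digit ranks.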

