Finding exact regex match for string in column

Question

This is my Dataframe:

                         entityId  delta_approved_clockout
 (ID: 10)              247333605                      0.0  
  (ID: 20)               36738870                      0.0  
  (ID: 40)             4668036427                      0.0  
  (ID: 50)             1918647972                      0.0  
  (ID: 60)             4323165902                  44125.0  
  (ID: 80)              145512255                      0.0  
 Assigned (ID: 30)       42050340                      0.0  
 Assigned (ID: 40)   130880371376                      0.0  
 Assigning (ID: 30)    1095844753                      0.0  
 Cancelled (ID: 40)        937280                      0.0  
 Cancelled (ID: 80)   16857720813                      0.0  
 Planned (ID: 20)      9060392597                      0.0  
 Planning (ID: 10)   108484297031                      0.0  
 Processed (ID: 70)  133289880880                      0.0  
 Revoked (ID: 50)      2411903072                      0.0  
 Writing (ID: 50)    146408550024                      0.0  
 Written (ID: 60)    139458227923                1018230.0

I want the result to contain only the exact match for '(ID: 10)'. My current filter (shown below the output) also returns 'Planning (ID: 10)', which is not the exact match I need. These are the summed results:

                        entityId  delta_approved_clockout  
last_status                                                
  (ID: 10)             247333605                      0.0  
 Planning (ID: 10)  108484297031                      0.0  

print input_data[input_data['last_status'].str.contains(r'(?<!\S)\(ID: 10\)(?!\S)', na=False)]

I have also tried patterns that returned 0 results, such as:

print input_data[input_data['last_status'].str.contains(r' ^(\(ID: \d+\))$', na=False)]

print input_data[input_data['last_status'].str.contains(r'^(\(ID: 10\))$', na=False)]

Perhaps I don't understand regex thoroughly. What would be the correct way to write this pattern? Thanks in advance.

Do you want to do it only with regex? You can simply do it using DataFrame slicing, like df = df[df['last_status'] == '(ID: 10)']
– Sociopath, Commented Feb 5, 2018 at 11:08
Try regex101.com/r/4Cb8as/1. If you don't want capturing groups, remove the ().
– Srdjan M., Commented Feb 5, 2018 at 11:11
df = df[df['last_status'] == '(ID: 10)'] includes Planning (ID: 10); I want solely the matches with (ID: 10)
– R A, Commented Feb 5, 2018 at 11:18
@WiktorStribiżew and S.Kablar, your implementations seem to be working, thanks. I'm still struggling to understand the correct use of regex, but I guess practice makes perfect. Ty
– R A, Commented Feb 5, 2018 at 11:23

Wiktor Stribiżew · Accepted Answer · 2018-02-05 11:24:53Z

1

You may use

r'^\s*\(ID:\s*\d+\)\s*$'

See the regex demo.

The pattern matches:

^ - start of string
\s* - zero or more (*) whitespace chars
\(ID: - a (ID: substring
\s* - zero or more (*) whitespace chars
\d+ - 1+ digits
\) - a ) char
\s* - zero or more (*) whitespace chars
$ - end of string.
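As a rough illustration of the breakdown above, here is a minimal sketch using only the standard `re` module. The sample labels are an assumption (the printed index suggests the real values carry leading whitespace), and `matching`, `bare_id_10` and the other names are hypothetical helpers, not part of the question's code. Note that the pattern as written accepts any ID; swapping `\d+` for `10` is one possible way to restrict it to a single ID.

```python
import re

# Assumed sample labels: the leading spaces mimic the index in the question.
labels = [" (ID: 10)", " Planning (ID: 10)", " (ID: 20)", " Written (ID: 60)"]

# Pattern from this answer: a bare "(ID: <digits>)" with optional padding.
any_bare_id = re.compile(r'^\s*\(ID:\s*\d+\)\s*$')
# Adaptation (an assumption, not from the answer): pin the ID to 10.
bare_id_10 = re.compile(r'^\s*\(ID:\s*10\)\s*$')
# Patterns from the question, for comparison.
lookaround = re.compile(r'(?<!\S)\(ID: 10\)(?!\S)')
anchored = re.compile(r'^(\(ID: 10\))$')


def matching(pattern, values):
    """Return the values in which the pattern is found."""
    return [v for v in values if pattern.search(v)]


print(matching(any_bare_id, labels))  # bare IDs of any number
print(matching(bare_id_10, labels))   # only the bare "(ID: 10)" label
print(matching(lookaround, labels))   # also hits " Planning (ID: 10)"
print(matching(anchored, labels))     # empty: the leading space defeats ^
```

The lookaround version only checks that the parenthesised part is surrounded by whitespace, which is also true inside " Planning (ID: 10)", while the strictly anchored version fails as soon as the label has any padding. The same pattern strings can be passed to `str.contains(..., na=False)` as in the question.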

answered Feb 5, 2018 at 11:24

Wiktor Stribiżew


The fourth bird · Accepted Answer · 2018-02-05 11:17:37Z

1

If you want to get the whole line, you could update your regex to ^\s*\(ID: 10\).*$

To capture (ID: 10) in a group, you could try ^\s*(\(ID:\s*10\)).*$
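A small sketch of how the whole-line pattern `^\s*\(ID: 10\).*$` and the grouped pattern `^\s*(\(ID:\s*10\)).*$` might behave, assuming the data is available as multi-line text rather than as a DataFrame column. The two-line `text` sample and the variable names are made up for illustration.

```python
import re

# Assumed two-line sample in the same layout as the question's output.
text = "  (ID: 10)   247333605   0.0\n Planning (ID: 10)   108484297031   0.0"

# re.MULTILINE makes ^ and $ apply to each line, not only the whole string.
whole_line = re.compile(r'^\s*\(ID: 10\).*$', re.MULTILINE)
grouped = re.compile(r'^\s*(\(ID:\s*10\)).*$', re.MULTILINE)

lines = whole_line.findall(text)  # full matching lines
ids = grouped.findall(text)       # only the captured "(ID: 10)" part

print(lines)
print(ids)
```

With a single capturing group, `re.findall` returns the group contents instead of the full match, which is why the second call yields just the ID text. The "Planning" line is skipped in both cases because `^\s*` must be followed directly by the opening parenthesis.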

answered Feb 5, 2018 at 11:17

The fourth bird


1 Comment

kfairns Over a year ago

Alternatively, using a positive lookahead: (?=.*ID: 10)^(.*)$

Srdjan M. · Accepted Answer · 2018-02-05 11:46:35Z

Regex: ^\s*\(ID:\s10\)[^\r\n]+

Details:

^ Asserts position at the start of a line
\s Matches any whitespace character
* Matches between zero and unlimited times
[^...] Matches a single character not present in the list
+ Matches between one and unlimited times
\r\n Matches a carriage return and a line-feed (newline) character

Python code:

dataframe = """ (ID: 20)              247333605                      0.0  
  (ID: 50)               36738870                      0.0  
  (ID: 40)             4668036427                      0.0  
  (ID: 50)             1918647972                      0.0  
  (ID: 60)             4323165902                  44125.0  
  (ID: 10)              145512255                      0.0  
 Assigned (ID: 30)       42050340                      0.0  
 Assigned (ID: 40)   130880371376                      0.0  
 Assigning (ID: 30)    1095844753                      0.0  
 Cancelled (ID: 40)        937280                      0.0  
 Cancelled (ID: 80)   16857720813                      0.0  
 Planned (ID: 20)      9060392597                      0.0  
 Planning (ID: 10)   108484297031                      0.0  
 Processed (ID: 70)  133289880880                      0.0  
 Revoked (ID: 50)      2411903072                      0.0  
 Writing (ID: 50)    146408550024                      0.0  
 Written (ID: 60)    139458227923                1018230.0 """

import re

def ID(id, data):
    # Return each line that starts (after optional whitespace) with "(ID: <id>)"
    return re.findall(r'^\s*\(ID:\s%s\)[^\r\n]+' % id, data, re.MULTILINE)

ID(10, dataframe)
# ['  (ID: 10)              145512255                      0.0  ']

Joe · Accepted Answer · 2018-02-05 11:27:30Z

0

This should work:

input_data = input_data[(input_data['last_status'] == '(ID: 10)')]
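One caveat, sketched below under the assumption that the stored labels carry surrounding whitespace (as the printed index in the question suggests): a plain equality test only succeeds when the value is exactly '(ID: 10)'. The example uses plain Python strings to stay self-contained; the sample labels are made up. In pandas, the analogous step would be to call `.str.strip()` on the column before comparing.

```python
# Assumed labels with stray whitespace, mimicking the question's index.
labels = [" (ID: 10)", " Planning (ID: 10)", "(ID: 10) ", " (ID: 20)"]
target = "(ID: 10)"

# Plain equality misses every padded label.
raw_hits = [s for s in labels if s == target]

# Stripping first keeps only labels that are exactly the target once trimmed.
stripped_hits = [s for s in labels if s.strip() == target]

print(raw_hits)
print(stripped_hits)
```

If the equality filter returns nothing on the real data, checking `repr()` of a few labels is a quick way to see whether hidden whitespace is the cause.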

answered Feb 5, 2018 at 11:27

Joe

