1

This is my DataFrame:

                         entityId  delta_approved_clockout
 (ID: 10)              247333605                      0.0  
  (ID: 20)               36738870                      0.0  
  (ID: 40)             4668036427                      0.0  
  (ID: 50)             1918647972                      0.0  
  (ID: 60)             4323165902                  44125.0  
  (ID: 80)              145512255                      0.0  
 Assigned (ID: 30)       42050340                      0.0  
 Assigned (ID: 40)   130880371376                      0.0  
 Assigning (ID: 30)    1095844753                      0.0  
 Cancelled (ID: 40)        937280                      0.0  
 Cancelled (ID: 80)   16857720813                      0.0  
 Planned (ID: 20)      9060392597                      0.0  
 Planning (ID: 10)   108484297031                      0.0  
 Processed (ID: 70)  133289880880                      0.0  
 Revoked (ID: 50)      2411903072                      0.0  
 Writing (ID: 50)    146408550024                      0.0  
 Written (ID: 60)    139458227923                1018230.0  

I want to print only the rows that match '(ID: 10)' exactly. With the line below, the output also includes 'Planning (ID: 10)', which is not the exact match I need. These are the summed results:

                        entityId  delta_approved_clockout  
last_status                                                
  (ID: 10)             247333605                      0.0  
 Planning (ID: 10)  108484297031                      0.0  

print(input_data[input_data['last_status'].str.contains(r'(?<!\S)\(ID: 10\)(?!\S)', na=False)])

I have also tried regex patterns that returned 0 results, such as:

print(input_data[input_data['last_status'].str.contains(r' ^(\(ID: \d+\))$', na=False)])

print(input_data[input_data['last_status'].str.contains(r'^(\(ID: 10\))$', na=False)])

Perhaps I don't understand regex thoroughly. What would be the correct way to write this regex? Thanks in advance.

7
  • Try r'^\s*(\(ID:\s*\d+\))\s*$' Commented Feb 5, 2018 at 11:08
  • Do you want to do it only with regex? You can simply do it using DataFrame slicing like df = df[df['last_status'] == '(ID: 10)'] Commented Feb 5, 2018 at 11:08
  • Try regex101.com/r/4Cb8as/1. If you don't want capturing groups, remove the (). Commented Feb 5, 2018 at 11:11
  • df = df[df['last_status'] == '(ID: 10)'] includes Planning (ID: 10); I want solely the matches with (ID: 10). Commented Feb 5, 2018 at 11:18
  • @WiktorStribiżew and S.Kablar, your implementations seem to be working, thanks. I'm still struggling to understand the correct use of regex patterns, but I guess practice makes perfect. Ty Commented Feb 5, 2018 at 11:23

4 Answers

1

You may use

r'^\s*\(ID:\s*\d+\)\s*$'

See the regex demo.

The pattern matches:

  • ^ - start of string
  • \s* - zero or more (*) whitespace chars
  • \(ID: - a (ID: substring
  • \s* - zero or more (*) whitespace chars
  • \d+ - 1+ digits
  • \) - a ) char
  • \s* - zero or more (*) whitespace chars
  • $ - end of string.
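
A minimal sketch of how this pattern could be applied to the column from the question. The sample values and their surrounding whitespace are assumptions made for illustration, not the asker's real data. Note that \d+ accepts any ID, so the second pattern fixes the digits to 10 to keep only the (ID: 10) row.

```python
import pandas as pd

# Hypothetical sample data; the whitespace around the values is an assumption.
input_data = pd.DataFrame({
    "last_status": ["  (ID: 10)", " Planning (ID: 10)", "  (ID: 20)", None],
    "entityId": [247333605, 108484297031, 36738870, 0],
})

# Anchored pattern: only optional whitespace may surround "(ID: <digits>)".
any_id = r'^\s*\(ID:\s*\d+\)\s*$'
# Same pattern with the digits fixed to 10.
only_10 = r'^\s*\(ID:\s*10\)\s*$'

# na=False treats missing values as non-matches so the mask is purely boolean.
bare_ids = input_data[input_data["last_status"].str.contains(any_id, na=False)]
id_10 = input_data[input_data["last_status"].str.contains(only_10, na=False)]

print(bare_ids["last_status"].tolist())
print(id_10["last_status"].tolist())
```

Because the pattern is anchored with ^ and $, a value such as " Planning (ID: 10)" is rejected: the word before the parenthesis is not whitespace.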

1

If you want to get the whole line, you could update your regex to ^\s*\(ID: 10\).*$

To capture (ID: 10) in a group, you could try ^\s*(\(ID:\s*10\)).*$
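
As a rough illustration of the difference between the two patterns, the sketch below runs both over a few lines of text with Python's re module. The text is a made-up excerpt in the shape of the printed table, and re.MULTILINE is an assumption needed so that ^ and $ apply to each line.

```python
import re

# Hypothetical excerpt of the printed table, one row per line.
text = """  (ID: 10)             247333605   0.0
 Planning (ID: 10)  108484297031   0.0
  (ID: 20)              36738870   0.0"""

# No capturing group: findall returns the whole matched line.
whole_lines = re.findall(r'^\s*\(ID: 10\).*$', text, re.MULTILINE)

# One capturing group: findall returns only the captured "(ID: 10)".
captured = re.findall(r'^\s*(\(ID:\s*10\)).*$', text, re.MULTILINE)

print(whole_lines)
print(captured)
```

The "Planning (ID: 10)" line is skipped by both patterns because only whitespace is allowed between the start of the line and the opening parenthesis.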

1 Comment

Alternatively, using a positive lookahead: (?=.*ID: 10)^(.*)$
1

Regex: ^\s*\(ID:\s10\)[^\r\n]+

Details:

  • ^ Asserts position at the start of a line
  • \s Matches any whitespace character
  • * Matches the preceding token between zero and unlimited times
  • [^...] Matches a single character not present in the list
  • + Matches the preceding token between one and unlimited times
  • \r and \n Match a carriage return and a line-feed (newline) character, respectively

Python code:

dataframe = """ (ID: 20)              247333605                      0.0  
  (ID: 50)               36738870                      0.0  
  (ID: 40)             4668036427                      0.0  
  (ID: 50)             1918647972                      0.0  
  (ID: 60)             4323165902                  44125.0  
  (ID: 10)              145512255                      0.0  
 Assigned (ID: 30)       42050340                      0.0  
 Assigned (ID: 40)   130880371376                      0.0  
 Assigning (ID: 30)    1095844753                      0.0  
 Cancelled (ID: 40)        937280                      0.0  
 Cancelled (ID: 80)   16857720813                      0.0  
 Planned (ID: 20)      9060392597                      0.0  
 Planning (ID: 10)   108484297031                      0.0  
 Processed (ID: 70)  133289880880                      0.0  
 Revoked (ID: 50)      2411903072                      0.0  
 Writing (ID: 50)    146408550024                      0.0  
 Written (ID: 60)    139458227923                1018230.0 """

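import re

def ID(id, data):
    # Find lines that begin (after optional whitespace) with "(ID: <id>)"
    # and return each one together with the rest of its line.
    return re.findall(r'^\s*\(ID:\s%s\)[^\r\n]+' % id, data, re.MULTILINE)

print(ID(10, dataframe))
# ['  (ID: 10)              145512255                      0.0  ']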

0

This should work:

input_data = input_data[(input_data['last_status'] == '(ID: 10)')]
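
One caveat, sketched below: if the values carry leading or trailing whitespace, as the alignment of the printed table suggests they might, plain equality will not match them. Stripping the strings first is one possible workaround. The sample data is hypothetical and the padding is an assumption about the real column.

```python
import pandas as pd

# Hypothetical sample data; the padded values are an assumption.
input_data = pd.DataFrame({
    "last_status": ["  (ID: 10)", " Planning (ID: 10)", "(ID: 10)"],
    "entityId": [247333605, 108484297031, 1],
})

# Plain equality only matches values with no surrounding whitespace.
exact = input_data[input_data["last_status"] == "(ID: 10)"]

# Stripping first makes the comparison tolerant of padding.
stripped = input_data[input_data["last_status"].str.strip() == "(ID: 10)"]

print(len(exact), len(stripped))
```

Neither comparison selects "Planning (ID: 10)", since the whole string has to equal "(ID: 10)".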
