Python Regular Expression - Pattern Matching

Question

This is my first experience with pattern matching using regular expressions so any help is appreciated.

I am trying to search a string for the following substrings:

"(TPU 1-999)
http://somewebaddress.com"

I want to keep TPU, 1-999 and the link as separate substrings.

This is the pattern I am using:

^\s{3}\(([AEINPRSTUW]{3})\s(\d{1,3}.\d{2,5})\)$^\s{3}(http+\s{1,100})$

I'll break it down to explain my reasoning:

^\s{3} - beginning of string (or line in this case), followed by 3 spaces

\( - left parenthesis

([AEINPRSTUW]{3}) - 3 instances of any of the letters in brackets, TPU being one example

\s(\d{1,3}.\d{2,5}) - a space and then 1-3 numeric digits, separated by any char from 2-5 more numeric digits

\)$ - right parenthesis, end of line

^\s{3} - beginning of next line followed by three spaces

(http+\s{1,100})$ - the characters "http" followed by anywhere between 1 and 100 non whitespace characters, and the end of the line.

This pattern doesn't work right now but am I headed in the right direction?

Are those " actually part of your string? And where are those three spaces you are trying to match?
– Martin Ender, Commented Oct 25, 2012 at 15:48

Martin Ender · Accepted Answer · 2012-10-25 19:24:48Z

4

$^ cannot work. $ is the end of a line (before the line break) and ^ is the beginning of a line (after the line break), assuming multiline mode (re.MULTILINE in Python). But the line break is a character (or two), while $ and ^ are zero-width assertions that do not advance the position of the regex engine. So $ and ^ try to match at the same position, which can only ever happen if they are the end and beginning of an empty line - and even then putting them in this order would be greatly misleading. If you want to make sure that there is exactly one line break between them, try this:

^\s{3}\(([AEINPRSTUW]{3})\s(\d{1,3}.\d{2,5})\)$(\r\n?|\n)^\s{3}(http+\S{1,100})$

However, as ridgerunner pointed out in the comments, the following \s{3} could match (up to 3) more line breaks, since they are whitespace as well.
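A minimal sketch of the two points above, assuming Python's re module with re.MULTILINE (the flag is not shown in the post) and a sample text whose lines are indented by three spaces. The patterns are cut down to the junction between the two lines so that only the anchor behaviour is being tested:

```python
import re

# Sample text is an assumption: the question's two lines, each indented by three spaces.
text = "   (TPU 1-999)\n   http://somewebaddress.com"

# Reduced patterns that isolate the junction between the two lines.
adjacent_anchors = re.compile(r"\)$^\s{3}http", re.MULTILINE)
explicit_break = re.compile(r"\)$(\r\n?|\n)^\s{3}http", re.MULTILINE)

# $ and ^ are zero-width, so nothing consumes the "\n" between the lines.
no_match = adjacent_anchors.search(text)

# Consuming the line break explicitly lets ^ match at the start of the next line.
match = explicit_break.search(text)

# \s{3} also accepts line breaks: here three extra line breaks stand in for the indent.
blank_lines = "   (TPU 1-999)\n\n\n\nhttp://somewebaddress.com"
loose_match = explicit_break.search(blank_lines)

print(no_match)                  # None
print(match.group(1) == "\n")    # True
print(loose_match is not None)   # True
```

If the indent must be spaces only, a character class such as [ ] or [ \t] in place of \s would avoid the last case.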

Also note that . as a separator of your numbers might not be the best idea. At least, use a non-digit character:

^\s{3}\(([AEINPRSTUW]{3})\s(\d{1,3}\D\d{2,5})\)$(\r\n?|\n)^\s{3}(http+\S{1,100})$

Note also that I have changed your last \s to \S (because \s is whitespace, \S is non-whitespace).
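To see why both changes matter, here is a small check in Python. The input "123456" is a made-up example of a number with no separator at all:

```python
import re

any_separator = re.compile(r"\d{1,3}.\d{2,5}")         # "." accepts any character, digits included
non_digit_separator = re.compile(r"\d{1,3}\D\d{2,5}")  # "\D" requires a non-digit

# "123456" is a hypothetical input with no separator.
print(bool(any_separator.fullmatch("1-999")))          # True
print(bool(any_separator.fullmatch("123456")))         # True: "." consumed the digit "4"
print(bool(non_digit_separator.fullmatch("1-999")))    # True
print(bool(non_digit_separator.fullmatch("123456")))   # False

# \s is whitespace, \S is non-whitespace, so only the \S version can consume a URL.
url = "http://somewebaddress.com"
print(bool(re.fullmatch(r"http+\s{1,100}", url)))      # False
print(bool(re.fullmatch(r"http+\S{1,100}", url)))      # True
```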

Also note that the string you have shown us does not contain those three whitespaces you are trying to match. So making them optional (as CaptainMurphy suggested) might be helpful, too:

^\s*\(([AEINPRSTUW]{3})\s(\d{1,3}\D\d{2,5})\)$(\r\n?|\n)^\s*(http+\S{1,100})$

And since we are already matching that line break, we could also remove the anchors around it completely; they do not really help any more:

^\s*\(([AEINPRSTUW]{3})\s(\d{1,3}\D\d{2,5})\)(\r\n?|\n)\s*(http+\S{1,100})$
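As a quick check, this last pattern can be run against the string from the question. This is only a sketch: the re.MULTILINE flag and the variable names are my assumptions, not part of the post.

```python
import re

# Final pattern from the answer; re.MULTILINE is an assumption so that the
# outer ^ and $ can match at line boundaries inside a longer text.
pattern = re.compile(
    r"^\s*\(([AEINPRSTUW]{3})\s(\d{1,3}\D\d{2,5})\)(\r\n?|\n)\s*(http+\S{1,100})$",
    re.MULTILINE,
)

text = "(TPU 1-999)\nhttp://somewebaddress.com"
match = pattern.search(text)

# The line break is itself a capturing group, so it shows up as group 3.
code, number_range, line_break, link = match.groups()
print(code, number_range, link)  # TPU 1-999 http://somewebaddress.com
```

If the captured line break is unwanted, writing the group as (?:\r\n?|\n) keeps the match the same and leaves only the three substrings in groups().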

edited Oct 25, 2012 at 19:24

answered Oct 25, 2012 at 15:44

Martin Ender


2 Comments

ridgerunner Over a year ago

+1 but a couple points... First, technically, $^, by itself, does match an empty line (as does ^$) - the order of multiple adjacent zero-width assertions does not matter (although you are correct that in the context of this regex it will never match). Second, the (\r\n?|\n)\s* does not guarantee only one new line as the \s* matches both carriage returns and linefeeds. Otherwise, nice explanation.

Martin Ender Over a year ago

@ridgerunner, you are absolutely right of course! I shall add that for clarification
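ridgerunner's first point can be checked directly in Python. A minimal sketch, with the test strings chosen only for illustration:

```python
import re

# With no flags, ^ and $ both match at position 0 of an empty string.
empty_match = re.search(r"$^", "")
print(empty_match is not None)   # True

# In multiline mode, the only position where both assertions hold is the
# empty middle line of "a\n\nb" (index 2, between the two line breaks).
middle = re.search(r"$^", "a\n\nb", re.MULTILINE)
print(middle.start())            # 2
```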

engineerC · Accepted Answer · 2012-10-25 15:54:33Z

1

I think you're being overly specific with things like your uppercase letters and specific amount of whitespace (your example string doesn't even have whitespace at the beginning). I mostly just stick to * and + unless I'm looking for something very specific. Also, in Python, unless you pass re.MULTILINE, $ matches only at the end of the entire string (or just before a trailing newline), not at the end of each line. A newline or CRLF is just whitespace, so \s+ will match it. Don't use \s or even [^\s] for non-whitespace; use \S.

import re

ss = "(TPU 1-999)\nhttp://something.com"
# Raw string so the backslashes reach the regex engine unchanged
rr = r"^\s*\(([A-Z]+)\s+(\d+.\d+)\)\s+(http\S{1,100})$"
re.match(rr, ss).groups()
# ('TPU', '1-999', 'http://something.com')
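The behaviour of $ in Python depends on the re.MULTILINE flag, which is easy to trip over. A short sketch using the same sample string:

```python
import re

ss = "(TPU 1-999)\nhttp://something.com"

# Without flags, $ only matches at the very end of the string (or just before
# a final newline), so it cannot match right after ")" in the middle.
without_flag = re.search(r"\)$", ss)
print(without_flag)            # None

# With re.MULTILINE, $ also matches before each line break.
with_flag = re.search(r"\)$", ss, re.MULTILINE)
print(with_flag.group())       # )
```

That is why the pattern above bridges the two lines with \s+ instead of relying on $ and ^.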

edited Oct 25, 2012 at 15:54

answered Oct 25, 2012 at 15:49

engineerC


1 Comment

TheMightyAlpaca Over a year ago

Thanks very much, really helped quite a bit.

TheMightyAlpaca · Accepted Answer · 2012-10-25 18:27:33Z

I was really over-thinking this. Here is the solution I came up with based upon the answers I was provided:

Here is an example of the string I am parsing (pulled from the content of an email message):

'The writeboard named "10/26 newsletters (Pat)" has been created:\r\n\r\n (TPU 1000+)\r\n\r\n http://www.techproductupdate.com/resources/2313/splunk-app-for-vmware-delivers-insight-into-the-cloud\r\n\r\n (TIN 250+)\r\n\r\n http://www.techproductupdate.com/resources/2369/securing-mysql-databases\r\n\r\n (TPU 500+)\r\n\r\n http://www.techproductupdate.com/resources/2333/designing-a-data-protection-strategy-with-hp-lefthand-hp-storeonce-and-hp-tape\r\n\r\n- - -\r\nYou can visit the writeboard at:\r\n http://somewebsite.com\r\n'

So first I just use re.findall to locate everything between parentheses using the pattern '\((?P<list>[A-Z]*)\s(?P<segments>.+)\)'

Then I use re.findall to locate all of the URLs using the pattern 'http\S*' - this returns all the results I want with the extra 'http://somewebsite.com' at the end of the list.

Then I just zip these lists together, excluding the last element of the URL list, and I essentially get the results I was looking for in the first place.
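Roughly, the steps above could look like this. It is a sketch rather than the exact code used: it assumes the parenthesis pattern is meant to start with \( and end with \), and the variable names are made up for illustration.

```python
import re

# Sample message copied from the post above (long URLs split across literals).
text = (
    'The writeboard named "10/26 newsletters (Pat)" has been created:\r\n\r\n'
    ' (TPU 1000+)\r\n\r\n'
    ' http://www.techproductupdate.com/resources/2313/'
    'splunk-app-for-vmware-delivers-insight-into-the-cloud\r\n\r\n'
    ' (TIN 250+)\r\n\r\n'
    ' http://www.techproductupdate.com/resources/2369/securing-mysql-databases\r\n\r\n'
    ' (TPU 500+)\r\n\r\n'
    ' http://www.techproductupdate.com/resources/2333/'
    'designing-a-data-protection-strategy-with-hp-lefthand-hp-storeonce-and-hp-tape\r\n\r\n'
    '- - -\r\nYou can visit the writeboard at:\r\n http://somewebsite.com\r\n'
)

# Assumed pattern: capital letters, one whitespace, then the rest up to ")".
# "(Pat)" is skipped because it has no whitespace after the capitals.
tags = re.findall(r"\((?P<list>[A-Z]*)\s(?P<segments>.+)\)", text)

# Every URL, including the trailing writeboard link that is not wanted.
urls = re.findall(r"http\S*", text)

# Pair each tag with its URL by position, dropping the last URL.
results = [(name, segments, url) for (name, segments), url in zip(tags, urls[:-1])]

for row in results:
    print(row)
```

zip already stops at the shorter list, so the slice mainly documents the intent. Pairing by position only works as long as every tag is followed by exactly one URL.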
