This is my first experience with pattern matching using regular expressions so any help is appreciated.

I am trying to search a string for the following substrings:

"(TPU 1-999)
http://somewebaddress.com"

I want to keep TPU, 1-999 and the link as separate substrings.

This is the pattern I am using:

^\s{3}\(([AEINPRSTUW]{3})\s(\d{1,3}.\d{2,5})\)$^\s{3}(http+\s{1,100})$

I'll break it down to explain my reasoning

^\s{3} - beginning of string (or line in this case), followed by 3 spaces

\( - left parenthesis

([AEINPRSTUW]{3}) - 3 instances of any of the letters in brackets, TPU being one example

\s(\d{1,3}.\d{2,5}) - a space and then 1-3 numeric digits, then any single character, then 2-5 more numeric digits

\)$ - right parenthesis, end of line

^\s{3} - beginning of next line followed by three spaces

(http+\s{1,100})$ - the characters "http" followed by anywhere between 1 and 100 non whitespace characters, and the end of the line.

This pattern doesn't work right now but am I headed in the right direction?

  • Are those " actually part of your string? And where are those three spaces you are trying to match? Commented Oct 25, 2012 at 15:48

3 Answers

$^ cannot work. In multiline mode, $ matches at the end of a line (before the line break) and ^ matches at the beginning of a line (after the line break). The line break itself is a character (or two), but $ and ^ are zero-width and do not advance the position of the regex engine. So $ and ^ try to match at the same position, which can only ever happen if they are the end and beginning of an empty line - and even then putting them in this order would be greatly misleading. If you want to make sure that there is exactly one line break between them, try this:

^\s{3}\(([AEINPRSTUW]{3})\s(\d{1,3}.\d{2,5})\)$(\r\n?|\n)^\s{3}(http+\S{1,100})$

However, as ridgerunner pointed out in the comments, the following \s{3} could match up to three more line breaks, since they are whitespace as well.
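
To see both effects concretely, here is a minimal sketch. It assumes Python's re module with re.MULTILINE (the question does not name a regex flavor), and the sample strings are made up for illustration, including the three leading spaces the pattern expects. Only the fragment around the line break is tested, not the full pattern.

```python
import re

# Hypothetical sample with the three leading spaces the pattern expects.
text = "   (TPU 1-999)\n   http://somewebaddress.com"

# `$^` with nothing between them: both anchors test the same position.
broken = re.search(r"\)$^\s{3}http", text, re.MULTILINE)

# Consuming the line break explicitly lets `^` succeed on the next line.
fixed = re.search(r"\)$(\r\n?|\n)^\s{3}http", text, re.MULTILINE)

# \s also matches line breaks, so \s{3} can absorb three of them.
blank_lines = "(TPU 1-999)\n\n\n\nhttp://somewebaddress.com"
absorbed = re.search(r"\)$(\r\n?|\n)^\s{3}http", blank_lines, re.MULTILINE)

print(broken)                # None
print(fixed is not None)     # True
print(absorbed is not None)  # True
```

The last case has no leading spaces at all, yet it still matches because the three "spaces" are really blank lines. If that matters, a class such as [ \t] instead of \s would restrict the match to spaces and tabs.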

Also note that . as a separator of your numbers might not be the best idea. At least, use a non-digit character:

^\s{3}\(([AEINPRSTUW]{3})\s(\d{1,3}\D\d{2,5})\)$(\r\n?|\n)^\s{3}(http+\S{1,100})$

Note also that I have changed your last \s to \S (because \s is whitespace, \S is non-whitespace).

Also note that the string you have shown us does not contain the three whitespace characters you are trying to match. So making them optional (as CaptainMurphy suggested) might be helpful, too:

^\s*\(([AEINPRSTUW]{3})\s(\d{1,3}\D\d{2,5})\)$(\r\n?|\n)^\s*(http+\S{1,100})$

And since we are already matching that line break explicitly, we could also remove the anchors around it completely, since they no longer add anything:

^\s*\(([AEINPRSTUW]{3})\s(\d{1,3}\D\d{2,5})\)(\r\n?|\n)\s*(http+\S{1,100})$
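
As a quick check, the final pattern can be run against the sample from the question. This is only a sketch, assuming Python's re module. The line-break group is made non-capturing here (my change, not part of the pattern above) so that groups() returns just the three wanted pieces, and the CRLF variant of the sample is my own addition.

```python
import re

# Final pattern, with the line-break group made non-capturing
# (an illustrative tweak) so only three groups are returned.
pattern = re.compile(
    r"^\s*\(([AEINPRSTUW]{3})\s(\d{1,3}\D\d{2,5})\)(?:\r\n?|\n)\s*(http+\S{1,100})$",
    re.MULTILINE,
)

samples = [
    "(TPU 1-999)\nhttp://somewebaddress.com",    # LF line ending
    "(TPU 1-999)\r\nhttp://somewebaddress.com",  # CRLF line ending
]

results = [pattern.search(sample).groups() for sample in samples]
for groups in results:
    print(groups)
# ('TPU', '1-999', 'http://somewebaddress.com') for both samples
```

With the original capturing group (\r\n?|\n), the line break itself would show up as an extra third group.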

2 Comments

+1 but a couple points... First, technically, $^, by itself, does match an empty line (as does ^$) - the order of multiple adjacent zero-width assertions does not matter (although you are correct that in the context of this regex it will never match). Second, the (\r\n?|\n)\s* does not guarantee only one new line as the \s* matches both carriage returns and linefeeds. Otherwise, nice explanation.
@ridgerunner, you are absolutely right of course! I shall add that for clarification.

I think you're being overly specific with things like your uppercase letters and the exact amount of whitespace (your example string doesn't even have whitespace at the beginning). I mostly just stick to * and + unless I'm looking for something very specific. Note that in Python, unless you pass re.MULTILINE, $ matches at the end of the entire string (or just before a trailing newline), not at the end of each line. A newline or CRLF is just whitespace, so \s+ can match it. Don't use \s or even [^\s] for non-whitespace, use \S.

import re

ss = "(TPU 1-999)\nhttp://something.com"
# Raw string so the backslashes reach the regex engine unchanged.
rr = r"^\s*\(([A-Z]+)\s+(\d+.\d+)\)\s+(http\S{1,100})$"
print(re.match(rr, ss).groups())
# ('TPU', '1-999', 'http://something.com')
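
The behavior of $ with and without re.MULTILINE is easy to verify. A small sketch, using a made-up two-line string:

```python
import re

text = "first\nsecond"

# Default mode: `$` matches only at the end of the string
# (or just before a single trailing newline).
default_mode = re.findall(r"\w+$", text)

# re.MULTILINE: `$` also matches just before every newline.
multiline_mode = re.findall(r"\w+$", text, re.MULTILINE)

print(default_mode)    # ['second']
print(multiline_mode)  # ['first', 'second']
```

This is why the pattern above works without any flag: the line break is consumed by \s+, and the only $ sits at the very end of the string.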

1 Comment

Thanks very much, really helped quite a bit.

I was really over-thinking this. Here is the solution I came up with based upon the answers I was provided:

Here is an example of the string I am parsing (pulled from the content of an email message):

'The writeboard named "10/26 newsletters (Pat)" has been created:\r\n\r\n (TPU 1000+)\r\n\r\n http://www.techproductupdate.com/resources/2313/splunk-app-for-vmware-delivers-insight-into-the-cloud\r\n\r\n (TIN 250+)\r\n\r\n http://www.techproductupdate.com/resources/2369/securing-mysql-databases\r\n\r\n (TPU 500+)\r\n\r\n http://www.techproductupdate.com/resources/2333/designing-a-data-protection-strategy-with-hp-lefthand-hp-storeonce-and-hp-tape\r\n\r\n- - -\r\nYou can visit the writeboard at:\r\n http://somewebsite.com\r\n'

So first I just use re.findall to locate everything between parentheses using the pattern '\((?P<list>[A-Z]*)\s(?P<segments>.+)\)'

Then I use re.findall to locate all of the URLs using the pattern 'http\S*' - this returns all the results I want with the extra 'http://somewebsite.com' at the end of the list.

Then I just zip these lists together, excluding the last element of the last list and I essentially get the results I was looking for in the first place.
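
The three steps can be sketched roughly as follows. The two patterns are the ones quoted above. The message is a shortened version of the example string (two entries instead of three), and the variable names are only illustrative.

```python
import re

# Shortened version of the email body shown above (two entries).
message = (
    'The writeboard named "10/26 newsletters (Pat)" has been created:\r\n\r\n'
    ' (TPU 1000+)\r\n\r\n'
    ' http://www.techproductupdate.com/resources/2313/'
    'splunk-app-for-vmware-delivers-insight-into-the-cloud\r\n\r\n'
    ' (TIN 250+)\r\n\r\n'
    ' http://www.techproductupdate.com/resources/2369/securing-mysql-databases\r\n\r\n'
    '- - -\r\nYou can visit the writeboard at:\r\n http://somewebsite.com\r\n'
)

# Step 1: everything between parentheses, as (list, segments) tuples.
labels = re.findall(r'\((?P<list>[A-Z]*)\s(?P<segments>.+)\)', message)

# Step 2: every URL, including the unwanted writeboard link at the end.
urls = re.findall(r'http\S*', message)

# Step 3: pair them up, dropping the last URL.
pairs = [(name, segments, url)
         for (name, segments), url in zip(labels, urls[:-1])]

for pair in pairs:
    print(pair)
```

Note that "(Pat)" in the first line is skipped because the pattern requires whitespace right after the uppercase letters. Since zip stops at the shorter list, slicing off the last URL is not strictly required, but it makes the intent explicit.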
