I'm trying to create a regular expression that extracts part of the following text:

amd64 build of software 1:0.98.10-0.2svn20090909 in archive

What I want to extract is:

software 1:0.98.10-0.2svn20090909

How can I do this? This is what I have so far:

import re

p = re.compile('([a-zA-Z0-9\-\+\.]+)\ ([0-9\:\.\-]+)')
iterator = p.finditer("amd64 build of software 1:0.98.10-0.2svn20090909 in archive")
for match in iterator:
    print(match.group())

With the result:

software 1:0.98.10-0.2

(svn20090909 is missing)

Thanks a lot.

  • Can you elaborate on the precise thing you want to capture? What should be captured and what should not, and which part changes? Commented Dec 13, 2009 at 18:29
  • Should you not be using raw strings or doubling up on backslashes? Commented Dec 13, 2009 at 18:49
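
To illustrate the raw-string point raised in the comment above, here is a minimal sketch. The variable names are made up for the example. It shows that a raw string and a doubled-backslash string hand the same pattern to the regex engine, and where the difference actually bites. The pattern in the question happens to work because sequences such as \- and \. are not recognized string escapes and are kept literally, but recent Python 3 versions emit a warning for them, so a raw string is the safer spelling.

```python
import re

# Both spellings hand the same characters to the regex engine.
raw_pattern = r'([0-9:.\-]+)'
doubled_pattern = '([0-9:.\\-]+)'
print(raw_pattern == doubled_pattern)  # True

# Where it matters: '\b' is a single backspace character in a normal
# string, but backslash + 'b' (a word boundary for re) in a raw string.
print(len('\b'), len(r'\b'))  # 1 2
print(re.search(r'\bsoftware\b', 'build of software 1:0.98') is not None)  # True
```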

3 Answers

This will work:

import re

p = re.compile(r'([a-zA-Z0-9\-\+\.]+)\ ([0-9][0-9a-zA-Z\:\.\-]+)')
iterator = p.finditer("amd64 build of dvdrip software 1:0.98.10-0.2svn20090909 in archive")
for match in iterator:
    print(match.group())
# Prints: software 1:0.98.10-0.2svn20090909

That works by allowing the second captured group to contain letters while still requiring that it start with a digit.

Without seeing all the other strings it needs to match, I can't be sure whether that's good enough.
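
As a rough sketch of how one might probe that uncertainty, the pattern above can be wrapped in a small helper and tried on a few inputs. The helper name extract_name_and_version and the last two test strings are hypothetical, invented here for illustration; only the first string comes from the question.

```python
import re

# Pattern from the answer above: a name token, a space, then a version
# token that must start with a digit but may contain letters afterwards.
PATTERN = re.compile(r'([a-zA-Z0-9\-\+\.]+)\ ([0-9][0-9a-zA-Z\:\.\-]+)')

def extract_name_and_version(text):
    """Return (name, version) for the first match, or None if nothing matches."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(extract_name_and_version(
    "amd64 build of software 1:0.98.10-0.2svn20090909 in archive"))
# ('software', '1:0.98.10-0.2svn20090909')

# Hypothetical input with no digits after a space: nothing to find.
print(extract_name_and_version("no version here"))
# None

# Hypothetical input showing a limitation: any word followed by a
# multi-digit number is matched first.
print(extract_name_and_version("build 42 of software 1:0.98.10-0.2svn20090909"))
# ('build', '42')
```

The last case is why the remaining input lines matter: if a number can appear earlier in the line, the pattern would need an extra anchor, such as requiring a colon in the version.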

If your lines are consistent, that is, if each entry is on one line and the word you want always comes right before the version part (the 1:0.98 ... part) at the same position in the line, you don't need a regexp. Try this:

>>> s = 'amd64 build of software 1:0.98.10-0.2svn20090909 in archive'
>>> match = [s.split()[3], s.split()[4]]
>>> print(match)
['software', '1:0.98.10-0.2svn20090909']
>>> # alternatively
>>> match = s.split()[3:5] # for same result

This first splits the line s on whitespace (using the string method split()) and then selects the fourth and fifth elements of the resulting list; both are stored in the variable match.

Again, this only works if you have one entry per line and if the 'software' part and the 1:0.98.10-0.2svn20090909 part are always the fourth and fifth words on it.
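
The caveat can be seen with the longer string used in the other answer. The sketch below also shows one possible way to stay with split() while dropping the fixed indices: scan for the first word that starts with a digit and return it with the word before it. The helper name name_and_version and the "starts with a digit" rule are assumptions made for this example, not something the question specifies.

```python
def name_and_version(line):
    """Return [name, version] using split() only, or None if not found."""
    words = line.split()
    # Start at index 1 so there is always a preceding word to return.
    for i in range(1, len(words)):
        if words[i][0].isdigit():
            return words[i - 1:i + 1]
    return None

s = 'amd64 build of software 1:0.98.10-0.2svn20090909 in archive'
print(s.split()[3:5])       # ['software', '1:0.98.10-0.2svn20090909']
print(name_and_version(s))  # ['software', '1:0.98.10-0.2svn20090909']

# With one extra word, the fixed indices shift but the scan still works.
t = 'amd64 build of dvdrip software 1:0.98.10-0.2svn20090909 in archive'
print(t.split()[3:5])       # ['dvdrip', 'software']
print(name_and_version(t))  # ['software', '1:0.98.10-0.2svn20090909']
```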

I often avoid regexps when splitting the line into a list will do. If the parsing becomes a nightmare, I use pyparsing.

Don't use a capturing group if you want everything in one piece.
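
A short sketch of what the groups actually change, reusing the pattern from the first answer (the variable names are illustrative). With Python's re module, group() with no argument returns the whole match whether or not the pattern contains capturing groups; the groups only add access to the individual pieces.

```python
import re

text = "amd64 build of software 1:0.98.10-0.2svn20090909 in archive"

# With capturing groups: whole match plus the individual pieces.
with_groups = re.search(
    r'([a-zA-Z0-9\-\+\.]+)\ ([0-9][0-9a-zA-Z\:\.\-]+)', text)
print(with_groups.group())    # software 1:0.98.10-0.2svn20090909
print(with_groups.group(1))   # software
print(with_groups.groups())   # ('software', '1:0.98.10-0.2svn20090909')

# Same pattern without the parentheses: only the whole match is available.
without_groups = re.search(
    r'[a-zA-Z0-9\-\+\.]+\ [0-9][0-9a-zA-Z\:\.\-]+', text)
print(without_groups.group())   # software 1:0.98.10-0.2svn20090909
print(without_groups.groups())  # ()
```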

1 Comment

I want the capturing group :)
