6

I'm new to Python and still learning about regular expressions, so this question may sound trivial to regex experts, but here you go. I suppose my question is a generalization of this question about finding a string between two strings. I wonder: what if this pattern (initial_substring + substring_to_find + end_substring) is repeated many times in a long string? For example:

import re

test = 'someth1 var="this" someth2 var="that" '
result = re.search('var=(.*) ', test)
print(result.group(1))
# prints: "this" someth2 var="that"

Instead, I'd like to get a list like ["this","that"]. How can I do it?

4
  • does it have to be regex? Commented Feb 17, 2017 at 16:17
  • 1
    That was the idea, but if there's a more sensible way to do it, please do! Commented Feb 17, 2017 at 16:21
  • @Nonancourt - there isn't, in almost any case regex will be the fastest and most 'readable' way to do it. Sure, you can do manual string search but you'd need to have a really good reason to go down that path. Commented Feb 17, 2017 at 16:23
  • @Ev.Kounis how were you thinking without re? i'm curious Commented Feb 17, 2017 at 16:24

2 Answers

10

Use re.findall():

result = re.findall(r'var="(.*?)"', test)
print(result)  # ['this', 'that']

If a quoted value in the test string can span multiple lines, pass the re.DOTALL flag so that the dot also matches newlines:

re.findall(r'var="(.*?)"', test, re.DOTALL)
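
To make the difference concrete, here is a minimal sketch comparing the greedy and non-greedy quantifiers, with and without re.DOTALL. The multi-line sample string is the one from the comment below; the variable names are only for illustration.

```python
import re

test = 'someth1 var="this" someth2 var="that" '

# Greedy: .* runs on to the last closing quote, so only one match is found.
greedy = re.findall(r'var="(.*)"', test)
print(greedy)  # ['this" someth2 var="that']

# Non-greedy: .*? stops at the first closing quote after each var=".
lazy = re.findall(r'var="(.*?)"', test)
print(lazy)  # ['this', 'that']

# A value containing a newline is only matched when the dot may match \n.
multiline = 'someth1 var="this \n then" someth2 var="that" '
without_flag = re.findall(r'var="(.*?)"', multiline)
with_flag = re.findall(r'var="(.*?)"', multiline, re.DOTALL)
print(without_flag)  # ['that']
print(with_flag)     # ['this \n then', 'that']
```

Without the flag, the first value is silently skipped rather than raising an error, which is easy to miss.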

2 Comments

This solution does not work if the string contains \n. How would this answer be adapted to support: test = 'someth1 var="this \n then" someth2 var="that" '
@AlexFine if you need it to work over multiple lines, you need to set the re.DOTALL flag when doing your matching so that a dot also matches newlines. You can pass the flag explicitly as: re.findall(r'var="(.*?)"', test, re.DOTALL), or use the inline syntax within the pattern: re.findall(r'(?s)var="(.*?)"', test).
1

The problem with your current regex is that the capture group (.*) is greedy: after the first var= in your string, it consumes as much as it can, up to the last space that still lets the pattern match.

If you instead restrict the capture group to a character class, here word characters and whitespace, and use re.findall() instead of re.search(), you will not have the same issue. That changes the line to:

result = re.findall(r'var="([\w\s]+)"', test)
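
As the comments below point out, a character class like this still misses values that contain other characters. One common alternative, which is not part of the original answer and is shown here only as a sketch, is a negated character class that accepts anything except the closing quote. The sample string is made up for illustration.

```python
import re

# Hypothetical input with a space, a hyphen, and an empty value.
test = 'someth1 var="foo bar" someth2 var="a-b" someth3 var=""'

# Word characters and whitespace only: the hyphenated and empty values are skipped.
word_space = re.findall(r'var="([\w\s]+)"', test)
print(word_space)  # ['foo bar']

# Anything except a double quote: every quoted value is captured.
not_quote = re.findall(r'var="([^"]*)"', test)
print(not_quote)  # ['foo bar', 'a-b', '']
```

Because [^"] also matches newline characters, this variant needs no re.DOTALL flag for multi-line values. It does assume the values never contain an escaped quote.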

6 Comments

That will fail if the input string contains var="foo bar" (or any non-word character for that matter) under the assumption that he wants to extract everything between the quote marks.
@zwer yes, that may be true, but if the words within the quotes are being used as variables as per the var= prefix (an assumption that is probably not best to be made without OP specifying), the contents will never have a space
\w will capture numbers as well, and 3this is not a valid variable name either.
Thanks for the specification, @zwer. Yes, in fact, I'd be interested in the general case when it could be var="foo bar".
@Nonancourt ok, I'll make the revision now.