Python: Find a string between two strings, repeatedly

Question

I'm new to Python and still learning about regular expressions, so this question may sound trivial to regex experts, but here it goes. My question is a generalization of the question about finding a string between two strings: what if the pattern (initial_substring + substring_to_find + end_substring) is repeated many times in a long string? For example:

import re

test = 'someth1 var="this" someth2 var="that" '
result = re.search('var=(.*) ', test)
print(result.group(1))
# prints: "this" someth2 var="that"

Instead, I'd like to get a list like ["this","that"]. How can I do it?

That was the idea, but if there's a more sensible way to do it, please share it!
– Nonancourt, Commented Feb 17, 2017 at 16:21
@Nonancourt - there isn't; in almost any case regex will be the fastest and most 'readable' way to do it. Sure, you can do a manual string search, but you'd need a really good reason to go down that path.
– zwer, Commented Feb 17, 2017 at 16:23
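
For comparison, the manual string search mentioned in the comment above could look roughly like the sketch below. It is only an illustration: find_between is a hypothetical helper name, not something from the thread, and it assumes the delimiters never appear inside the value.

```python
def find_between(text, start, end):
    """Collect every substring found between start and end (hypothetical helper)."""
    found = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break  # no further opening delimiter
        begin += len(start)
        stop = text.find(end, begin)
        if stop == -1:
            break  # opening delimiter without a closing one
        found.append(text[begin:stop])
        pos = stop + len(end)
    return found


test = 'someth1 var="this" someth2 var="that" '
print(find_between(test, 'var="', '"'))  # ['this', 'that']
```

It works, but it takes a loop and explicit index bookkeeping to do what a single re.findall() call does.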

Alex Fine · Accepted Answer · 2021-01-10 14:28:22Z

10

Use re.findall():

result = re.findall(r'var="(.*?)"', test)
print(result)  # ['this', 'that']

If a quoted value can itself span multiple lines, pass the re.DOTALL flag so that the dot also matches newline characters.

re.findall(r'var="(.*?)"', test, re.DOTALL)
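
To see why the ? after .* matters, here is a small sketch that runs the greedy and the lazy form of the pattern against the string from the question. The variable names greedy and lazy are illustrative only.

```python
import re

test = 'someth1 var="this" someth2 var="that" '

# Greedy: .* extends to the last closing quote in the string.
greedy = re.findall(r'var="(.*)"', test)

# Lazy: .*? stops at the first closing quote after each var=".
lazy = re.findall(r'var="(.*?)"', test)

print(greedy)  # ['this" someth2 var="that']
print(lazy)    # ['this', 'that']
```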

edited Jan 10, 2021 at 14:28

Alex Fine


answered Feb 17, 2017 at 16:15

zwer


2 Comments

Alex Fine Over a year ago

This solution does not work if the string contains \n. How would this answer be adapted to support: test = 'someth1 var="this \n then" someth2 var="that" '

zwer Over a year ago

@AlexFine if you need it to work over multiple lines, you need to set the re.DOTALL flag when doing your matching so that a dot matches new lines. You can pass the flag explicitly as: re.findall(r'var="(.*?)"', test, re.DOTALL), or use in-line syntax within the pattern: re.findall(r'(?s)var="(.*?)"', test).
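
A short sketch of the two forms described in this comment, using the multi-line string from the comment above it. The variable names are illustrative; the point is that the explicit flag and the inline (?s) prefix give the same result, while the pattern without either one skips the value that contains a newline.

```python
import re

test = 'someth1 var="this \n then" someth2 var="that" '
pattern = r'var="(.*?)"'

# Without DOTALL the dot cannot cross the newline, so the first value is missed.
no_flag = re.findall(pattern, test)

# Explicit flag and inline (?s) syntax are equivalent.
with_flag = re.findall(pattern, test, re.DOTALL)
inline = re.findall(r'(?s)' + pattern, test)

print(no_flag)    # ['that']
print(with_flag)  # ['this \n then', 'that']
print(inline)     # ['this \n then', 'that']
```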

m_callens · Accepted Answer · 2017-02-17 16:29:22Z

1

The problem with your current regex is that the capture group (.*) is greedy: it matches as much as it can. After the first var= in your string, the group captures everything up to the last space in the string.

If you instead restrict the capture group to word and whitespace characters, as in var="([\w\s]+)", you will not have the same issue. That means changing that line of Python to:

result = re.findall(r'var="([\w\s]+)"', test)
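
As a hedged aside that is not part of this answer: a negated character class, [^"]*, is another common way to restrict the capture group. The sketch below compares it with [\w\s]+ on a made-up test string and assumes the values never contain an escaped quote.

```python
import re

# Made-up input: a value with a space, one with punctuation, and an empty one.
test = 'a var="foo bar" b var="that-one!" c var=""'

# Word and whitespace characters only: punctuation and empty values are skipped.
word_space = re.findall(r'var="([\w\s]+)"', test)

# Anything except a double quote: every quoted value is captured.
any_but_quote = re.findall(r'var="([^"]*)"', test)

print(word_space)     # ['foo bar']
print(any_but_quote)  # ['foo bar', 'that-one!', '']
```

Because [^"] also matches newline characters, this form handles multi-line values without the re.DOTALL flag.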

edited Feb 17, 2017 at 16:29

answered Feb 17, 2017 at 16:16

m_callens


6 Comments

zwer Over a year ago

That will fail if the input string contains var="foo bar" (or any non-word character for that matter) under the assumption that he wants to extract everything between the quote marks.

m_callens Over a year ago

@zwer yes, that may be true, but if the words within the quotes are being used as variables, as the var= prefix suggests (an assumption that probably shouldn't be made without the OP specifying), the contents will never have a space.

zwer Over a year ago

\w will capture numbers as well, and 3this is not a valid variable name either.

Nonancourt Over a year ago

Thanks for the specification, @zwer. Yes, in fact, I'd be interested in the general case when it could be var="foo bar".

m_callens Over a year ago

@Nonancourt ok, I'll make the revision now.
