1

I'm working with a multi-line string and trying to capture only the valid comma-separated numbers in it.

For example:

my_string = """42     <---capture 42 in this line
1,234    <---capture 1,234 in this line
3,456,780    <---capture 3,456,780 in this line
34,56,780    <---nothing should be captured in this line, but 34 and 56,780 are captured
1234    <---nothing should be captured in this line, but 123 and 4 are captured
"""

Ideally, I want re.findall to return:

['42', '1,234', '3,456,780']

Here is my code:

import re

a = """
42
1,234
3,456,780
34,56,780
1234
"""
regex = re.compile(r'\d{1,3}(?:,\d{3})*')
print(regex.findall(a))

The result with my code above is:

['42', '1,234', '3,456,780', '34', '56,780', '123', '4']

But my desired output is:

['42', '1,234', '3,456,780']
6
  • Unrelated to the problem: you don't need the capturing group around the whole regexp. Commented Mar 2, 2020 at 5:16
  • Is the result with your code correct? If so, what is your question? Commented Mar 2, 2020 at 5:24
  • @CarySwoveland, good question. I just fixed my question. Commented Mar 2, 2020 at 5:26
  • Given your desired result (['42', '1,234', '3,456,780']), what do you mean by, "...but 34 and 56,780 captured" and "...but 123 and 4 captured"? Commented Mar 2, 2020 at 6:09
  • @CarySwoveland, 34,56,780 (only two digits, 56, between the commas) and 1234 (missing a comma) are not valid comma-separated numbers, so I don't want them to be captured. Commented Mar 2, 2020 at 7:59

2 Answers

3

If you only want to capture whole lines that match the pattern, you need to anchor the regexp with ^ and $, and use the re.MULTILINE flag so that they match line beginnings/endings rather than only string beginning/ending.

regex = re.compile(r'^\d{1,3}(?:,\d{3})*$', re.MULTILINE)
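To see the effect of the anchors, here is a minimal sketch that runs the unanchored and anchored patterns on the same data. The variable names (`text`, `unanchored`, `anchored`) are only illustrative, and it assumes each line contains nothing but the number.

```python
import re

# Sample data from the question: one candidate number per line.
text = """42
1,234
3,456,780
34,56,780
1234
"""

# Pattern from the question: matches valid-looking pieces anywhere in a line.
unanchored = re.compile(r'\d{1,3}(?:,\d{3})*')

# Anchored version: with re.MULTILINE, ^ and $ match at every line boundary,
# so the whole line has to be a valid comma-separated number.
anchored = re.compile(r'^\d{1,3}(?:,\d{3})*$', re.MULTILINE)

print(unanchored.findall(text))
# ['42', '1,234', '3,456,780', '34', '56,780', '123', '4']
print(anchored.findall(text))
# ['42', '1,234', '3,456,780']
```

Without `re.MULTILINE`, `^` and `$` would only match at the start and end of the whole string, so nothing in this input would match.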

4 Comments

Either I'm missing something obvious or you've forgotten to put the ^ and $ at the ends of the regex. Without them, the regex has the same result as OP's question.
@Barmar, given a multi-line string where each line contains only the number (no whitespace, letters, etc.), your suggestion works. What if each line also contains letters and special characters, as follows?

a = """
42 asdfad    <-- 42 should be captured in this line
1,234 as d    <-- 1,234 should be captured in this line
3,456,780    <-- 3,456,780 should be captured in this line
34,56,780    <-- nothing should be captured in this line
1234    <-- nothing should be captured in this line
"""
You could put \D* after ^ and before $.
@aneroid Oops, I had made the change while testing at regex101 but forgot to copy it to the answer.
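One possible reading of the `\D*` suggestion above, as a rough sketch: the capturing group around the number is an addition here (not part of the comment), so that `findall` returns only the number rather than the whole line. The sample data `mixed` and the name `line_pattern` are hypothetical, and the pattern assumes the surrounding text on each line contains no other digits.

```python
import re

# Hypothetical sample: numbers followed by letters and spaces on some lines.
mixed = """42 asdfad
1,234 as d
3,456,780
34,56,780
1234
"""

# \D* allows non-digit text before and after the number on the same line.
# The capturing group makes findall return just the number.
line_pattern = re.compile(r'^\D*(\d{1,3}(?:,\d{3})*)\D*$', re.MULTILINE)

print(line_pattern.findall(mixed))
# ['42', '1,234', '3,456,780']
```

Because `\D` also matches commas and newlines, this version is fairly permissive about what surrounds the number. If the extra text on a line can itself contain digits, the lookaround approach in the other answer is probably a better fit.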
1

Use lookarounds to make sure there is no digit or comma immediately before or after the number:

import re

a = """
42
1,234
3,456,780
34,56,780
1234
"""
regex = re.compile(r'(?<![\d,])\d{1,3}(?:,\d{3})*(?![\d,])')
print(regex.findall(a))

Output:

['42', '1,234', '3,456,780']
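Since the lookarounds don't depend on line boundaries, the same pattern should also cope with other text on the line. A small sketch with hypothetical sample data (`mixed_text`) illustrates this, along with one limitation worth knowing about:

```python
import re

pattern = re.compile(r'(?<![\d,])\d{1,3}(?:,\d{3})*(?![\d,])')

# Hypothetical input: numbers mixed with letters and spaces.
mixed_text = """42 asdfad
1,234 as d
3,456,780
34,56,780
1234
"""

print(pattern.findall(mixed_text))
# ['42', '1,234', '3,456,780']

# Limitation: a number directly followed by a comma (as in running prose)
# is rejected, because the lookahead forbids a comma after the match.
print(pattern.findall("paid 1,234, then left"))
# []
```

If numbers followed by punctuation commas need to match, the lookahead would have to be relaxed, for example to reject a comma only when a digit follows it.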
