
How to split the string using regex

Input:
result = '1,000.03AM2,97.2323,089.301,903.230.0034,928.9911,24.30AM'

I want to split this string so that I can store the pieces in separate variables for further use, like the following.

Expected output:
a = 1,000.03AM, b = 2,97.23, c = 23,089.30, d = 1,903.23, e = 0.00, f = 34,928.99, g = 11,24.30AM

I tried the following, but it gives the wrong output:

import re
print(re.findall(r'[0-9.]+|[^0-9.]', result))
  • @shaikmoeed yes. Edited Commented Dec 6, 2019 at 8:50
  • What can be the max length of the string? Commented Dec 6, 2019 at 8:51
  • What does AM stand for? AM/PM? It looks like you should parse these values as floats, but including AM/PM would make them strings, unless they are times. Commented Dec 6, 2019 at 8:54
  • @Abhi Your expected result does not match the output of the regex mentioned above by Wiktor. Commented Dec 6, 2019 at 8:55
  • @shaikmoeed But my answer contains the solution that matches what is expected. Commented Dec 6, 2019 at 9:06

3 Answers


You may extract the strings using

re.findall(r'\d+(?:,\d+)*(?:\.\d{2})?[^,\d]*', text)

See the regex demo

Details

  • \d+ - 1+ digits
  • (?:,\d+)* - 0 or more repetitions of a comma and 1+ digits
  • (?:\.\d{2})? - an optional occurrence of a dot and 2 digits
  • [^,\d]* - any 0 or more chars other than a comma and digit.

Python demo:

import re
text = "1,000.03AM2,97.2323,089.301,903.230.0034,928.9911,24.30AM"
print( re.findall(r'\d+(?:,\d+)*(?:\.\d{2})?[^,\d]*', text) )
# => ['1,000.03AM', '2,97.23', '23,089.30', '1,903.23', '0.00', '34,928.99', '11,24.30AM']
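The same pattern can also be written in verbose mode, so that each piece from the Details list sits next to its own comment. This is only a sketch: the names NUMBER_WITH_SUFFIX and fields are made up for illustration, and unpacking into a to g assumes the input always holds exactly seven values.

```python
import re

# Same regex as above, spread over several lines with re.VERBOSE.
NUMBER_WITH_SUFFIX = re.compile(r"""
    \d+            # 1+ digits
    (?:,\d+)*      # 0 or more repetitions of a comma and 1+ digits
    (?:\.\d{2})?   # an optional dot followed by exactly 2 digits
    [^,\d]*        # any trailing chars other than a comma or digit, e.g. AM
""", re.VERBOSE)

text = "1,000.03AM2,97.2323,089.301,903.230.0034,928.9911,24.30AM"
fields = NUMBER_WITH_SUFFIX.findall(text)

# Unpacking raises ValueError if the number of matches is not exactly seven.
a, b, c, d, e, f, g = fields
print(a, g)
```

If the number of values can vary, keeping them in the fields list is probably safer than unpacking into separate names.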

7 Comments

This gives the first element as '1,000.03AM2' where it should be '1,000.03AM' as mentioned by OP.
@shaikmoeed I have reverted to the original suggestion.
But it still does not match the OP's expected output.
After the dot there should be only two digits (plus two letters, if present), but this yields 928.9911.
Ok, now it does.

To get your expected result, you need the following regex:

re.findall(r"[\d,]+\.\d{2}(?:AM)?", result)

This produces the following:

['1,000.03AM', '2,97.23', '23,089.30', '1,903.23', '0.00', '34,928.99', '11,24.30AM']

Regex explanation:

  • [\d,] - matches digits and commas
  • [\d,]+\.\d{2} - matches the whole float value (with two digits after the dot)
  • (?:AM)? - matches an optional AM in a non-capturing group; in the examples further down I use (?=AM)? instead so that AM is not included in the result
  • If something other than AM can appear in that position, you may change (?:AM) to (?:AM|Other|...)
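If the suffix is needed separately from the number, one possible variation (not part of the pattern above) is to capture both parts in their own groups. The PM alternative here is an assumption about what might appear in place of AM; the sample input only contains AM.

```python
import re

result = '1,000.03AM2,97.2323,089.301,903.230.0034,928.9911,24.30AM'

# Group 1 captures the number, group 2 the optional suffix.
# "PM" is a hypothetical extra suffix, added only to illustrate the alternation.
pairs = re.findall(r"([\d,]+\.\d{2})(AM|PM)?", result)

for number, suffix in pairs:
    # findall returns an empty string for a group that did not take part in the match.
    print(number, suffix or "(no suffix)")
```

With capturing groups, findall returns one tuple per match instead of a single string.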

If you need to parse the values as floats, I have two suggestions. The first is to remove the commas (in Python 3, map returns an iterator, so wrap it in list):

list(map(lambda x: float(x.replace(",", "")), re.findall(r"[\d,]+\.\d{2}(?=AM)?", result)))

Result:

[1000.03, 297.23, 23089.3, 1903.23, 0.0, 34928.99, 1124.3]

Another option is to use locale (this assumes the en_US.UTF8 locale is installed on your system):

>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF8')
'en_US.UTF8'
>>> list(map(locale.atof, re.findall(r"[\d,]+\.\d{2}(?=AM)?", result)))
[1000.03, 297.23, 23089.3, 1903.23, 0.0, 34928.99, 1124.3]
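Note that float drops trailing zeros when printed (23089.3 rather than 23089.30). If the two decimal places matter, for example when the values are amounts, one alternative is decimal.Decimal from the standard library. This is a sketch under that assumption, not something the question asks for; to_decimal is a hypothetical helper name.

```python
import re
from decimal import Decimal

result = '1,000.03AM2,97.2323,089.301,903.230.0034,928.9911,24.30AM'

def to_decimal(field):
    # Strip the grouping commas; Decimal then keeps the digits exactly as written.
    return Decimal(field.replace(",", ""))

amounts = [to_decimal(m) for m in re.findall(r"[\d,]+\.\d{2}", result)]
print(amounts[2])  # 23089.30
```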


Provided the string's length and layout always remain the same, the most efficient solution would be fixed slicing:

a = result[0:10]
b = result[10:17]
c = result[17:26]
d = result[26:34]
e = result[34:38]
f = result[38:47]
g = result[47:57]
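The slice boundaries above can also be derived from a list of field widths instead of being typed by hand. A sketch, assuming the widths below (counted from the sample string) never change; WIDTHS and split_fixed are illustrative names.

```python
from itertools import accumulate

result = '1,000.03AM2,97.2323,089.301,903.230.0034,928.9911,24.30AM'

# Field widths counted from the sample string; any change in the data breaks this.
WIDTHS = [10, 7, 9, 8, 4, 9, 10]

def split_fixed(text, widths):
    # Running totals give the end offset of each field: 10, 17, 26, ...
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return [text[start:end] for start, end in zip(starts, ends)]

fields = split_fixed(result, WIDTHS)
print(fields)
```

This only works while every field keeps its exact width, which is why the regex answers are more robust for data like this.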

Hope this helps.

3 Comments

I suspect the AM optional part may be missing or present in arbitrary comma-separated fields, so this is not likely to help in the end.
If alphabetical characters aren't important, then you can try this: re.findall(r"[\d,]+\.\d{2}", result)
This would be perfect: re.findall(r"([\d,]+\.\d{2}[A-Z]{2}?|[\d,]+\.\d{2})", result)
