I have several strings like these:

a = 'avg yearly income 25,07,708.33 '
b = 'current balance 1,25,000.00 in cash\n'
c = 'target savings 50,00,000.00 within next five years 1,000,000.00 '

I'm trying to split each one into alternating chunks of text and numbers. The expected output is:

aa = [('avg yearly income', '25,07,708.33')]
bb = [('current balance', '1,25,000.00', 'in cash')]
cc = [('target savings', '50,00,000.00', 'within next five years', '1,000,000.00')]

I'm using the following code:

import re
b = b.replace("\n","")
aa = re.findall(r'(.*)\s+(\d+(?:,\d+)*(?:\.\d){1,2})', a)
bb = re.findall(r'(.*)\s+(\d+(?:,\d+)*(?:\.\d){1,2})(.*)\s+', b)
cc = re.findall(r'(.*)\s+(\d+(?:,\d+)*(?:\.\d){1,2})(.*)\s+(\d+(?:,\d+)*(?:\.\d{1,2})?)', c)

But I'm getting the following output instead:

aa = [('avg yearly income', '25,07,708.3')]
bb = [('current balance', '1,25,000.0', '0 in')]
cc = [('target savings', '50,00,000.0', '0 within next five years', '1,000,000.00')]

What's wrong with my regular expression patterns?

4 Answers

Instead of re.findall, you can use re.split to split the strings on a space bounded by a letter and a digit:

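import re
d = ['avg yearly income 25,07,708.33 ', 'current balance 1,25,000.00 in cash\n', 'target savings 50,00,000.00 within next five years 1,000,000.00 ']
# Split on a single whitespace character that sits between a letter and a digit (in either order)
pattern = r'(?<=[a-zA-Z])\s(?=\d)|(?<=\d)\s(?=[a-zA-Z])'
final_results = [re.split(pattern, s) for s in d]
# Strip the trailing whitespace or newline left on the last chunk
new_results = [[chunk.rstrip() for chunk in parts] for parts in final_results]
print(new_results)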

Output:

[['avg yearly income', '25,07,708.33'], ['current balance', '1,25,000.00', 'in cash'], ['target savings', '50,00,000.00', 'within next five years', '1,000,000.00']]

You can use re.split with the pattern r'(?<=\d)\s+(?=\w)|(?<=\w)\s+(?=\d)'. Note that trailing whitespace stays on the last chunk:

>>> ptrn = r'(?<=\d)\s+(?=\w)|(?<=\w)\s+(?=\d)'
>>> re.split(ptrn, a)
['avg yearly income', '25,07,708.33 ']
>>> re.split(ptrn, b)
['current balance', '1,25,000.00', 'in cash\n']
>>> re.split(ptrn, c)
['target savings', '50,00,000.00', 'within next five years', '1,000,000.00 ']
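
To get tuples without the trailing whitespace, as in the expected output from the question, one option is to strip each string before splitting. This is only a minimal sketch: the helper name `split_chunks` and the constant `PTRN` are made up for illustration, and it assumes every number is separated from the surrounding text by whitespace.

```python
import re

# Same lookaround pattern as above, compiled once for reuse.
PTRN = re.compile(r'(?<=\d)\s+(?=\w)|(?<=\w)\s+(?=\d)')

def split_chunks(text):
    """Strip leading/trailing whitespace, then split into text and number chunks."""
    return tuple(PTRN.split(text.strip()))

a = 'avg yearly income 25,07,708.33 '
b = 'current balance 1,25,000.00 in cash\n'
c = 'target savings 50,00,000.00 within next five years 1,000,000.00 '

print([split_chunks(a)])  # [('avg yearly income', '25,07,708.33')]
print([split_chunks(b)])  # [('current balance', '1,25,000.00', 'in cash')]
print([split_chunks(c)])
```

Stripping first also handles the newline in b, so the separate replace("\n", "") step is not needed.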

Use re.split(). This example uses the number pattern from your question, with the {1,2} quantifier moved inside the group so that it reads (?:\.\d{1,2}), and it works fine:

>>> r = re.compile(r'(\d+(?:,\d+)*(?:\.\d{1,2}))')
>>> r.split('avg yearly income 25,07,708.33 ')
['avg yearly income ', '25,07,708.33', ' ']
>>> r.split('current balance 1,25,000.00 in cash\n')
['current balance ', '1,25,000.00', ' in cash\n']
>>> r.split('target savings 50,00,000.00 within next five years 1,000,000.00 ')
['target savings ', '50,00,000.00', ' within next five years ', '1,000,000.00', ' ']
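
As for what is wrong with the patterns in the question, it appears to come down to where the quantifier sits. (?:\.\d){1,2} repeats the whole "dot plus one digit" unit once or twice, so it matches ".3" and then stops because the next character is not a dot. (?:\.\d{1,2}) matches one dot followed by one or two digits. The short comparison below is a sketch; the names `buggy` and `fixed` are only labels for this illustration.

```python
import re

# Quantifier outside the group: repeats the whole ".<digit>" unit once or twice.
buggy = re.compile(r'\d+(?:,\d+)*(?:\.\d){1,2}')
# Quantifier inside the group: one dot followed by one or two digits.
fixed = re.compile(r'\d+(?:,\d+)*(?:\.\d{1,2})')

text = 'avg yearly income 25,07,708.33 '
print(buggy.search(text).group())  # 25,07,708.3
print(fixed.search(text).group())  # 25,07,708.33
```

That also explains the stray '0' in the question's output: the number group stops one digit early, so the leftover digit falls into the following (.*) group. Separately, the trailing \s+ in the second pattern forces the last (.*) to stop before a whitespace character, which is why "cash" is missing from bb.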

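You can use re.split, as the answers above suggest. Because the number pattern is wrapped in a capturing group, the matched numbers are kept in the result: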

import re
a = 'avg yearly income 25,07,708.33 '
b = 'current balance 1,25,000.00 in cash\n'
c = 'target savings 50,00,000.00 within next five years 1,000,000.00 '

aa = re.split(r'(\d+(?:,\d+)*(?:\.\d{1,2}))', a)
bb = re.split(r'(\d+(?:,\d+)*(?:\.\d{1,2}))', b)
cc = re.split(r'(\d+(?:,\d+)*(?:\.\d{1,2}))', c)

print(aa)
print(bb)
print(cc)

This prints:

['avg yearly income ', '25,07,708.33', ' ']
['current balance ', '1,25,000.00', ' in cash\n']
['target savings ', '50,00,000.00', ' within next five years ', '1,000,000.00', ' ']
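
The chunks above still carry surrounding whitespace, and a whitespace-only chunk appears whenever the string ends right after a number. A possible cleanup, sketched here with a hypothetical helper `clean_split`, is to strip every chunk and drop the empty ones. Note the assumption baked into the pattern: a number must have a decimal part, so a bare integer such as 5 would be left inside the text chunk.

```python
import re

# Capturing group keeps the matched numbers in the re.split result.
NUMBER = re.compile(r'(\d+(?:,\d+)*(?:\.\d{1,2}))')

def clean_split(text):
    """Split around numbers, strip each chunk, and discard empty chunks."""
    parts = (part.strip() for part in NUMBER.split(text))
    return [part for part in parts if part]

print(clean_split('avg yearly income 25,07,708.33 '))
# ['avg yearly income', '25,07,708.33']
print(clean_split('current balance 1,25,000.00 in cash\n'))
# ['current balance', '1,25,000.00', 'in cash']
```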
