Extract continuous numeric characters from a string in Python

Question

I want to extract the number that appears after a fixed set of characters ('AA='). The difficulties are: (i) I do not know how long the number is; (ii) I do not know what comes right after the number (it could be a blank space or ANY character except 0-9; I do not know what these characters could be, but they are definitely not 0-9); (iii) the number can be in exponential form (Lines 4 and 5 below).

Below are a few of the many inputs I can have.

Line 1: 123 NUBA AA=1.2345 $BB=1234.55
Line 2: 123 NUBA MM AA=1.2345678&BB=1234.55
Line 3: 123 NUBA RRNJH AA=1.2#ALPHA
Line 4: 123 NUBA ABCD AA=1.2E-5 GBRO
Line 5: 123 NUBA ABCD AA=1.245E-7$ MN
...

The results should be 1.2345, 1.2345678, 1.2, 1.2e-5 and 1.245e-7 for the respective lines above.

PS: I know how to use .find to get the starting location of AA=, but that alone does not help with the conditions above. I also understand that one way would be to loop through each character after AA= and break when a blank space or anything other than [0-9, ., E, -] is seen, but that is clumsy and takes up unnecessary space in my code. I am looking for a neater way of doing this.

The neat way is to use a regular expression; that's what they were invented for. Start with the re module.
– Mark Ransom, Commented Jan 15, 2021 at 22:55
@MarkRansom: Thanks, can you please share a simple relevant example?
– nuki, Commented Jan 15, 2021 at 23:01

The fourth bird · Accepted Answer · 2021-01-16 09:13:51Z

2

You could use a single pattern with a capture group. For example, re.findall returns only the value of the capture group when the pattern contains exactly one group.

\bAA=(\d+(?:\.\d+)?(?:[eE][-+]?[0-9]+)?)

Explanation

\bAA= A word boundary, then match AA=
( Capture group 1
- \d+ Match 1+ digits
- (?:\.\d+)? Match an optional decimal part
- (?:[eE][-+]?[0-9]+)? Match an optional exponential part
) Close group 1
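
The breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace and allows comments inside the regex. This is only a sketch of the same pattern; the name AA_NUMBER and the sample string are chosen here for illustration.

```python
import re

# Same pattern as in the explanation, spread over several lines.
# AA_NUMBER is an illustrative name, not something from the answer.
AA_NUMBER = re.compile(
    r"""
    \bAA=                     # word boundary, then the literal AA=
    (                         # open capture group 1
        \d+                   # one or more digits
        (?:\.\d+)?            # optional decimal part
        (?:[eE][-+]?[0-9]+)?  # optional exponent part
    )                         # close capture group 1
    """,
    re.VERBOSE,
)

match = AA_NUMBER.search("123 NUBA ABCD AA=1.245E-7$ MN")
value = match.group(1) if match else None
print(value)  # 1.245E-7
```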


import re
 
regex = r"\bAA=(\d+(?:\.\d+)?(?:[eE][-+]?[0-9]+)?)"
 
s = ("Line 1: 123 NUBA AA=1.2345 $BB=1234.55\n"
    "Line 2: 123 NUBA MM AA=1.2345678&BB=1234.55\n"
    "Line 3: 123 NUBA RRNJH AA=1.2#ALPHA\n"
    "Line 4: 123 NUBA ABCD AA=1.2E-5 GBRO\n"
    "Line 5: 123 NUBA ABCD AA=1.245E-7$ MN")
 
print(re.findall(regex, s))

Output

['1.2345', '1.2345678', '1.2', '1.2E-5', '1.245E-7']
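
If numeric values are needed rather than strings, the captured text can be passed to float(), which accepts both plain decimals and exponent notation. The following is a minimal sketch that assumes one record per line; extract_aa and AA_PATTERN are illustrative names, and the last input is a made-up line with no AA= field to show the fallback.

```python
import re

AA_PATTERN = re.compile(r"\bAA=(\d+(?:\.\d+)?(?:[eE][-+]?[0-9]+)?)")

def extract_aa(line):
    """Return the number after 'AA=' as a float, or None if it is absent."""
    match = AA_PATTERN.search(line)
    return float(match.group(1)) if match else None

lines = [
    "123 NUBA AA=1.2345 $BB=1234.55",
    "123 NUBA MM AA=1.2345678&BB=1234.55",
    "123 NUBA RRNJH AA=1.2#ALPHA",
    "123 NUBA ABCD AA=1.2E-5 GBRO",
    "123 NUBA ABCD AA=1.245E-7$ MN",
    "123 NUBA no key on this line",  # hypothetical input without AA=
]

values = [extract_aa(line) for line in lines]
print(values)
# [1.2345, 1.2345678, 1.2, 1.2e-05, 1.245e-07, None]
```

Using re.search per line, rather than re.findall on the whole text, keeps each result aligned with its input line even when a line has no match.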


edited Jan 16, 2021 at 9:13

answered Jan 15, 2021 at 23:30


2 Comments

nuki Over a year ago

Interesting! Thanks! I am stuck on some other cases which I forgot to include in my question before. Please see my updated question. How can Lines 4 and 5 be handled as well?

nuki Over a year ago

Thanks for updating your answer. That explanation is very exhaustive and helpful for someone who has never used regex before. Thanks!!

Mitchell Olislagers · Answer · 2021-01-15 23:03:55Z

1

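This will give you the output you want for the first three lines:

import re

string1 = '123 NUBA AA=1.2345 $BB=1234.55'
string2 = '123 NUBA MM AA=1.2345678&BB=1234.55'
string3 = '123 NUBA RRNJH AA=1.2#ALPHA'

# Slice each string from 'AA=' onward, then take the first number found.
# \.? allows at most one decimal point; exponent forms are not handled here.
pattern = r'\d+\.?\d*'
print(re.findall(pattern, string1[string1.find("AA="):])[0])
print(re.findall(pattern, string2[string2.find("AA="):])[0])
print(re.findall(pattern, string3[string3.find("AA="):])[0])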

Output

1.2345
1.2345678
1.2
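
The same slice-then-match idea can be stretched to the exponential inputs by adding an optional exponent to the pattern. This is a sketch rather than part of the answer as posted: number_after and NUMBER are hypothetical names, str.partition replaces the find-and-slice step so a missing key is easy to detect, and re.match anchors the number directly after the key instead of taking the first number found anywhere later in the string.

```python
import re

# Digits, an optional decimal point with further digits, an optional exponent.
NUMBER = re.compile(r"\d+\.?\d*(?:[eE][-+]?\d+)?")

def number_after(text, key="AA="):
    """Return the number immediately following key as a string, or None."""
    _, found, rest = text.partition(key)
    if not found:  # key does not occur in text
        return None
    match = NUMBER.match(rest)  # must start right after the key
    return match.group(0) if match else None

print(number_after("123 NUBA RRNJH AA=1.2#ALPHA"))    # 1.2
print(number_after("123 NUBA ABCD AA=1.2E-5 GBRO"))   # 1.2E-5
print(number_after("123 NUBA ABCD AA=1.245E-7$ MN"))  # 1.245E-7
```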

answered Jan 15, 2021 at 23:03


1 Comment

nuki Over a year ago

This works! But I am stuck on some other cases which I forgot to include in my question before. Please see my updated question. How can Lines 4 and 5 be handled?
