Python 2.7. Extracting data from some part of a string using a regex

Question

Let's import the re module.

import re

Assume there's a string containing some data.

data = '''Mike: Jan 25.1, Feb 24.3, Mar 29.0
   Rob: Jan 22.3, Feb 20.0, Mar 22.0
   Nick: Jan 23.4, Feb 22.0, Mar 23.4'''

For example, we want to extract floats for Rob's line only.

name = 'Rob'

I'd do it like this:

def data_extractor(name, data):
    return re.findall(r'\d+\.\d+', re.findall(r'{}.*'.format(name),data)[0])

The output is ['22.3', '20.0', '22.0'].

Is my approach Pythonic, or should it be improved somehow? It does the job, but I'm not certain that such code is appropriate.

Thanks for your time.

For me personally, I'd put the two re.findall calls on separate lines. The first sets a value, the second uses that value. Sure, you can do it in one line, but for reading down the road I like it a little more explicit. Just my 2 cents.
– sniperd, Commented Jul 25, 2017 at 15:31
A possible problem is that each time data_extractor() is called, it searches data from the beginning for the name. If it's an ad hoc query for a few arbitrary names, that's OK. But if you will be using all the names, this is inefficient, because it scans the same text every time.
– user557597, Commented Jul 25, 2017 at 17:08
Also, pythex is a good tool for testing Python regexes: pythex.org
– Connor, Commented Jul 25, 2017 at 21:31
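
The efficiency point raised in the comments can be illustrated with a minimal sketch: if all names are needed, the text can be parsed once into a dictionary instead of being searched again for every name. The names parse_all, FLOAT_RE and values_by_name are hypothetical and not part of the original question, and the sketch assumes every line has the form "Name: values".

```python
import re

data = '''Mike: Jan 25.1, Feb 24.3, Mar 29.0
   Rob: Jan 22.3, Feb 20.0, Mar 22.0
   Nick: Jan 23.4, Feb 22.0, Mar 23.4'''

# Compile the float pattern once so it is reused for every line.
FLOAT_RE = re.compile(r'\d+\.\d+')

def parse_all(data):
    """Scan the text once and map each name to its list of float strings."""
    result = {}
    for line in data.splitlines():
        # Split "Name: values" at the first colon; skip lines without one.
        name, sep, rest = line.partition(':')
        if sep:
            result[name.strip()] = FLOAT_RE.findall(rest)
    return result

values_by_name = parse_all(data)
print(values_by_name['Rob'])  # ['22.3', '20.0', '22.0']
```

After the single pass, every lookup is a plain dictionary access, so asking for all three names does not rescan the text.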

Wiktor Stribiżew · Accepted Answer · 2017-07-25 21:08:58Z

One way to avoid using a regex to locate the line is to split the data into lines, strip each one, find the line that starts with the name, and then grab all the float values from that line:

import re
data = '''Mike: Jan 25.1, Feb 24.3, Mar 29.0
   Rob: Jan 22.3, Feb 20.0, Mar 22.0
   Nick: Jan 23.4, Feb 22.0, Mar 23.4'''
name = 'Rob'
# Strip the indentation so startswith() can test the name directly
lines = [line.strip() for line in data.split("\n")]
for line in lines:
    if line.startswith(name):
        print(re.findall(r'\d+\.\d+', line))
# => ['22.3', '20.0', '22.0']

If you want a purely regex-based approach, you can use the PyPI regex module with a \G-based pattern:

import regex
data = '''Mike: Jan 25.1, Feb 24.3, Mar 29.0
   Rob: Jan 22.3, Feb 20.0, Mar 22.0
   Nick: Jan 23.4, Feb 22.0, Mar 23.4'''
name = 'Rob'
rx = r'(?:\G(?!\A)|{}).*?(\d+\.\d+)'.format(regex.escape(name))
print(regex.findall(rx, data))

This pattern matches:

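(?:\G(?!\A)|{}) - either the end of the previous successful match (\G, with (?!\A) excluding the start of the string) or the escaped name that format() substitutes for {}
.*? - any 0+ chars other than line break chars, as few as possible; because . does not cross a line break, the chain of matches stops at the end of the name's line
(\d+\.\d+) - Group 1 (the only value findall will return) matching 1+ digits, a dot and 1+ digits.

regex.escape(name) escapes special chars such as ( and ) that might appear in the name.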
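
For comparison, the same escaping idea is available in the standard library as re.escape. The following is a minimal sketch, not the code from the answer: it rewrites the question's data_extractor so that the name is escaped, must sit at the start of a line and be followed by a colon, and a missing name yields an empty list instead of an IndexError. The trailing-colon requirement is an assumption based on the sample data.

```python
import re

data = '''Mike: Jan 25.1, Feb 24.3, Mar 29.0
   Rob: Jan 22.3, Feb 20.0, Mar 22.0
   Nick: Jan 23.4, Feb 22.0, Mar 23.4'''

def data_extractor(name, data):
    """Return the float strings on the line that starts with name, or []."""
    # re.escape() neutralises regex metacharacters in the name.
    # With re.MULTILINE, ^ and $ match at the start and end of each line;
    # [ \t]* skips the indentation, and the colon keeps 'Rob' from
    # matching a longer name such as 'Robert'.
    pattern = r'^[ \t]*{}:(.*)$'.format(re.escape(name))
    match = re.search(pattern, data, re.MULTILINE)
    if match is None:
        return []
    return re.findall(r'\d+\.\d+', match.group(1))

print(data_extractor('Rob', data))  # ['22.3', '20.0', '22.0']
print(data_extractor('Bob', data))  # []
```

Splitting the work into one search for the line and one findall for the floats also follows the suggestion in the comments to keep the two steps on separate lines.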

pyramidka Over a year ago

Thank you for your answer. I have never dealt with the PyPI regex module, so it's something to dive into.
