Extracting a List from Text using Regular Expression in Python

Question

I am looking to extract a list of tuples from the following string:

text='''Consumer Price Index:
        +0.2% in Sep 2020

        Unemployment Rate:
        +7.9% in Sep 2020

        Producer Price Index:
        +0.4% in Sep 2020

        Employment Cost Index:
        +0.5% in 2nd Qtr of 2020

        Productivity:
        +10.1% in 2nd Qtr of 2020

        Import Price Index:
        +0.3% in Sep 2020

        Export Price Index:
        +0.6% in Sep 2020'''

I am using 'import re' for the process.

The output should be something like: [('Consumer Price Index', '+0.2%', 'Sep 2020'), ...]

I want a re.findall call that produces the above output. So far I have this:

re.findall(r"(:\Z)\s+(%\Z+)(\Ain )", text)

Here I am trying to capture the characters before ':', then the characters before '%', and then the characters after 'in'.

I'm really just clueless on how to continue. Any help would be appreciated. Thanks!

Wiktor Stribiżew · Accepted Answer · 2020-11-01 14:09:42Z

You can use

re.findall(r'(\S.*):\n\s*(\+?\d[\d.]*%)\s+in\s+(.*)', text)
# => [('Consumer Price Index', '+0.2%', 'Sep 2020'), ('Unemployment Rate', '+7.9%', 'Sep 2020'), ('Producer Price Index', '+0.4%', 'Sep 2020'), ('Employment Cost Index', '+0.5%', '2nd Qtr of 2020'), ('Productivity', '+10.1%', '2nd Qtr of 2020'), ('Import Price Index', '+0.3%', 'Sep 2020'), ('Export Price Index', '+0.6%', 'Sep 2020')]

See the regex demo and the Python demo.

Details

(\S.*) - Group 1: a non-whitespace char followed by zero or more chars other than line break chars, as many as possible
: - a colon
\n - a newline
\s* - zero or more whitespace chars
(\+?\d[\d.]*%) - Group 2: an optional +, a digit, zero or more digits or dots, and a %
\s+in\s+ - the word in surrounded by one or more whitespace chars
(.*) - Group 3: zero or more chars other than line break chars, as many as possible
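
The breakdown above can also live inside the pattern itself. The following is a minimal sketch, not part of the original answer: it compiles the same regex with re.VERBOSE so that each piece carries its own comment. The names PATTERN and sample are illustrative, and sample is a shortened copy of the question's text.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
# In verbose mode, unescaped whitespace and "#" comments in the pattern are ignored.
PATTERN = re.compile(r"""
    (\S.*)            # group 1: label, from its first non-whitespace char
    :\n               # colon that ends the label line, then the newline
    \s*               # indentation before the value
    (\+?\d[\d.]*%)    # group 2: optional +, digits and dots, then %
    \s+in\s+          # the word "in" surrounded by whitespace
    (.*)              # group 3: the rest of the line (the period)
""", re.VERBOSE)

# Shortened sample in the same layout as the question's text (assumption).
sample = """Consumer Price Index:
        +0.2% in Sep 2020

        Employment Cost Index:
        +0.5% in 2nd Qtr of 2020"""

result = PATTERN.findall(sample)
print(result)
# => [('Consumer Price Index', '+0.2%', 'Sep 2020'), ('Employment Cost Index', '+0.5%', '2nd Qtr of 2020')]
```

Because findall returns one tuple per match when the pattern has several groups, no further post-processing is needed.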

Kaligule · Accepted Answer · 2021-01-31 19:34:18Z

Regex is not a good way to approach this: it quickly becomes hard to read and maintain. The same result can be reached more cleanly with Python's string methods:

list_of_lines = [
    line.strip()                 # remove leading and trailing whitespace
    for line in text.split("\n") # split the text into lines
    if line.strip()              # skip empty and whitespace-only lines
]

list_of_lines is now:

['Consumer Price Index:', '+0.2% in Sep 2020', 'Unemployment Rate:', '+7.9% in Sep 2020', 'Producer Price Index:', '+0.4% in Sep 2020', 'Employment Cost Index:', '+0.5% in 2nd Qtr of 2020', 'Productivity:', '+10.1% in 2nd Qtr of 2020', 'Import Price Index:', '+0.3% in Sep 2020', 'Export Price Index:', '+0.6% in Sep 2020']

Now all we have to do is build tuples from consecutive pairs of elements of this list.

def pairwise(iterable):
    "s -> (s0, s1), (s2, s3), (s4, s5), ..."
    a = iter(iterable)
    return zip(a, a)

(from here)
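
The reason zip(a, a) yields non-overlapping pairs is that both arguments are the same iterator, so each tuple consumes two consecutive items. The sketch below uses made-up data to illustrate this. It also shows one caveat: a trailing element without a partner is silently dropped. Note that itertools.pairwise, added in Python 3.10, is a different function that yields overlapping pairs.

```python
def pairwise(iterable):
    "s -> (s0, s1), (s2, s3), (s4, s5), ..."
    a = iter(iterable)
    return zip(a, a)  # the same iterator twice: each tuple takes two items


# Illustrative data, not from the question.
even = list(pairwise(["a", "b", "c", "d"]))
odd = list(pairwise(["a", "b", "c"]))

print(even)  # [('a', 'b'), ('c', 'd')]
print(odd)   # [('a', 'b')]  -- the unpaired 'c' is dropped
```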

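Now we can get close to our desired output. In Python 3, zip returns a lazy iterator, so wrap the result in list() to print its contents:

print(list(pairwise(list_of_lines)))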

[('Consumer Price Index:', '+0.2% in Sep 2020'), ('Unemployment Rate:', '+7.9% in Sep 2020'), ('Producer Price Index:', '+0.4% in Sep 2020'), ('Employment Cost Index:', '+0.5% in 2nd Qtr of 2020'), ('Productivity:', '+10.1% in 2nd Qtr of 2020'), ('Import Price Index:', '+0.3% in Sep 2020'), ('Export Price Index:', '+0.6% in Sep 2020')]
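
The pairs above still carry the trailing colon and an unsplit value line. One way to reach the three-element tuples the question asks for, using only string methods, is sketched here. The helper name to_triples is hypothetical, and the code assumes every value line contains the separator " in ".

```python
def to_triples(text):
    """Turn 'Label:' / '+x% in period' line pairs into (label, value, period)."""
    # Strip each line and skip empty or whitespace-only lines.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    it = iter(lines)
    triples = []
    # zip(it, it) walks the lines two at a time: label line, then value line.
    for label_line, value_line in zip(it, it):
        # Split on the first " in ": '+0.2% in Sep 2020' -> '+0.2%', 'Sep 2020'
        value, _, period = value_line.partition(" in ")
        triples.append((label_line.rstrip(":"), value, period))
    return triples


# Shortened sample in the same layout as the question's text (assumption).
sample = """Consumer Price Index:
        +0.2% in Sep 2020

        Employment Cost Index:
        +0.5% in 2nd Qtr of 2020"""

print(to_triples(sample))
# [('Consumer Price Index', '+0.2%', 'Sep 2020'), ('Employment Cost Index', '+0.5%', '2nd Qtr of 2020')]
```

If a value line lacks " in ", partition leaves the whole line in the second element and an empty string in the third, so malformed input is easy to spot.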
