4

I am looking to extract a list of tuples from the following string:

text='''Consumer Price Index:
        +0.2% in Sep 2020

        Unemployment Rate:
        +7.9% in Sep 2020

        Producer Price Index:
        +0.4% in Sep 2020

        Employment Cost Index:
        +0.5% in 2nd Qtr of 2020

        Productivity:
        +10.1% in 2nd Qtr of 2020

        Import Price Index:
        +0.3% in Sep 2020

        Export Price Index:
        +0.6% in Sep 2020'''

I am using the re module for this.

The output should be something like: [('Consumer Price Index', '+0.2%', 'Sep 2020'), ...]

I want a re.findall call that produces the above output. So far I have this:

re.findall(r"(:\Z)\s+(%\Z+)(\Ain )", text)

Here I am trying to capture the characters before ':', then the characters before '%', and then the characters after 'in'.

I'm really just clueless on how to continue. Any help would be appreciated. Thanks!

2 Answers

5

You can use

re.findall(r'(\S.*):\n\s*(\+?\d[\d.]*%)\s+in\s+(.*)', text)
# => [('Consumer Price Index', '+0.2%', 'Sep 2020'), ('Unemployment Rate', '+7.9%', 'Sep 2020'), ('Producer Price Index', '+0.4%', 'Sep 2020'), ('Employment Cost Index', '+0.5%', '2nd Qtr of 2020'), ('Productivity', '+10.1%', '2nd Qtr of 2020'), ('Import Price Index', '+0.3%', 'Sep 2020'), ('Export Price Index', '+0.6%', 'Sep 2020')]

See the regex demo and the Python demo.

Details

  • (\S.*) - Group 1: a non-whitespace char followed by zero or more chars other than line break chars, as many as possible
  • : - a colon
  • \n - a newline
  • \s* - zero or more whitespace chars
  • (\+?\d[\d.]*%) - Group 2: an optional +, a digit, zero or more digits/dots, and a %
  • \s+in\s+ - the word in surrounded by one or more whitespace chars
  • (.*) - Group 3: zero or more chars other than line break chars, as many as possible
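
The breakdown above can also be written into the pattern itself. The following is a minimal sketch, not part of the original answer: it uses re.VERBOSE so each piece carries its own comment, and it runs on a shortened two-entry sample of the text from the question. The names pattern and result are illustrative only.

```python
import re

# Shortened sample of the text from the question (two entries only).
text = '''Consumer Price Index:
        +0.2% in Sep 2020

        Unemployment Rate:
        +7.9% in Sep 2020'''

# Same regex as in the answer, split across lines with re.VERBOSE.
# In verbose mode, unescaped whitespace in the pattern is ignored
# and everything after "#" is a comment.
pattern = re.compile(r'''
    (\S.*):            # group 1: the label, up to the colon ending the line
    \n\s*              # the line break plus the indentation of the next line
    (\+?\d[\d.]*%)     # group 2: optional plus sign, digits/dots, percent sign
    \s+in\s+           # the word "in" surrounded by whitespace
    (.*)               # group 3: the rest of the line (the period)
''', re.VERBOSE)

result = pattern.findall(text)
print(result)
```

Note that group 2 only accepts an optional leading +. If the data could contain negative changes, replacing \+? with [+-]? would be one way to allow for them; that is an assumption about the input, since the sample contains only positive values.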

1

Regex is not a good way to approach this. It gets hard to read and maintain very quickly. This can be done much more cleanly with Python's string methods:

list_of_lines = [
    line.strip()                 # remove trailing and leading whitespace
    for line in text.split("\n") # split up the text into lines
    if line.strip()              # filter out empty and whitespace-only lines
]

list_of_lines is now:

['Consumer Price Index:', '+0.2% in Sep 2020', 'Unemployment Rate:', '+7.9% in Sep 2020', 'Producer Price Index:', '+0.4% in Sep 2020', 'Employment Cost Index:', '+0.5% in 2nd Qtr of 2020', 'Productivity:', '+10.1% in 2nd Qtr of 2020', 'Import Price Index:', '+0.3% in Sep 2020', 'Export Price Index:', '+0.6% in Sep 2020']

Now all we have to do is build tuples from consecutive pairs of elements of this list.

def pairwise(iterable):
    "s -> (s0, s1), (s2, s3), (s4, s5), ..."
    a = iter(iterable)
    return zip(a, a)

(from here)

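Now we can build the pairs (note that each tuple still holds the label with its trailing colon and the unsplit value line, not yet the three fields asked for in the question). In Python 3, zip returns a lazy iterator, so it has to be wrapped in list() to print the contents:

print(list(pairwise(list_of_lines)))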
[('Consumer Price Index:', '+0.2% in Sep 2020'), ('Unemployment Rate:', '+7.9% in Sep 2020'), ('Producer Price Index:', '+0.4% in Sep 2020'), ('Employment Cost Index:', '+0.5% in 2nd Qtr of 2020'), ('Productivity:', '+10.1% in 2nd Qtr of 2020'), ('Import Price Index:', '+0.3% in Sep 2020'), ('Export Price Index:', '+0.6% in Sep 2020')]
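
The pairs above are two-element tuples, while the question asks for three fields per entry. A possible final step, sketched here as an assumption rather than as part of the original answer, strips the trailing colon from the label and splits the value line on the first " in " with str.partition. It uses a shortened sample of the text, and the names lines, it and result are illustrative.

```python
# Shortened sample of the text from the question (two entries only).
text = '''Consumer Price Index:
        +0.2% in Sep 2020

        Employment Cost Index:
        +0.5% in 2nd Qtr of 2020'''

# Strip each line and drop the blank ones, as in the answer.
lines = [line.strip() for line in text.split("\n") if line.strip()]

# zip(it, it) pulls two items per step from the same iterator,
# which yields (label, value) pairs just like pairwise() above.
it = iter(lines)

result = []
for label, value in zip(it, it):
    # Split "+0.2% in Sep 2020" on the first " in " only.
    change, _, period = value.partition(" in ")
    result.append((label.rstrip(":"), change, period))

print(result)
```

This relies on every label line being followed by exactly one value line; a missing or extra line would shift all later pairs.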
