Python regular expression extract the text between two values

Question

What regular expression can I use to extract the text between two values?

in:

<office:annotation office:name="__Annotation__45582_97049284">
</office:annotation>
    case 1 there can be an arbitrary text with any symbols
<office:annotation-end office:name="__Annotation__45582_97049284"/>

<office:annotation office:name="__Annotation__19324994_2345354">
</office:annotation>
    case 2there can be an arbitrary text with any symbols
<office:annotation-end office:name="__Annotation__19324994_2345354"/>

out:

list = [
'case 1 there can be an arbitrary text with any symbols',
'case 2 there can be an arbitrary text with any symbols',
]

You'll be better off using an xml parser.

– Paulo Bu, Commented Jul 5, 2014 at 12:49

Avinash Raj · Accepted Answer · 2014-07-05 13:27:03Z

3

It's better to use an XML parser. If you want a regex solution, try the following:

>>> import re
>>> text = """<office:annotation office:name="__Annotation__45582_97049284">
... </office:annotation>
...     case 1 there can be an arbitrary text with any symbols
... <office:annotation-end office:name="__Annotation__45582_97049284"/>
... 
... <office:annotation office:name="__Annotation__19324994_2345354">
... </office:annotation>
...     case 2there can be an arbitrary text with any symbols
... <office:annotation-end office:name="__Annotation__19324994_2345354"/>"""
>>> m = re.findall(r'<\/office:annotation>\s*(.*)(?=\n<office:annotation-end)', text)
>>> m
['case 1 there can be an arbitrary text with any symbols', 'case 2there can be an arbitrary text with any symbols']

Alternatively, a better regex, which also matches text that spans several lines, would be:

<\/office:annotation>([\w\W\s]*?)(?=\n?<office:annotation-end)
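
As a rough illustration of how the lazy pattern above could be used from Python, here is a minimal sketch. The pattern is slightly simplified (`[\w\W]` already covers whitespace, and `/` needs no escaping in Python), and the function name `extract_between` and the shortened sample strings are made up for this example. Because the lazy group also captures the leading newline and indentation, each result is stripped.

```python
import re

# Simplified form of the lazy pattern: [\w\W] matches any character,
# including newlines, so no re.DOTALL flag is needed.
PATTERN = re.compile(r'</office:annotation>([\w\W]*?)(?=\n?<office:annotation-end)')


def extract_between(text):
    """Return the stripped text between each </office:annotation> and the next annotation-end tag."""
    return [chunk.strip() for chunk in PATTERN.findall(text)]


# Hypothetical, shortened inputs: one formatted across lines, one on a single line.
multi_line = '''<office:annotation office:name="a">
</office:annotation>
    case 1 text
<office:annotation-end office:name="a"/>'''

single_line = ('<office:annotation office:name="b"></office:annotation>'
               'case 2 text<office:annotation-end office:name="b"/>')

print(extract_between(multi_line))   # ['case 1 text']
print(extract_between(single_line))  # ['case 2 text']
```

This handles both the multi-line and the single-line layout, but it still depends on the exact tag spelling, so the XML parser advice continues to apply.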

edited Jul 5, 2014 at 13:27

answered Jul 5, 2014 at 12:58


4 Comments

Andrew Nodermann Over a year ago

Strange, this seems to be what I need, but your example returns an empty list.

Andrew Nodermann Over a year ago

It does not work when everything is on the same line without formatting (spaces, line breaks): regex101.com/r/rK2hC3/1

Andrew Nodermann Over a year ago

Thank you very much, it works! (It was not me who downvoted you.)

Lukas Graf Over a year ago

It may look like it works, but it doesn't. It can't. It will break at the slightest change to your input.

Community · Accepted Answer · 2017-05-23 12:11:01Z

Since this is a namespaced XML document, you'll have to deal with those namespaces when selecting nodes. See this answer for details.

Here's how you would parse it using lxml and xpath expressions:

data.xml

<?xml version='1.0' encoding='UTF-8'?>
<document xmlns:office="http://www.example.org/office">

    <office:annotation office:name="__Annotation__45582_97049284">
    </office:annotation>
        case 1 there can be an arbitrary text with any symbols
    <office:annotation-end office:name="__Annotation__45582_97049284"/>

    <office:annotation office:name="__Annotation__19324994_2345354">
    </office:annotation>
        case 2there can be an arbitrary text with any symbols
    <office:annotation-end office:name="__Annotation__19324994_2345354"/>

</document>

parse.py

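from lxml import etree

# Parse the file and reuse the namespace map declared on the root element
tree = etree.parse('data.xml')
root = tree.getroot()
nsmap = root.nsmap

# Select every <office:annotation> element, at any depth
annotations = root.xpath('//office:annotation', namespaces=nsmap)

comments = []
for annotation in annotations:
    # .tail is the text that follows the element's closing tag; it may be None
    comment = (annotation.tail or '').strip()
    comments.append(comment)

print(comments)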

Output:

['case 1 there can be an arbitrary text with any symbols',
 'case 2there can be an arbitrary text with any symbols']
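
If lxml is not available, roughly the same idea can be sketched with the standard library's `xml.etree.ElementTree`. This is only an illustration under a few assumptions: the namespace URI is the placeholder from the sample data.xml above (a real document would declare its own), and the function name `annotation_comments` and the shortened sample are made up for this example.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI copied from the sample data.xml, not a real schema URI.
OFFICE_NS = 'http://www.example.org/office'


def annotation_comments(xml_text):
    """Collect the text that directly follows each <office:annotation> element."""
    root = ET.fromstring(xml_text)
    comments = []
    # ElementTree addresses namespaced tags as '{namespace-uri}localname'.
    for annotation in root.iter('{%s}annotation' % OFFICE_NS):
        tail = (annotation.tail or '').strip()  # .tail is None when no text follows
        if tail:
            comments.append(tail)
    return comments


# Hypothetical, shortened sample document.
SAMPLE = '''<document xmlns:office="http://www.example.org/office">
    <office:annotation office:name="a">
    </office:annotation>
        case 1 text
    <office:annotation-end office:name="a"/>
</document>'''

print(annotation_comments(SAMPLE))  # ['case 1 text']
```

The exact tag match means `office:annotation-end` elements are not picked up by the loop.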

kevr · Accepted Answer · 2014-07-05 13:29:27Z

0

>>> import re
>>> # text holds the input string from the question
>>> regex = re.compile(r'</.+>\s*(.+)\s*<.+>')
>>> matched = regex.findall(text)
>>> print(matched)
['case 1 there can be an arbitrary text with any symbols', 'case 2there can be an arbitrary text with any symbols']

Edit: There we go. Bah.. these edit points.
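
The generic pattern above accepts any closing tag followed by any other tag. A stricter variant, which is not part of the original answer and is offered only as a sketch, uses a backreference so that each `office:annotation` is paired with the `office:annotation-end` carrying the same `office:name`. The names `PAIRED` and `extract_by_name` and the shortened sample are made up for this example.

```python
import re

PAIRED = re.compile(
    r'<office:annotation office:name="([^"]+)">'   # group 1: the annotation name
    r'.*?</office:annotation>'                     # skip the annotation body
    r'(.*?)'                                       # group 2: the text we want
    r'<office:annotation-end office:name="\1"/>',  # end tag with the same name
    re.DOTALL,                                     # let '.' match newlines too
)


def extract_by_name(text):
    """Map each annotation name to the stripped text between its start and end markers."""
    return {m.group(1): m.group(2).strip() for m in PAIRED.finditer(text)}


# Hypothetical, shortened sample with two annotations.
sample = '''<office:annotation office:name="first">
</office:annotation>
    case 1 text
<office:annotation-end office:name="first"/>

<office:annotation office:name="second">
</office:annotation>
    case 2 text
<office:annotation-end office:name="second"/>'''

print(extract_by_name(sample))
```

Pairing by name avoids matching unrelated tags, although it is still a regex over XML and will break if the attribute order or quoting changes.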

edited Jul 5, 2014 at 13:29

answered Jul 5, 2014 at 13:22

