0

What regular expression can extract the text between two values?

in:

<office:annotation office:name="__Annotation__45582_97049284">
</office:annotation>
    case 1 there can be an arbitrary text with any symbols
<office:annotation-end office:name="__Annotation__45582_97049284"/>

<office:annotation office:name="__Annotation__19324994_2345354">
</office:annotation>
    case 2there can be an arbitrary text with any symbols
<office:annotation-end office:name="__Annotation__19324994_2345354"/>

out:

list = [
'case 1 there can be an arbitrary text with any symbols',
'case 2 there can be an arbitrary text with any symbols',
]
1
  • 3
    You'll be better off using an xml parser. Commented Jul 5, 2014 at 12:49

3 Answers

3

It's better to use an XML parser. If you want a regex solution, try the following:

>>> import re
>>> text = """<office:annotation office:name="__Annotation__45582_97049284">
... </office:annotation>
...     case 1 there can be an arbitrary text with any symbols
... <office:annotation-end office:name="__Annotation__45582_97049284"/>
... 
... <office:annotation office:name="__Annotation__19324994_2345354">
... </office:annotation>
...     case 2there can be an arbitrary text with any symbols
... <office:annotation-end office:name="__Annotation__19324994_2345354"/>"""
>>> m = re.findall(r'<\/office:annotation>\s*(.*)(?=\n<office:annotation-end)', text)
>>> m
['case 1 there can be an arbitrary text with any symbols', 'case 2there can be an arbitrary text with any symbols']

OR

A more tolerant regex, which also matches when there are no line breaks, would be:

<\/office:annotation>([\w\W\s]*?)(?=\n?<office:annotation-end)
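As a rough sketch of how that second pattern might be used from Python: the lazy group also captures the surrounding whitespace, so each match is stripped afterwards. The helper name `extract_annotation_texts` and the shortened `office:name` values ("a", "b") are made up for illustration and are not part of the original answer.

```python
import re

# Pattern from the answer above; the lazy group stops at the first annotation-end tag.
PATTERN = re.compile(r'<\/office:annotation>([\w\W\s]*?)(?=\n?<office:annotation-end)')


def extract_annotation_texts(xml_text):
    """Return the stripped text between each </office:annotation> and the next <office:annotation-end."""
    return [chunk.strip() for chunk in PATTERN.findall(xml_text)]


# Hypothetical inputs: one formatted across several lines, one on a single line.
multi_line = '''<office:annotation office:name="a">
</office:annotation>
    case 1 there can be an arbitrary text with any symbols
<office:annotation-end office:name="a"/>'''

single_line = ('<office:annotation office:name="b"></office:annotation>'
               'case 2 on one line'
               '<office:annotation-end office:name="b"/>')

print(extract_annotation_texts(multi_line))
print(extract_annotation_texts(single_line))
```

This still depends on the exact tag layout, so it is only reasonable for input whose shape you control.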
4 Comments

Strange, this seems to be what I need, but your example returns an empty list.
It does not work when everything is on one line without formatting (spaces, line breaks): regex101.com/r/rK2hC3/1
Thank you very much, it works! (It was not me who downvoted you.)
It may look like it works, but it doesn't. It can't. It will break at the slightest change to your input.
0

Since this is a namespaced XML document, you'll have to deal with those namespaces when selecting nodes. See this answer for details.

Here's how you would parse it using lxml and xpath expressions:

data.xml

<?xml version='1.0' encoding='UTF-8'?>
<document xmlns:office="http://www.example.org/office">

    <office:annotation office:name="__Annotation__45582_97049284">
    </office:annotation>
        case 1 there can be an arbitrary text with any symbols
    <office:annotation-end office:name="__Annotation__45582_97049284"/>

    <office:annotation office:name="__Annotation__19324994_2345354">
    </office:annotation>
        case 2there can be an arbitrary text with any symbols
    <office:annotation-end office:name="__Annotation__19324994_2345354"/>

</document>

parse.py

from lxml import etree

tree = etree.parse('data.xml')
root = tree.getroot()
nsmap = root.nsmap

annotations = root.xpath('//office:annotation', namespaces=nsmap)

comments = []
for annotation in annotations:
    comment = annotation.tail.strip()
    comments.append(comment)

print(comments)

Output:

['case 1 there can be an arbitrary text with any symbols',
 'case 2there can be an arbitrary text with any symbols']
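If lxml is not installed, the same tail-based idea can be sketched with the standard library's `xml.etree.ElementTree`. This is only an illustration: the document is inlined as a string instead of being read from data.xml, and the namespace URI is the placeholder from the example above, not the real OpenDocument one.

```python
import xml.etree.ElementTree as ET

# Inlined copy of the sample document (assumption: same structure as data.xml above).
XML = """<document xmlns:office="http://www.example.org/office">
    <office:annotation office:name="__Annotation__45582_97049284">
    </office:annotation>
        case 1 there can be an arbitrary text with any symbols
    <office:annotation-end office:name="__Annotation__45582_97049284"/>
    <office:annotation office:name="__Annotation__19324994_2345354">
    </office:annotation>
        case 2there can be an arbitrary text with any symbols
    <office:annotation-end office:name="__Annotation__19324994_2345354"/>
</document>
"""

# Prefix-to-URI mapping used to resolve "office:" in the path expression.
NS = {"office": "http://www.example.org/office"}

root = ET.fromstring(XML)

# The wanted text is the "tail" of each annotation element, i.e. the text
# that follows its closing tag; guard against a missing tail (None).
comments = [
    (node.tail or "").strip()
    for node in root.iterfind(".//office:annotation", NS)
]

print(comments)
```

Because the path names the `annotation` element exactly, the `annotation-end` elements are not selected.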

0
>>> import re
>>> # `text` is assumed to hold the XML snippet from the question
>>> regex = re.compile(r'</.+>\s*(.+)\s*<.+>')
>>> matched = re.findall(regex, text)
>>> print(matched)
['case 1 there can be an arbitrary text with any symbols', 'case 2there can be an arbitrary text with any symbols']

Edit: There we go. Bah.. these edit points.
