3

I have an XML file, e.g.:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    First line. <br/> Second line.
</root>

As output I want to get: '\nFirst line. <br/> Second line.\n'. Note that if the root element contains other nested elements, they should be returned as is.

3
  • So you just want to strip off the start and end tags of the root element? Commented Jul 12, 2011 at 20:12
  • Basically, yes. But I need a general-purpose approach: the XML may not look exactly like this, e.g. it can contain a <!DOCTYPE> declaration, etc. Commented Jul 13, 2011 at 14:13
  • Do you want the parsed content of the root element (which might include expanded entities for example), or do you simply want the verbatim string between the start and end tags? Commented Jul 13, 2011 at 15:45

2 Answers

3

The first thing I came up with:

from xml.etree.ElementTree import fromstring, tostring

source = '''<?xml version="1.0" encoding="UTF-8"?>
<root>
    First line.<br/>Second line.
</root>
'''

xml = fromstring(source)
# encoding='unicode' makes tostring return str rather than bytes (Python 3)
result = tostring(xml, encoding='unicode').lstrip('<%s>' % xml.tag).rstrip('</%s>' % xml.tag)

print(result)

# output:
#
#     First line.<br />Second line.
#

But it is not a truly general-purpose approach, since it fails if the opening tag of the root element (<root>) contains any attributes.

UPDATE: This approach has another issue. lstrip and rstrip treat their argument as a set of characters to remove, not as a literal prefix or suffix, so they can strip too much:

# input:
<?xml version="1.0" encoding="UTF-8"?><root><p>First line</p></root>

# result:
p>First line</p
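To make the character-set behaviour concrete, here is a minimal sketch that reproduces the result above and contrasts it with literal prefix/suffix removal. The helper name strip_affixes is hypothetical, introduced only for illustration.

```python
s = '<root><p>First line</p></root>'

# lstrip/rstrip remove any leading/trailing characters found in the
# argument, so '<' and '>' of the nested <p> element are eaten as well.
stripped = s.lstrip('<root>').rstrip('</root>')
print(stripped)  # p>First line</p


def strip_affixes(text, prefix, suffix):
    """Remove an exact prefix and an exact suffix, if present (hypothetical helper)."""
    if text.startswith(prefix):
        text = text[len(prefix):]
    if suffix and text.endswith(suffix):
        text = text[:-len(suffix)]
    return text


print(strip_affixes(s, '<root>', '</root>'))  # <p>First line</p>
```

Literal removal fixes the over-stripping, but it still assumes the opening tag is exactly '<root>', so it does not help when the root element has attributes.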

If you really need only the literal string between the opening and closing tags (as you mentioned in the comment), you can use this:

from xml.etree.ElementTree import fromstring, tostring

source = '''<?xml version="1.0" encoding="UTF-8"?>
<root attr1="val1">
    First line.<br/>Second line.
</root>
'''

# The following two lines are needed just to drop
# the XML declaration, doctype, etc.
xml = fromstring(source)
xml_str = tostring(xml, encoding='unicode')  # str rather than bytes (Python 3)

# End of the root's opening tag and start of its closing tag
start = xml_str.index('>')
end = xml_str.rindex('<')

result = xml_str[start + 1:end]

Not the most elegant approach, but unlike the previous one it works correctly with attributes in the opening tag as well as with any valid XML document.
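An alternative that avoids searching the serialized string is to rebuild the content from the parsed tree: the root's leading text plus each child serialized in turn. This is only a sketch, not part of the original answer; the function name inner_xml is hypothetical. It returns the parsed and re-serialized content, so details such as `<br/>` versus `<br />` can differ from the input.

```python
from xml.etree.ElementTree import fromstring, tostring


def inner_xml(source):
    """Return the serialized content of the root element (hypothetical helper)."""
    root = fromstring(source)
    # Text that appears before the first child element, if any.
    parts = [root.text or '']
    for child in root:
        # tostring() serializes the child together with its tail,
        # i.e. the text that follows the child's closing tag.
        parts.append(tostring(child, encoding='unicode'))
    return ''.join(parts)


source = '''<?xml version="1.0" encoding="UTF-8"?>
<root attr1="val1">
    First line.<br/>Second line.
</root>
'''

print(repr(inner_xml(source)))  # '\n    First line.<br />Second line.\n'
```

Because nothing is sliced out of a string, attributes on the root element and an empty root element need no special handling here.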


2 Comments

xml_str = tostring() should be xml_str = tostring(xml).
Namespaces can mess things up. For example, if the root element has a xmlns="http://foo.com" declaration, your solution does not quite work.
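The namespace problem can be seen with a short sketch. ElementTree stores tags in the form {uri}name and invents prefixes such as ns0 when it serializes, so the output no longer matches the input text. Registering the URI under an empty prefix is one possible workaround; note that register_namespace changes global state for the whole process, so treat this as an illustration rather than a recommendation.

```python
from xml.etree.ElementTree import fromstring, tostring, register_namespace

source = '<root xmlns="http://foo.com"><p>First line</p></root>'
root = fromstring(source)

# Without further setup, the serializer generates its own prefix (ns0).
generated = tostring(root, encoding='unicode')
print(generated)

# Map the URI to the empty prefix so it is written as a default namespace.
register_namespace('', 'http://foo.com')
restored = tostring(root, encoding='unicode')
print(restored)
```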
0

Parse from file:

from xml.etree.ElementTree import parse
tree = parse('yourxmlfile.xml')
print(tree.getroot().text)

Parse from string:

from xml.etree.ElementTree import fromstring
print(fromstring(yourxmlstr).text)

5 Comments

Thanks. But how can I parse XML from string, not from file?
That would return '\n First line. ', not what the OP wanted.
@asm from string: use xml.etree.ElementTree.fromstring.
@Santa Unfortunately xml.etree.ElementTree.fromstring returns an Element, not an ElementTree, so it has no getroot() method.
@Santa You are right. But unfortunately you are also right with your first comment :). text attribute returns only \n First line.
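The behaviour discussed in these comments follows from how ElementTree splits mixed content: an element's text attribute holds only the text before its first child, and the text after a child is stored in that child's tail. A small sketch using the question's sample (without the XML declaration):

```python
from xml.etree.ElementTree import fromstring

root = fromstring('<root>\n    First line. <br/> Second line.\n</root>')

# Only the text before the first child element.
print(repr(root.text))                # '\n    First line. '

# The text after <br/> belongs to the <br> element's tail.
print(repr(root[0].tail))             # ' Second line.\n'

# itertext() yields all text pieces but drops the markup itself.
print(repr(''.join(root.itertext())))  # '\n    First line.  Second line.\n'
```

So neither text nor itertext() alone gives what the question asks for, because the `<br/>` element itself is lost.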
