10

How do I search an entire XML file for a specific text pattern and replace each occurrence with a new text pattern in Python 3.5?

Everything else (format, attributes, comments, etc.) needs to remain as it is in the original xml file.

I am running Python 3.5.1 on Windows (win32).

Specifically, I would like to replace each occurrence of "FEATURE NAME" with "THIS WORKED" and replace each occurrence of "FEATURE NUMBER" with "12345".

I have been trying to learn Python and xml.etree.ElementTree but cannot figure this out. I already looked at "Search and replace a line in a .xml file in Python", "Search and replace a line in a file in Python", "How to search and replace text in a file using Python?", and other existing Q&As on this site without success. I'm not an experienced programmer, so please let me know if more input is needed. Your help is greatly appreciated!

Here is a copy of what the xml code looks like when I open it in Notepad (except I added spaces to indent each line and hit return for some lines when I pasted it into this question):

<description-topic>
    <access-info>
        <index-term-set>
            <index-term>
                <primary>FID FEATURE NUMBER</primary>
            </index-term>
            <index-term>
                <primary>FEATURE NAME</primary>
            </index-term>
            <index-term>
                <primary>Common features</primary>
                <secondary>FID FEATURE NUMBER</secondary>
            </index-term>
        </index-term-set>
    </access-info>
    <title>FEATURE NUMBER - FEATURE NAME</title>
    <block>
        <label>Platform</label>
        <comment>REVIEWERS: I guessed at the FEATURE NAME</comment>
        <para>
            This feature applies to the following platforms: FEATURE NAME<!--Check the values--></para>
    </block>
    <block branch="no">
        <label>Feature Benefits</label>
        <para>
            <comment>REVIEWERS: What do we put here? See template (link given in review email) for more information.</comment>
        </para>
    </block>
    <block branch="no">
        <label>Dependencies</label>
        <para/>
        <subblock>
            <label>Features</label>
            <comment>What FEATURE NAME do we put here?</comment>
        </subblock>
        <subblock>
            <label>Hardware</label>
            <comment>What FEATURE NAME do we put here?</comment>
            <para>This feature applies to the following: FEATURE NUMBER and text.</para><?Pub Caret -1?>
        </subblock>
        <subblock>
            <label>Dependencies outside the eNodeB</label>
            <comment>What FEATURE NAME do we put here?</comment>
        </subblock>
    </block>
    <block branch="no">
        <label>Impacts</label>
        <comment>REVIEWERS: What FEATURE NUMBER do we put here?</comment>
        <para>
            <comment/>
        </para>
    </block>
</description-topic>

Here is the latest code I am trying to get to work:

from xml.etree import ElementTree as et
tree = et.parse('Atemplate2.xml')
tree.find('description-topic/access-info/index-term-set/index-term/primary/').text = '12345'
tree.write('Atemplate2.xml')

I get the following error:

Traceback (most recent call last):
  File "ajktest18.py", line 15, in <module>
    tree.find('description-topic/access-info/index-term-set/index-term/primary/').text = '12345'
AttributeError: 'NoneType' object has no attribute 'text'

I would prefer to be able to search and modify any occurrences in the entire file, but I can't figure out how to get to even one specific occurrence of the text I am searching for.

Here is the code I tried to use to find the path:

import xml.etree.ElementTree as ET
tree = ET.parse('Atemplate.xml')
root = tree.getroot()

print(root.tag, root.attrib, root.text)

for child in root:
    print(child.tag, child.attrib, child.text)
for label in root.iter('label'):
    print(label.tag, label.attrib, label.text)
for title in root.iter('title'):
    print(title.attrib)

I also tried the following code:

with open('Atemplate2.xml') as f:
    tree = ET.parse(f)
    root = tree.getroot()

for elem in root.getiterator():
    try:
        elem.text = elem.text.replace('FEATURE NAME', 'THIS WORKED')
        elem.text = elem.text.replace('FEATURE NUMBER', '12345')
    except AttributeError:
        pass

tree.write('output.xml')

but that gives the following error:

File "<pyshell#40>", line 2, in <module>
    tree = ET.parse(f)
File "C:\MyPath\Python35-32\lib\xml\etree\ElementTree.py", line 1182, in parse
    tree.parse(source, parser)
File "C:\MyPath\Python35-32\lib\xml\etree\ElementTree.py", line 594, in parse
    self._root = parser._parse_whole(source)
File "C:\MyPath\Python35-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1119: character maps to <undefined>

FINAL UPDATE - Here is the code that worked for me in the end (thank you, Jarad!):

import lxml.etree as ET
#using lxml instead of xml preserved the comments

#adding the encoding when the file is opened and written is needed to avoid a charmap error
with open('filename.xml', encoding="utf8") as f:
  tree = ET.parse(f)
  root = tree.getroot()


  for elem in root.getiterator():
    try:
      elem.text = elem.text.replace('FEATURE NAME', 'THIS WORKED')
      elem.text = elem.text.replace('FEATURE NUMBER', '123456')
    except AttributeError:
      pass

#tree.write('output.xml', encoding="utf8")
# Adding the xml_declaration and method helped keep the header info at the top of the file.
tree.write('output.xml', xml_declaration=True, method='xml', encoding="utf8")
3
  • I am trying to learn how to do this and automate something. That is why I am here. I have Python training books and have created and modified programs, but I can't figure out how to do this part. My point with my comment is that I might be missing something very easy and obvious, so please let me know if additional input is needed to help answer my question. Thank you to anyone who can help. Commented Jun 16, 2016 at 20:41
  • Fair enough, then you should post the code you have written, and check how to ask. Commented Jun 16, 2016 at 21:23
  • @spectras - I posted more information including code and deleted the generic abc text reference but kept the more specific text. Thanks Commented Jun 17, 2016 at 21:14

3 Answers

12

Caveats:

  • I have never worked with the xml.etree.ElementTree library
  • I have never worked with it because I never find myself manipulating XML
  • I don't know if this is the "best" way compared to what someone who knows the library inside and out would do
  • Commenters seem set on judging you instead of helping you out

This is a modification from this excellent answer. The thing is, you need to read the XML file in and parse it.

import xml.etree.ElementTree as ET

with open('xmlfile.xml', encoding='latin-1') as f:
  tree = ET.parse(f)
  root = tree.getroot()

  # iter() replaces getiterator(), which was removed in Python 3.9
  for elem in root.iter():
    if elem.text:  # skip elements that have no text
      elem.text = elem.text.replace('FEATURE NAME', 'THIS WORKED')
      elem.text = elem.text.replace('FEATURE NUMBER', '123456')

tree.write('output.xml', encoding='latin-1')

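Note that you can change the encoding parameter to match your file, for example utf-8, cp1252, or ISO-8859-1 (another name for latin-1). The right value depends on how the file was saved. On Windows, open() falls back to the locale's code page (cp1252 in the traceback above) when no encoding is given, which is what triggers the charmap error.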
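One way to avoid guessing the encoding, as a minimal sketch: pass the file name to ET.parse instead of a file opened in text mode, so the parser reads the raw bytes and applies the encoding named in the XML declaration. The file names and sample content here are made up for illustration, and the sketch assumes the file declares its encoding correctly.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample file: UTF-8 with curly quotes (U+201D encodes to
# bytes E2 80 9D, so it contains the 0x9d byte that cp1252 cannot decode).
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<para>FEATURE NAME \u201dquoted\u201d</para>'
)

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "sample.xml")
dst = os.path.join(workdir, "output.xml")
with open(src, "w", encoding="utf-8") as f:
    f.write(sample)

# Passing a path (not a text-mode file object) lets the parser decode
# the bytes itself, using the encoding from the XML declaration.
tree = ET.parse(src)
root = tree.getroot()
root.text = root.text.replace("FEATURE NAME", "THIS WORKED")

# Write the result back out as UTF-8 with an XML declaration.
tree.write(dst, encoding="utf-8", xml_declaration=True)

with open(dst, encoding="utf-8") as f:
    result = f.read()
```

If the declaration is missing or wrong, an explicit encoding on open(), as in the answer above, is still the fallback.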

8 Comments

Thank you. The "excellent answer" post you provided looks just like what I am trying to do.
I tried the code you provided and get the following error: Traceback (most recent call last): File "<pyshell#40>", line 2, in <module> tree = ET.parse(f) File "C:\MyPath\Python35-32\lib\xml\etree\ElementTree.py", line 1182, in parse tree.parse(source, parser) File "C:\ MyPath \Python35-32\lib\xml\etree\ElementTree.py", line 594, in parse self._root = parser._parse_whole(source)
It's hard to say without seeing your code. Can you append your post with an edit showing your new approach?
I edited my post and included my code for the example you provided. Thanks for your help.
It seems to me like it might be an encoding issue. Change with open('xmlfile.xml') as f: to have an encoding parameter like this with open('xmlfile.xml', encoding='utf-8') as f:. If utf-8 doesn't work, try any of these: cp1252, latin-1. Also, when you write in tree.write('output.xml'), be sure to add the same encoding parameter so the output file preserves the encoding. I updated my answer.
0

Only this solution worked for me.

import lxml.etree as ET

tree = ET.parse("input.xml")
root = tree.getroot()
# Walk every node and swap the text wherever there is any
for elem in root.iter():
    if elem.text is not None:
        elem.text = elem.text.replace('abc', 'xyz')
# Keep the XML declaration and write the result as UTF-16
tree.write('output.xml', xml_declaration=True, method='xml', encoding="utf-16")
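One gap worth noting in loops like this: they only touch elem.text, so text that follows a child element (stored in that child's .tail) is left unchanged. Below is a rough sketch that covers both, written against the standard-library xml.etree.ElementTree. The helper name replace_everywhere and the sample XML are invented for illustration. The same loop shape should carry over to lxml, though that is an assumption not verified here.

```python
import xml.etree.ElementTree as ET


def replace_everywhere(root, replacements):
    """Apply each old -> new pair to the .text and .tail of every element."""
    for elem in root.iter():
        for attr in ("text", "tail"):
            value = getattr(elem, attr)
            if value:  # None or empty means there is nothing to replace
                for old, new in replacements.items():
                    value = value.replace(old, new)
                setattr(elem, attr, value)


# " then FEATURE NAME" sits after </comment>, so it is comment.tail,
# not para.text, and a text-only loop would miss it.
root = ET.fromstring(
    "<para>FEATURE NAME first"
    "<comment>about FEATURE NUMBER</comment>"
    " then FEATURE NAME</para>"
)
replace_everywhere(
    root, {"FEATURE NAME": "THIS WORKED", "FEATURE NUMBER": "12345"}
)
result = ET.tostring(root, encoding="unicode")
```

Attribute values are not covered by this sketch; if the pattern can appear there, elem.attrib would need the same treatment.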

0

Just a note: xml.etree.ElementTree deprecated .getiterator() in Python 3.2 and removed it in Python 3.9, so the standard-library snippets above that call it raise an AttributeError on a current interpreter.

.iter() is the replacement. It takes the same optional tag argument and can be swapped in directly.
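A minimal sketch of the swap, using a made-up inline document rather than the file from the question:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<block><label>FEATURE NAME</label><para/></block>")

# iter() visits the root and every descendant in document order.
for elem in root.iter():
    if elem.text:  # <para/> has no text, so its .text is None
        elem.text = elem.text.replace("FEATURE NAME", "THIS WORKED")

tags = [elem.tag for elem in root.iter()]
label_text = root.find("label").text
```

Checking elem.text before calling replace() also removes the need for the try/except AttributeError wrapper used in the earlier snippets.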
