I am trying to update a number of XML files in a specific directory.

I want to update the data between the <author> and <price> tags, e.g. to <author>Smith, Paul</author> and <price>79.99</price>. This needs to be applied to all the files within the same directory.

My Python code does make the update, but it inserts new author and price data when it should overwrite the existing data.

Any help will be much appreciated. See the XML and Python code below.

Below is one of the many XML files I am trying to update:

<?xml version="1.0"?>
<catalog>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.</description>
   </book>
</catalog>

Current Python code, trying to loop through all files in a specific directory:

import os
import re
my_dir = r'C:\Users\MyDocuments\Desktop\XML Test data1'

replace_what_author = '(?<=<author>)((((.)*,)*)(.)*)(?=</)'
replace_with_author = 'Bloggs, Joe'

replace_what_price = '(?<=<price>)(\d+\.\d\d(?!\d))(?=</)'
replace_with_price = '10.99' 

for fn in os.listdir(my_dir):
    #print(fn)
    pathfn = os.path.join(my_dir,fn)

    if os.path.isfile(pathfn):
        file = open(pathfn, 'r+')

        new_file_content=''


        for line in file:
            auth = re.compile(replace_what_author)
            pri = re.compile(replace_what_price) 

            new_file_content += auth.sub(replace_with_author,line)
            new_file_content += pri.sub(replace_with_price,line)       

        file.seek(0)
        file.truncate()
        file.write(new_file_content)
        file.close()
  • you should use the xml module and not regex for a better solution Commented Jan 4, 2020 at 19:49
  • Can you show the output xml after executing your code and what is the expected output? Commented Jan 4, 2020 at 20:04
  • Do not use regex on XML: You can't parse X|HTML with regex. Commented Jan 4, 2020 at 20:28
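
A note on why the question's loop appears to insert data rather than overwrite it: each line is appended to new_file_content twice, once after the author substitution and once after the price substitution, so every line lands in the file two times. Below is a minimal sketch of the regex approach with that fixed, applying both substitutions to the same line and appending it once. The helper name replace_fields and the simplified patterns are assumptions for illustration, and the comments above still apply: an XML parser is the more robust choice.

```python
import re

# Simplified versions of the question's patterns (an assumption, not the
# original expressions): match only the text between the two tags.
AUTHOR_RE = re.compile(r'(?<=<author>).*?(?=</author>)')
PRICE_RE = re.compile(r'(?<=<price>)\d+\.\d\d(?=</price>)')


def replace_fields(line, author, price):
    """Apply both substitutions to the same line and return it once."""
    line = AUTHOR_RE.sub(author, line)
    return PRICE_RE.sub(price, line)


lines = [
    '<author>Corets, Eva</author>\n',
    '<title>Maeve Ascendant</title>\n',
    '<price>5.95</price>\n',
]

# Each input line contributes exactly one output line.
new_content = ''.join(replace_fields(line, 'Bloggs, Joe', '10.99') for line in lines)
```

Chaining the two substitutions keeps the line count unchanged, which is the behaviour the question asks for.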

2 Answers

You should use the built-in xml module to modify your XML files instead of regex. First, import the module and declare this constant at the top of your script.

import xml.etree.ElementTree as ET

map_ = {'author': 'Bloggs, Joe', 'price': '10.99'} 

Then after the following line

if os.path.isfile(pathfn):

I would change the logic to the following, so it would look like this

if os.path.isfile(pathfn):
    tree = ET.parse(pathfn)
    root = tree.getroot()

    for key in map_:
        for child in root.iter(key):
            child.text = map_[key]
    tree.write(pathfn, xml_declaration=True)

Then the file ends up looking like this

<catalog>
   <book id="bk103">
      <author>Bloggs, Joe</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>10.99</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.</description>
   </book>
</catalog>
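
Putting the pieces of this answer together, a self-contained version of the directory loop might look like the sketch below. The function names update_xml_file and update_directory, the restriction to *.xml files, and the explicit utf-8 encoding are assumptions added for illustration; they are not part of the original answer.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Tag name -> replacement text, as in the answer above.
NEW_VALUES = {'author': 'Bloggs, Joe', 'price': '10.99'}


def update_xml_file(path, new_values):
    """Overwrite the text of every matching element in one XML file."""
    tree = ET.parse(path)
    root = tree.getroot()
    for tag, text in new_values.items():
        for element in root.iter(tag):
            element.text = text
    # xml_declaration=True writes an XML declaration at the top of the file.
    tree.write(path, encoding='utf-8', xml_declaration=True)


def update_directory(directory, new_values):
    """Update every *.xml file directly inside directory; return their names."""
    updated = []
    for path in sorted(Path(directory).glob('*.xml')):
        update_xml_file(path, new_values)
        updated.append(path.name)
    return updated

# Hypothetical usage; substitute your own directory:
# update_directory(r'C:\path\to\xml_files', NEW_VALUES)
```

Filtering on the .xml suffix avoids trying to parse unrelated files that happen to sit in the same directory, which os.path.isfile alone would not prevent.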

4 Comments

Hi @aws_apprentice, thank you, that's really helpful and a lot simpler. I have noticed one change to the XML files: each file loses the <?xml version="1.0"?> declaration from the very top. Is there another step I need to add? Apologies if I am not making any sense; happy to clarify.
See below what I have tried, taking on your suggestion (revised comment):
import xml.etree.ElementTree as ET
import os
my_dir = r'C:\Users\riati\Desktop\01. Data Training\04. Python Spyder testing\XML Test data1'
map_ = {'author': 'islam, Riat', 'price': '14.99'}
for fn in os.listdir(my_dir):
    #print(fn)
    pathfn = os.path.join(my_dir,fn)
    if os.path.isfile(pathfn):
        tree = ET.parse(pathfn)
        root = tree.getroot()
        for key in map_:
            for child in root.iter(key):
                child.text = map_[key]
        tree.write(pathfn)
I updated the answer; we just need to add a flag to tree.write, so the line becomes tree.write(pathfn, xml_declaration=True). That should preserve the declaration at the top.
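
To see the effect of that flag in isolation, here is a small sketch that serializes the same tree to memory with and without it. The serialize helper and the tiny sample document are assumptions for illustration only.

```python
import io
import xml.etree.ElementTree as ET

tree = ET.ElementTree(ET.fromstring('<catalog><book id="bk103"/></catalog>'))


def serialize(tree, **write_kwargs):
    """Write the tree to an in-memory buffer and return the bytes."""
    buffer = io.BytesIO()
    tree.write(buffer, **write_kwargs)
    return buffer.getvalue()


# Default call: no XML declaration is emitted.
without_declaration = serialize(tree)

# With the flag: a declaration line is written before the root element.
with_declaration = serialize(tree, encoding='utf-8', xml_declaration=True)
```

Note that ElementTree writes its own declaration, with single quotes and an encoding attribute, so the first line will not be byte-identical to the original <?xml version="1.0"?>, although it is equivalent to an XML parser.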

Consider a parameterized XSLT script using Python's third-party module, lxml.

XSLT (save as .xsl file, a special .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" omit-xml-declaration="no"/>
  <xsl:strip-space elements="*"/>

  <!-- INITIALIZE PARAMETERS -->
  <xsl:param name="new_author" /> 
  <xsl:param name="new_price" /> 

  <!-- IDENTITY TRANSFORM -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- REWRITE NODE TEXT -->
  <xsl:template match="author">
    <xsl:copy>
      <xsl:value-of select="$new_author"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="price">
    <xsl:copy>
      <xsl:value-of select="$new_price"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

Python (no for loops or if/else logic)

import lxml.etree as et

# LOAD XML AND XSL SCRIPT
xml = et.parse('Input.xml')
xsl = et.parse('Script.xsl')

# INITIALIZE TRANSFORMER
transform = et.XSLT(xsl)

# PASS PARAMETER TO XSLT
result = transform(xml, 
                   new_author = et.XSLT.strparam('Smith, Paul'),
                   new_price = et.XSLT.strparam('79.99'))

print(result)
# <?xml version="1.0"?>
# <catalog>
#   <book id="bk103">
#     <author>Smith, Paul</author>
#     <title>Maeve Ascendant</title>
#     <genre>Fantasy</genre>
#     <price>79.99</price>
#     <publish_date>2000-11-17</publish_date>
#     <description>After the collapse of a nanotechnology 
#               society in England, the young survivors lay the 
#               foundation for a new society.</description>
#   </book>
# </catalog>

# SAVE XML TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)


2 Comments

Hi @Parfait, thank you, this looks interesting. I used "my_dir" from the initial post, as in xml = et.parse(my_dir). Unfortunately this didn't work. Any help on what I should be doing?
et.parse is used on a specific file, not a directory. You will need to wrap this solution in a loop that passes each XML file in the directory to it.
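
One way that loop could be sketched is shown below, assuming lxml is installed and the stylesheet from this answer has been saved to a file. The function name transform_directory, the in-place overwrite, and the *.xml filter are assumptions for illustration rather than part of the original answer.

```python
from pathlib import Path

import lxml.etree as et  # third-party package: pip install lxml


def transform_directory(directory, xsl_path, new_author, new_price):
    """Run the parameterized stylesheet over every *.xml file, in place."""
    # Compile the stylesheet once and reuse it for every file.
    transform = et.XSLT(et.parse(str(xsl_path)))
    updated = []
    for path in sorted(Path(directory).glob('*.xml')):
        result = transform(et.parse(str(path)),
                           new_author=et.XSLT.strparam(new_author),
                           new_price=et.XSLT.strparam(new_price))
        # bytes(result) serializes according to the stylesheet's xsl:output.
        path.write_bytes(bytes(result))
        updated.append(path.name)
    return updated

# Hypothetical usage; substitute your own directory and stylesheet path:
# transform_directory(r'C:\path\to\xml_files', 'Script.xsl', 'Smith, Paul', '79.99')
```

Because the glob pattern only matches *.xml, a stylesheet saved with the .xsl extension in the same directory is left untouched.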
