How to remove several lines in xml file and then save it in Python

Question

I want to remove all lines that contain any of the words in the 'xml_lines' list. I created this script:

from pathlib import Path

# Provide relative or absolute file path to your xml file
filename = './.content.xml'
path = Path(filename)

contents = path.read_text()

xml_lines = [
    'first',
    'second',
]

lines = contents.splitlines()

removed_lines = 0

for line in lines:
    for xml_line in xml_lines:
        if xml_line in line:
            lines.remove(line)
            removed_lines += 1
            print(f'Line: "{line.strip()}" has been removed!')

print(f"\n\n{removed_lines} lines have been removed!")

path.write_text(str(lines))

At the end I have a file that does not look like XML. Can anyone help?

Example (before):

<?xml version="1.0"?>
<data>
    <country
        name="Liechtenstein"
        first="2d2md"
        second="m3d39d93">
            <rank updated="yes">2</rank>
            <year>2008</year>
            <gdppc>141100</gdppc>
            <neighbor name="Austria" direction="E"/>
            <neighbor name="Switzerland" direction="W"/>
    </country>
    <tiger
        name="Singapore"
        first="hfdfherbre"
        second="m3d39d93">
            <rank updated="yes">5</rank>
            <year>2011</year>
            <gdppc>59900</gdppc>
            <neighbor name="Malaysia" direction="N"/>
    </tiger>
    <car
        name="Panama"
        first="th54b4"
        second="45b45gt45h">
            <rank updated="yes">69</rank>
            <year>2011</year>
            <gdppc>13600</gdppc>
            <neighbor name="Costa Rica" direction="W"/>
            <neighbor name="Colombia" direction="E"/>
    </car>
</data>

If the script finds any line that contains 'first' or 'second', the entire line should be removed:

<?xml version="1.0"?>
<data>
    <country
        name="Liechtenstein"
        >
            <rank updated="yes">2</rank>
            <year>2008</year>
            <gdppc>141100</gdppc>
            <neighbor name="Austria" direction="E"/>
            <neighbor name="Switzerland" direction="W"/>
    </country>
    <tiger
        name="Singapore"
        >
            <rank updated="yes">5</rank>
            <year>2011</year>
            <gdppc>59900</gdppc>
            <neighbor name="Malaysia" direction="N"/>
    </tiger>
    <car
        name="Panama"
        >
            <rank updated="yes">69</rank>
            <year>2011</year>
            <gdppc>13600</gdppc>
            <neighbor name="Costa Rica" direction="W"/>
            <neighbor name="Colombia" direction="E"/>
    </car>
</data>

This is only an example; the entire XML file consists of 9999999 lines...

If you remove arbitrary lines from an XML document, it's highly likely that you'll corrupt it. You need to use something that understands XML (e.g., xml.etree), remove the element(s) from the document using the appropriate functions from that module, and then rewrite the file. Also, never modify a list while you're iterating over it (unless you like surprises). Give an example of your XML document and what you want to remove.
– jackal, Commented May 29, 2023 at 16:04
Show an example of the XML you want to modify. Generally in XML there's no such thing as "lines"; you might want to remove nodes with a certain name, attribute or value, e.g. <first first='first'>first</first>. XML node and attribute names are case-sensitive, as are values.
– Pawel, Commented May 29, 2023 at 16:16
You can look here: stackoverflow.com/questions/3593204/… I think this will do what you need.
– user1200296, Commented May 29, 2023 at 17:15
Avoid treating XML as a text file. See "What's so bad about building XML with string concatenation?" Use compliant DOM libraries like Python's etree or lxml.
– Parfait, Commented May 29, 2023 at 17:38
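
The two pitfalls mentioned in the comments can be seen in isolation. The sketch below uses a made-up four-item list rather than the asker's file. It shows that removing items from a list while looping over it skips the item that follows each removed one, and that str() on a list yields a Python list representation rather than file text. Building a new list and joining it with newlines avoids both problems. It still treats XML as plain text, so it only illustrates the bugs and is not a recommended way to edit XML.

```python
words = ["first", "second"]
lines = ['name="a"', 'first="1"', 'second="2"', 'year="3"']

# Buggy: removing during iteration shifts the list, so the item after
# each removed one is never visited.
buggy = list(lines)
for line in buggy:
    if any(word in line for word in words):
        buggy.remove(line)

# Safer: build a new list instead of mutating the one being iterated.
kept = [line for line in lines if not any(word in line for word in words)]

# str(list) gives a Python repr; "\n".join gives text suitable for a file.
as_repr = str(kept)
as_text = "\n".join(kept)

print(buggy)    # 'second="2"' is still present
print(as_repr)  # starts with "[" and contains quotes and commas
print(as_text)  # one item per line
```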

Parfait · Accepted Answer · 2023-05-30 11:50:38Z

1

Consider XSLT, the special-purpose language designed to transform XML files. Specifically, an identity template combined with an empty template can remove the unwanted attributes across the entire document without a single for loop. Python's third-party lxml package can run XSLT 1.0 scripts.

XSLT (save as .xsl file, a special XML file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="utf-8" indent="yes"/>
    <xsl:strip-space elements="*"/>
    
    <!-- IDENTITY TRANSFORM -->
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- EMPTY TEMPLATE TO REMOVE CONTENT -->
    <xsl:template match="@first|@second"/>
</xsl:stylesheet>


Python

import lxml.etree as lx

# PARSE XML AND XSLT
doc = lx.parse("Input.xml")
style = lx.parse("Style.xsl")

# CONFIGURE AND RUN TRANSFORMER
transformer = lx.XSLT(style)
result = transformer(doc)

# OUTPUT TO FILE
result.write_output("Output.xml")

edited May 30, 2023 at 11:50

answered May 29, 2023 at 20:49

Parfait


2 Comments

Mag Over a year ago

I have this kind of error: lxml.etree.XSLTParseError: xsltCompilePattern : failed to compile '@first'

Parfait Over a year ago

Hmmm... I tested your exact posted XML with my XSLT and did not get any lxml error. Which Python version are you running (import sys; print(sys.version)), and which lxml version (print(lxml.__version__))?

Jack Fleeting · Accepted Answer · 2023-05-29 22:26:15Z

0

You could do something simple along the lines described in this answer, basically using XPath and lxml (there may be other ways to do the same):

from lxml import etree
doc = etree.parse("your xml file")

to_drop = ["first","second"]
for td in to_drop:
    for target in doc.xpath('//*'):
        target.attrib.pop(td, None)
print(etree.tostring(doc).decode())

Output should be your expected output.
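
If lxml is not installed, roughly the same idea can be sketched with the standard library's xml.etree.ElementTree, which the comments on the question also suggest. This is a minimal illustration under some assumptions: the sample string is a shortened version of the question's XML, and output.xml in the final comment is a placeholder file name.

```python
import xml.etree.ElementTree as ET

# Sample input, shortened from the question's example.
xml_text = """<data>
    <country name="Liechtenstein" first="2d2md" second="m3d39d93">
        <rank updated="yes">2</rank>
    </country>
    <car name="Panama" first="th54b4" second="45b45gt45h">
        <rank updated="yes">69</rank>
    </car>
</data>"""

to_drop = ["first", "second"]

root = ET.fromstring(xml_text)

# iter() walks the root and every descendant element.
for element in root.iter():
    for name in to_drop:
        # pop() with a default does nothing when the attribute is absent.
        element.attrib.pop(name, None)

cleaned = ET.tostring(root, encoding="unicode")
print(cleaned)

# To save instead of printing (output.xml is a placeholder name):
# ET.ElementTree(root).write("output.xml", encoding="utf-8", xml_declaration=True)
```

Because the whole tree is held in memory, this variant suits small and medium files; a file with millions of lines is better handled by a streaming parser.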

answered May 29, 2023 at 22:26

Jack Fleeting


Hermann12 · Accepted Answer · 2023-05-31 09:47:46Z

For huge XML files you can use iterparse() and remove the attributes as the file is parsed:

import xml.etree.ElementTree as ET

filename = "outfile.xml"
attrib_list = ['first', 'second']

with open(filename, 'wb') as out:
    # The root element is written by hand so its children can be streamed
    out.write(b'<?xml version="1.0"?>\n<data>\n')

    depth = 0
    for event, elem in ET.iterparse("pop_del.xml", events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        # On "end" the element is complete, so its attributes can be dropped
        for key in attrib_list:
            elem.attrib.pop(key, None)
        # depth == 1 means a direct child of <data>: write it, then free it
        if depth == 1:
            elem.tail = None  # the tail may not have been parsed yet
            out.write(ET.tostring(elem) + b'\n')
            elem.clear()

    out.write(b'</data>')

Output:

<?xml version="1.0"?>
<data>
  <country name="Liechtenstein">
    <rank updated="yes">2</rank>
    <year>2008</year>
    <gdppc>141100</gdppc>
    <neighbor name="Austria" direction="E" />
    <neighbor name="Switzerland" direction="W" />
  </country>
  <tiger name="Singapore">
    <rank updated="yes">5</rank>
    <year>2011</year>
    <gdppc>59900</gdppc>
    <neighbor name="Malaysia" direction="N" />
  </tiger>
  <car name="Panama">
    <rank updated="yes">69</rank>
    <year>2011</year>
    <gdppc>13600</gdppc>
    <neighbor name="Costa Rica" direction="W" />
    <neighbor name="Colombia" direction="E" />
  </car>
</data>

You can use pop() or del to remove an attribute from an element.
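
As a small illustration of the difference between the two, the sketch below works on a single made-up element. In ElementTree, elem.attrib is a regular dict, so del raises KeyError for a missing attribute, while pop() with a default value does not.

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring('<car name="Panama" first="th54b4" second="45b45gt45h"/>')

# del removes an attribute that is known to exist.
del elem.attrib["first"]

# pop() removes the attribute and returns its value.
removed = elem.attrib.pop("second")

# With a default, pop() is safe for attributes that may be missing.
missing = elem.attrib.pop("third", None)

# del on a missing attribute raises KeyError.
try:
    del elem.attrib["third"]
    raised = False
except KeyError:
    raised = True

print(elem.attrib)
```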
