-1

I want to remove every line that contains any of the words in the 'xml_lines' list. I created this script:

from pathlib import Path

# Provide relative or absolute file path to your xml file
filename = './.content.xml'
path = Path(filename)

contents = path.read_text()

xml_lines = [
    'first',
    'second',
]

lines = contents.splitlines()

removed_lines = 0

for line in lines:
    for xml_line in xml_lines:
        if xml_line in line:
            lines.remove(line)
            removed_lines += 1
            print(f'Line: "{line.strip()}" has been removed!')

print(f"\n\n{removed_lines} lines have been removed!")

path.write_text(str(lines))

At the end I have a file that does not look like XML. Can anyone help?

Example (before):

<?xml version="1.0"?>
<data>
    <country
        name="Liechtenstein"
        first="2d2md"
        second="m3d39d93">
            <rank updated="yes">2</rank>
            <year>2008</year>
            <gdppc>141100</gdppc>
            <neighbor name="Austria" direction="E"/>
            <neighbor name="Switzerland" direction="W"/>
    </country>
    <tiger
        name="Singapore"
        first="hfdfherbre"
        second="m3d39d93">
            <rank updated="yes">5</rank>
            <year>2011</year>
            <gdppc>59900</gdppc>
            <neighbor name="Malaysia" direction="N"/>
    </tiger>
    <car
        name="Panama"
        first="th54b4"
        second="45b45gt45h">
            <rank updated="yes">69</rank>
            <year>2011</year>
            <gdppc>13600</gdppc>
            <neighbor name="Costa Rica" direction="W"/>
            <neighbor name="Colombia" direction="E"/>
    </car>
</data>

If the script finds any line that contains 'first' or 'second', the entire line should be removed (after):

<?xml version="1.0"?>
<data>
    <country
        name="Liechtenstein"
        >
            <rank updated="yes">2</rank>
            <year>2008</year>
            <gdppc>141100</gdppc>
            <neighbor name="Austria" direction="E"/>
            <neighbor name="Switzerland" direction="W"/>
    </country>
    <tiger
        name="Singapore"
        >
            <rank updated="yes">5</rank>
            <year>2011</year>
            <gdppc>59900</gdppc>
            <neighbor name="Malaysia" direction="N"/>
    </tiger>
    <car
        name="Panama"
        >
            <rank updated="yes">69</rank>
            <year>2011</year>
            <gdppc>13600</gdppc>
            <neighbor name="Costa Rica" direction="W"/>
            <neighbor name="Colombia" direction="E"/>
    </car>
</data>

This is only an example; the entire XML file consists of 9999999 lines...

9
  • 2
    If you remove arbitrary lines from an XML document it's highly likely that you'll corrupt it. You need to use something that understands XML (e.g., xml.etree) then remove the element(s) from the document using appropriate functions from that module. Then rewrite the file. Also, never modify a list while you're iterating over it (unless you like surprises). Give an example of your XML document and what you want to remove Commented May 29, 2023 at 16:04
  • 1
    Show an example XML you want to modify. Generally in XML there's no such thing as "lines"; you might want to remove nodes with a certain name, attribute or value. E.g. <first first='first'>first</first>. XML node and attribute names are case-sensitive, and so are values. Commented May 29, 2023 at 16:16
  • 1
    It is better to use XSLT for the task. Are you open to it? Commented May 29, 2023 at 17:03
  • 1
    You can look here: stackoverflow.com/questions/3593204/… I think this will do what you need. Commented May 29, 2023 at 17:15
  • 1
    Avoid treating XML as a text file. See What's so bad about building XML with string concatenation? Use compliant DOM libraries like Python's etree or lxml. Commented May 29, 2023 at 17:38
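
To make the first comment concrete, here is a minimal sketch of the two pitfalls in the question's script: removing items from a list while looping over it, and writing `str(lines)` to the file. The sample lines and the names `buggy`, `kept`, `as_repr` and `as_text` are made up for illustration. Filtering text lines like this is still fragile for XML in general; it only shows why the output "does not look like xml".

```python
# Hypothetical sample: one element spread over several lines
lines = ['<a', '  first="1"', '  second="2"', '  name="x">', '</a>']
words = ['first', 'second']

# Pitfall 1: removing from a list while iterating over it skips the item
# that slides into the freed position, so the 'second' line survives.
buggy = list(lines)
for line in buggy:
    if any(w in line for w in words):
        buggy.remove(line)

# Building a new list avoids mutating the one being iterated.
kept = [line for line in lines if not any(w in line for w in words)]

# Pitfall 2: str(list) is the Python repr of the list, not file content.
as_repr = str(kept)
# Joining with newlines gives back plain text lines.
as_text = "\n".join(kept) + "\n"

print(buggy)
print(as_repr)
print(as_text)
```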

3 Answers

1

Consider XSLT, the special-purpose language designed to transform XML files. Specifically, an identity template plus an empty template can remove the unwanted attributes across the entire document without a single for loop. Python's third-party lxml package can run XSLT 1.0 scripts.

XSLT (save as .xsl file, a special XML file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="utf-8" indent="yes"/>
    <xsl:strip-space elements="*"/>
    
    <!-- IDENTITY TRANSFORM -->
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- EMPTY TEMPLATE TO REMOVE CONTENT -->
    <xsl:template match="@first|@second"/>
</xsl:stylesheet>


Python

import lxml.etree as lx

# PARSE XML AND XSLT
doc = lx.parse("Input.xml")
style = lx.parse("Style.xsl")

# CONFIGURE AND RUN TRANSFORMER
transformer = lx.XSLT(style)
result = transformer(doc)

# OUTPUT TO FILE
result.write_output("Output.xml")

2 Comments

I have this kind of error: lxml.etree.XSLTParseError: xsltCompilePattern : failed to compile '@first'
Hmmm... I tested your exact posted XML with my XSLT and did not get any lxml error. Which Python version are you running (import sys; print(sys.version)) and which lxml version (print(lxml.__version__))?
0

You could do something simple along the lines described in this answer, basically using xpath and lxml (and there may be other ways to do the same):

from lxml import etree
doc = etree.parse("your xml file")

to_drop = ["first","second"]
for td in to_drop:
    for target in doc.xpath('//*'):
        target.attrib.pop(td, None)
print(etree.tostring(doc).decode())

Output should be your expected output.
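
If lxml is not installed, roughly the same idea can be sketched with the standard library's xml.etree.ElementTree. This is only an illustration under assumptions: the helper name `strip_attributes` and the shortened `SAMPLE` document are made up here, and the whole tree is held in memory, so it is not meant for very large files.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the question's document
SAMPLE = """<data>
    <country name="Liechtenstein" first="2d2md" second="m3d39d93">
        <rank updated="yes">2</rank>
    </country>
    <car name="Panama" first="th54b4" second="45b45gt45h">
        <rank updated="yes">69</rank>
    </car>
</data>"""


def strip_attributes(root, names):
    """Remove the given attribute names from root and all descendants, in place.

    Returns how many attributes were removed.
    """
    removed = 0
    for elem in root.iter():  # root itself plus every descendant
        for name in names:
            # pop() with a default does not raise if the attribute is absent
            if elem.attrib.pop(name, None) is not None:
                removed += 1
    return removed


root = ET.fromstring(SAMPLE)
removed = strip_attributes(root, ["first", "second"])
cleaned = ET.tostring(root, encoding="unicode")
print(removed)
print(cleaned)
```

For a file on disk, ET.parse(path) followed by tree.write(path) would replace the fromstring/tostring pair.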


0

For huge xml files you can use iterparse() and manipulate the attribute values:

import xml.etree.ElementTree as ET

infile = "pop_del.xml"
outfile = "outfile.xml"
attrib_list = ['first', 'second']

with open(outfile, 'wb') as out:
    # The root element <data> is written by hand, as in the example document
    out.write(b'<?xml version="1.0"?>\n<data>\n')
    depth = 0
    for event, elem in ET.iterparse(infile, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        # An element is only guaranteed to be complete at its "end" event,
        # so the unwanted attributes are dropped here
        for key in attrib_list:
            elem.attrib.pop(key, None)
        # depth == 1 means a direct child of the root has just been closed
        if depth == 1:
            out.write(ET.tostring(elem))
            elem.clear()  # release the finished subtree to keep memory low
    out.write(b'</data>\n')

Output:

<?xml version="1.0"?>
<data>
  <country name="Liechtenstein">
    <rank updated="yes">2</rank>
    <year>2008</year>
    <gdppc>141100</gdppc>
    <neighbor name="Austria" direction="E" />
    <neighbor name="Switzerland" direction="W" />
  </country>
  <tiger name="Singapore">
    <rank updated="yes">5</rank>
    <year>2011</year>
    <gdppc>59900</gdppc>
    <neighbor name="Malaysia" direction="N" />
  </tiger>
  <car name="Panama">
    <rank updated="yes">69</rank>
    <year>2011</year>
    <gdppc>13600</gdppc>
    <neighbor name="Costa Rica" direction="W" />
    <neighbor name="Colombia" direction="E" />
  </car>
</data>

You can use pop() or del to remove an attribute from an element.
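
A small sketch of the difference between the two, using a made-up one-element document: pop() with a default silently ignores a missing attribute, while del raises KeyError, which matters when not every element carries both attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical element that has 'first' but no 'second'
elem = ET.fromstring('<country name="Liechtenstein" first="2d2md"/>')

elem.attrib.pop("first", None)   # present: removed
elem.attrib.pop("second", None)  # absent: no error thanks to the default

try:
    del elem.attrib["second"]    # absent: del raises KeyError
    raised = False
except KeyError:
    raised = True

print(elem.attrib, raised)
```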
