How to delete parts of XML data and write it to a new file with Python

Question

I have a data structure like the following. The input file is quite large, so I am looking for an efficient method.

<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2">
        <orth>some text 1</orth>
    </segment>
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
    <segment name="3" start="4" end="6">
        <orth>some text 3</orth>
    </segment>
  </recording>
</corpus>

Given an input file containing a list of segment names, such as

1
3

the script should remove the segments that have those names. For example, given 1 and 3, the segments named 1 and 3 are removed:

<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
  </recording>
</corpus>

The code I have so far:

from lxml import etree

with open("g.xml", "r") as xml_file:
    xml_data = xml_file.read()

with open('del_names.txt', 'r') as file:
    list_of_names = file.read().split("\n")

new_xml = xml_data
for each_name in list_of_names:
    print(each_name)
    tree = etree.XML(new_xml.encode())
    find_segments = tree.xpath("*//segment[@name='{}']".format(each_name))
    for each_segment in find_segments:
        each_segment.getparent().remove(each_segment)
    new_xml = str(etree.tostring(tree, pretty_print=True, xml_declaration=True), encoding="utf-8")

print(new_xml)

The problem is that the code has now been running for 2 hours and has not output a single line. I am not sure how to do this efficiently.

How do I accomplish this? I also think having 2 might be unnecessary; is that correct?

"having 2" - do you mean "having 2 loops"?

– mkrieger1, Commented Jan 24, 2021 at 21:46

Lydia van Dyke · Accepted Answer · 2021-01-24 22:13:27Z

1

If your code takes longer than expected, you can always start with some print statements to get a better idea of where the time is spent.

For your task a single loop should suffice. Iterate over all 'segment' elements in the xml file. When a segment's name is included in the del_names.txt file, delete it.

To make lookup for names faster, I convert the list of names to a set.
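As a rough illustration of the set idea on its own: membership tests against a set take roughly constant time on average, while a list is scanned item by item. The sketch below is only an assumption about how the names file might be read; the helper name `load_names` is made up, and skipping blank lines (such as a trailing newline at the end of the file) is a choice the original code does not make.

```python
import io

def load_names(lines):
    """Collect the non-blank lines of an iterable into a set of stripped names."""
    return {line.strip() for line in lines if line.strip()}

# io.StringIO stands in for open('del_names.txt') so the sketch is self-contained.
names_file = io.StringIO("1\n3\n\n")
names_to_delete = load_names(names_file)

print(sorted(names_to_delete))  # ['1', '3']
print("2" in names_to_delete)   # False
```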

from lxml import etree

with open("g.xml", "r") as xml_file:
    xml_data = xml_file.read()
print("read xml data")

with open('del_names.txt', 'r') as file:
    names_to_delete = set(file.read().split("\n"))
print("read text data")

new_xml = xml_data
tree = etree.XML(new_xml.encode())

for segment in tree.xpath("*//segment"):
    name = segment.attrib.get("name")
    if name in names_to_delete:
        print(f"will delete segment '{name}'")
        segment.getparent().remove(segment)

print(" result ".center(80, "="))

new_xml = str(etree.tostring(tree, encoding="unicode", pretty_print=True))
print(new_xml)

Output:

read xml data
read text data
will delete segment '1'
will delete segment '3'
==================================== result ====================================
<corpus name="corpus">
    <recording audio="audio.wav" name="first audio">
        <segment name="2" start="2" end="4">
            <orth>some text 2</orth>
        </segment>
    </recording>
</corpus>
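The question title also asks about writing the result to a new file, which the code above only prints. Here is a minimal sketch of that step under a few assumptions. It uses the standard library's xml.etree.ElementTree instead of lxml, purely so that it runs without extra packages. ElementTree elements have no getparent(), so the sketch visits each element and removes matching children from it. The function name `remove_segments` and the output file name are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def remove_segments(src_path, dst_path, names_to_delete):
    """Write a copy of src_path to dst_path without the listed <segment> elements."""
    # Parsing straight from the file lets the parser honour the declared encoding.
    tree = ET.parse(src_path)
    removed = 0
    # Materialise the element list first so removals cannot disturb the traversal.
    for parent in list(tree.getroot().iter()):
        for child in list(parent):
            if child.tag == "segment" and child.get("name") in names_to_delete:
                parent.remove(child)
                removed += 1
    tree.write(dst_path, encoding="UTF-8", xml_declaration=True)
    return removed

SAMPLE = """<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2"><orth>some text 1</orth></segment>
    <segment name="2" start="2" end="4"><orth>some text 2</orth></segment>
    <segment name="3" start="4" end="6"><orth>some text 3</orth></segment>
  </recording>
</corpus>
"""

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "g.xml")
    dst = os.path.join(tmp, "g_filtered.xml")  # hypothetical output name
    with open(src, "w", encoding="utf-8") as f:
        f.write(SAMPLE)
    removed_count = remove_segments(src, dst, {"1", "3"})
    remaining = [s.get("name") for s in ET.parse(dst).getroot().iter("segment")]

print(removed_count, remaining)  # 2 ['2']
```

With lxml, the same step could be done on the `tree` element from the answer with something like `tree.getroottree().write("out.xml", encoding="UTF-8", xml_declaration=True)`.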

edited Jan 24, 2021 at 22:13

answered Jan 24, 2021 at 22:04

Lydia van Dyke


2 Comments

Joseph Kars Over a year ago

Thanks for your answer! It works perfectly, except some utf8 problems. For example, dáár turns into dáár. How should I fix this?

Lydia van Dyke Over a year ago

Please try my updated answer. new_xml = str(etree.tostring(tree, encoding="unicode", pretty_print=True)) should do the trick.
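Some background on why the encoding argument matters: by default, tostring serializes to ASCII bytes and writes non-ASCII characters as numeric character references, while encoding="unicode" returns a Python str with the characters intact. The sketch below shows this with the standard library's xml.etree.ElementTree, used here only so the example runs without lxml installed; lxml's etree.tostring also defaults to ASCII.

```python
import xml.etree.ElementTree as ET

orth = ET.fromstring("<orth>dáár</orth>")

ascii_bytes = ET.tostring(orth)                     # default: ASCII bytes
as_text = ET.tostring(orth, encoding="unicode")     # str, characters kept as-is
utf8_bytes = ET.tostring(orth, encoding="utf-8")    # bytes encoded as UTF-8

print(ascii_bytes)  # b'<orth>d&#225;&#225;r</orth>'
print(as_text)      # <orth>dáár</orth>
print("dáár" in utf8_bytes.decode("utf-8"))
```

When the goal is a file rather than a printed string, one option with lxml is to serialize with `etree.tostring(tree, encoding="UTF-8", xml_declaration=True)` and write the resulting bytes to a file opened in "wb" mode.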

Jakub Szlaur · Accepted Answer · 2021-01-24 22:08:20Z

0

You can also use BeautifulSoup:

from bs4 import BeautifulSoup

my_string = """ <?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2">
        <orth>some text 1</orth>
    </segment>
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
    <segment name="3" start="4" end="6">
        <orth>some text 3</orth>
    </segment>
  </recording>
</corpus> """

soup = BeautifulSoup(my_string, 'html.parser')
ids = [1,3] #IDs to delete

for id in ids:
    elements = soup.find_all("segment", attrs = {"name": str(id)})
    for element in elements:
        element.decompose()
    
print(soup.prettify())

answered Jan 24, 2021 at 22:08

Jakub Szlaur


1 Comment

Jakub Szlaur Over a year ago

If the answer helped you in some way please consider giving it +1! :)
