I have a data structure like the following. The input file is quite large, so I am trying to find an efficient method.

<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2">
        <orth>some text 1</orth>
    </segment>
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
    <segment name="3" start="4" end="6">
        <orth>some text 3</orth>
    </segment>
  </recording>
</corpus>

Given an input file containing a list of segment names, such as

1
3

it should remove the segments that have those names. For example, if 1 and 3 are given, the segments named 1 and 3 are removed:

<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
  </recording>
</corpus>

The code I have so far:

from lxml import etree

with open("g.xml", "r") as xml_file:
    xml_data = xml_file.read()

with open('del_names.txt', 'r') as file:
    list_of_names = file.read().split("\n")

new_xml = xml_data
for each_name in list_of_names:
    print(each_name)
    tree = etree.XML(new_xml.encode())
    find_segments = tree.xpath("*//segment[@name='{}']".format(each_name))
    for each_segment in find_segments:
        each_segment.getparent().remove(each_segment)
    new_xml = str(etree.tostring(tree, pretty_print=True, xml_declaration=True), encoding="utf-8")

print(new_xml)

The problem is that the code has now been running for 2 hours and has not produced any output. I am not sure how to do this efficiently.

How do I accomplish this? I also think having 2 might be unnecessary; is that correct?

  • "having 2" - do you mean "having 2 loops"? Commented Jan 24, 2021 at 21:46

2 Answers

If your code takes longer than expected, you can always start with some print statements to get a better idea of where the time is spent.

For your task a single loop should suffice. Iterate over all 'segment' elements in the xml file. When a segment's name is included in the del_names.txt file, delete it.

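To make name lookups faster, I convert the list of names to a set: a membership test on a set takes constant time on average, whereas on a list it scans the entries one by one.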

from lxml import etree

with open("g.xml", "r") as xml_file:
    xml_data = xml_file.read()
print("read xml data")

with open('del_names.txt', 'r') as file:
    names_to_delete = set(file.read().split("\n"))
print("read text data")

new_xml = xml_data
tree = etree.XML(new_xml.encode())

for segment in tree.xpath("*//segment"):
    name = segment.attrib.get("name")
    if name in names_to_delete:
        print(f"will delete segment '{name}'")
        segment.getparent().remove(segment)

print(" result ".center(80, "="))

new_xml = str(etree.tostring(tree, encoding="unicode", pretty_print=True))
print(new_xml)

Output:

read xml data
read text data
will delete segment '1'
will delete segment '3'
==================================== result ====================================
<corpus name="corpus">
    <recording audio="audio.wav" name="first audio">
        <segment name="2" start="2" end="4">
            <orth>some text 2</orth>
        </segment>
    </recording>
</corpus>
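
For a large file it may also help to parse straight from disk instead of reading the whole document into a string and re-encoding it. Below is a minimal sketch of the same single-pass idea, assuming the standard library's xml.etree.ElementTree in place of lxml. ElementTree has no getparent(), so the sketch builds a parent map first. The helper names load_names and remove_segments and the inline sample are hypothetical, chosen only for illustration.

```python
import io
import xml.etree.ElementTree as ET


def load_names(lines):
    """Return the non-empty, stripped names as a set for fast membership tests."""
    return {line.strip() for line in lines if line.strip()}


def remove_segments(source, names_to_delete):
    """Parse `source` (a path or binary file object) and drop matching <segment> elements."""
    tree = ET.parse(source)
    root = tree.getroot()
    # ElementTree has no getparent(), so record each element's parent up front.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    # Materialise the list first so removals do not disturb the iteration.
    for segment in list(root.iter("segment")):
        if segment.get("name") in names_to_delete:
            parent_of[segment].remove(segment)
    return tree


# Inline stand-in for g.xml (hypothetical sample mirroring the question).
sample = b"""<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2"><orth>some text 1</orth></segment>
    <segment name="2" start="2" end="4"><orth>some text 2</orth></segment>
    <segment name="3" start="4" end="6"><orth>some text 3</orth></segment>
  </recording>
</corpus>
"""

# Stand-in for the lines of del_names.txt, including a trailing blank line.
names = load_names(["1\n", "3\n", "\n"])
tree = remove_segments(io.BytesIO(sample), names)
remaining = [seg.get("name") for seg in tree.getroot().iter("segment")]
print(remaining)

# Serialise as UTF-8 bytes with an XML declaration.
buffer = io.BytesIO()
tree.write(buffer, encoding="utf-8", xml_declaration=True)
```

With real files, passing the path of g.xml to remove_segments and an output path to tree.write should work the same way, since both accept either a filename or a file object.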

2 Comments

Thanks for your answer! It works perfectly, except some utf8 problems. For example, dáár turns into d&#225;&#225;r. How should I fix this?
Please try my updated answer. new_xml = str(etree.tostring(tree, encoding="unicode", pretty_print=True)) should do the trick.
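
To illustrate the encoding issue from the comments above: when no encoding is given, tostring serialises to ASCII bytes and has to write non-ASCII characters as numeric character references. A small sketch, assuming the standard library's xml.etree.ElementTree as a stand-in (lxml's tostring also defaults to ASCII):

```python
import xml.etree.ElementTree as ET

word = "d\u00e1\u00e1r"  # the word "dáár" from the comment, written with escapes
element = ET.fromstring(f"<orth>{word}</orth>")

# Default: ASCII bytes, so each á becomes the character reference &#225;
ascii_bytes = ET.tostring(element)

# UTF-8 bytes: the characters are kept and encoded as UTF-8
utf8_bytes = ET.tostring(element, encoding="utf-8")

# "unicode": returns a Python str instead of bytes, characters kept as-is
text = ET.tostring(element, encoding="unicode")

print(ascii_bytes)
print(text)
```

The character references are still valid XML and parse back to the same text, so the difference matters mainly for readability of the output file.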

You can also use BeautifulSoup:

from bs4 import BeautifulSoup

my_string = """ <?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2">
        <orth>some text 1</orth>
    </segment>
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
    <segment name="3" start="4" end="6">
        <orth>some text 3</orth>
    </segment>
  </recording>
</corpus> """

soup = BeautifulSoup(my_string, 'html.parser')
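names_to_delete = [1, 3]  # segment names to delete

for name in names_to_delete:
    # attribute values are strings in the XML, so convert before matching
    elements = soup.find_all("segment", attrs={"name": str(name)})
    for element in elements:
        # decompose() removes the tag and its contents from the tree
        element.decompose()

print(soup.prettify())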

1 Comment

If the answer helped you in some way please consider giving it +1! :)
