I'm trying to open an XML file and parse it, looking through its tags and reading the text within each specific tag. If the text within a tag matches a string, I want to remove part of the string or substitute it with something else.

My question: I'm not sure whether start = x.find('start_char').text is actually getting the text inside the start_char tag and saving it to the start variable. (Does x.find('tag_name').text actually get the text inside the tag?)

The XML file has the following data:

<?xml version="1.0" encoding="utf-8"?>
<metadata>
    <filter>
        <regex>ATL|LAX|DFW</regex >
        <start_char>3</start_char>
        <end_char></end_char>
        <action>remove</action>
    </filter>
    <filter>
        <regex>DFW.+\.$</regex >
        <start_char>3</start_char>
        <end_char>-1</end_char>
        <action>remove</action>
    </filter>
    <filter>
        <regex>\-</regex >
        <replacement></replacement>
        <action>substitute</action>
    </filter>
    <filter>
        <regex>\s</regex >
        <replacement></replacement>
        <action>substitute</action>
    </filter>
    <filter>
        <regex> T&amp;R$</regex >
        <start_char></start_char>
        <end_char>-4</end_char>
        <action>remove</action>
    </filter>
</metadata>

The Python code I'm using is:

from xml.etree.ElementTree import ElementTree    

# filters.xml is the file that holds the things to be filtered
tree = ElementTree()
tree.parse("filters.xml")

# Get the data in the XML file 
root = tree.getroot()

# Loop through filters
for x in root.findall('filter'):

    # Find the text inside the regex tag
    regex = x.find('regex').text

    # Find the text inside the start_char tag
    start = x.find('start_char').text

    # Find the text inside the end_char tag
    end = x.find('end_char').text

    # Find the text inside the replacement tag
    #replace = x.find('replacement')

    # Find the text inside the action tag
    action = x.find('action').text

    if action == 'remove':
        if re.match(r'regex', mfn_pn, re.IGNORECASE):
            mfn_pn = mfn_pn[start:end]

    elif action == 'substitute':
        mfn_pn = re.sub(r'regex', '', mfn_pn)

    return mfn_pn
  • What should the value of the mfn_pn variable be? Commented Dec 17, 2020 at 14:18
  • It would be a barcode entered by the user, something like ATL-157-1815 or DFW-184-8378. Commented Dec 17, 2020 at 14:22

1 Answer

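The code start = x.find('start_char').text works when the filter element has a start_char child. When it does not, find() returns None and the line raises AttributeError: 'NoneType' object has no attribute 'text'. Note also that an element that exists but is empty, such as <end_char></end_char>, has a text value of None.

The error can be avoided with, for example, the following approach: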

# find element
start_el = x.find('start_char')
# check if element exist and assign its text to the variable, None (or another default value) otherwise
start = start_el.text if start_el is not None else None

The same applies to the end variable.

Using this approach, the following start and end values are retrieved for the five filters in your example document:

3 None
3 -1
None None
None None
None -4
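
One further point: the values retrieved this way are strings (or None), so they cannot be used directly as slice bounds in mfn_pn[start:end]. The following is a minimal sketch of one way to convert them, assuming a missing or empty tag should mean "no bound". The helper name slice_bound and the shortened XML are illustrative only and not part of the original code; findtext() is the standard ElementTree method that returns None for a missing child and an empty string for an empty one.

```python
import xml.etree.ElementTree as ET

# Shortened version of the question's document (two filters only).
XML = r"""<metadata>
    <filter>
        <regex>ATL|LAX|DFW</regex>
        <start_char>3</start_char>
        <end_char></end_char>
        <action>remove</action>
    </filter>
    <filter>
        <regex>\-</regex>
        <replacement></replacement>
        <action>substitute</action>
    </filter>
</metadata>"""


def slice_bound(filter_el, tag):
    """Hypothetical helper: the tag's text as an int, or None if missing/empty."""
    text = filter_el.findtext(tag)  # None if the tag is missing, '' if it is empty
    return int(text) if text and text.strip() else None


root = ET.fromstring(XML)
bounds = [
    (slice_bound(f, "start_char"), slice_bound(f, "end_char"))
    for f in root.findall("filter")
]
print(bounds)  # [(3, None), (None, None)]
```

A bound of None behaves like an omitted bound in a slice, so "ATL-157"[3:None] is the same as "ATL-157"[3:].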

5 Comments

Awesome, thank you so much! Using "for x in root.findall('filter'):", is it actually looping through all the data in the XML file, or does it only look at the first "filter" tag?
findall() searches for all filter elements and iterates over them.
For some reason, it's not looping through all the filter elements for me. It only goes through what's in the first filter element and stops there.
That probably happens because of the return statement inside the loop.
I took the return statement out of the loop and placed it so it’s aligned with the for loop.
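
Putting the points from this thread together, a corrected loop might look like the sketch below. It is an illustration under several assumptions, not the asker's final code: the function name apply_filters and the helper _bound are hypothetical, the XML is a shortened copy of the question's file, and the pattern is passed as the regex variable rather than the literal r'regex' used in the question. The return statement sits after the loop so every filter is applied.

```python
import re
import xml.etree.ElementTree as ET


def _bound(filter_el, tag):
    """Hypothetical helper: the tag's text as an int, or None if missing/empty."""
    text = filter_el.findtext(tag)
    return int(text) if text and text.strip() else None


def apply_filters(mfn_pn, root):
    """Apply every <filter> under root to mfn_pn and return the result."""
    for f in root.findall("filter"):
        regex = f.findtext("regex")
        action = f.findtext("action")
        if not regex:
            continue  # nothing to match against
        if action == "remove":
            # re.match only matches at the start of the string, as in the question
            if re.match(regex, mfn_pn, re.IGNORECASE):
                mfn_pn = mfn_pn[_bound(f, "start_char"):_bound(f, "end_char")]
        elif action == "substitute":
            replacement = f.findtext("replacement") or ""
            mfn_pn = re.sub(regex, replacement, mfn_pn)
    return mfn_pn  # after the loop, so all filters run


FILTERS = ET.fromstring(r"""<metadata>
    <filter>
        <regex>ATL|LAX|DFW</regex>
        <start_char>3</start_char>
        <end_char></end_char>
        <action>remove</action>
    </filter>
    <filter>
        <regex>\-</regex>
        <replacement></replacement>
        <action>substitute</action>
    </filter>
</metadata>""")

print(apply_filters("ATL-157-1815", FILTERS))  # 1571815
```

Because re.match anchors at the beginning of the string, a pattern meant to match a suffix (such as one ending in $ with no leading wildcard) would likely need re.search instead; that choice depends on the intended behavior.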
