0

I'm new to Python and XML, and I'm trying to parse the file below to extract several elements. The issue is that some elements are missing (for example, customer xyz1 does not have any address information).

<CAT>
  <Header>...</Header>
  <Add>...</Add>
  <Customer>
    <Id_Customer>xyz1</Id_Customer>
    <Segment>abc1</Segment>
    <Event>
      <Nature>info1</Nature>
      <Extrainfo>info2</Extrainfo>
    </Event>
</Customer>
<Customer>
    <Id_Customer>zzwy</Id_Customer>
    <Segment>c2</Segment>
    <Adress>
       <zipcode>77098</zipcode>
       <street>belaire drive</street>
       <number>5</number>
    </Adress>
</Customer>
<Customer>...</Customer>
</CAT>

I'm looping through the following elements (Id_Customer, Segment, Extrainfo, zipcode, street) to build a list that I will then export to a .csv file.

My code below generates the following output: [xyz1, abc1, info2, zzwy, c2, ...], whereas I would like elements that are not found to be added to the list as "empty", so that the list contains: [xyz1, abc1, info2, empty, empty, zzwy, c2, ...]

Here is a sample of my code :

from xml.etree import ElementTree
import csv

list_prm = []

tree = ElementTree.parse('file.xml')
root = tree.getroot()

for elem in tree.iter():
    if elem.findall('Id_Customer'):
        list_prm.append(elem.text)
    if elem.tag == 'Segment':
        list_prm.append(elem.text)
    if elem.tag == 'Extrainfo':
        list_prm.append(elem.text)
    if elem.tag == 'street':
        list_prm.append(elem.text)
    if elem.tag == 'zipcode':
        list_prm.append(elem.text)


print(list_prm)

I would very much appreciate some help. (I can only use the Python standard library.)

2
  • By 'empty' do you mean None? Because there's no such thing as "empty". If you want your list to contain a placeholder, you have to decide what you want it to be - lists only contain objects, they don't know how to contain "nothing" Commented May 10, 2017 at 16:39
  • btw, the 'Adress' tag needs another 'd' Commented Jul 23, 2020 at 18:32

2 Answers

1

Your main problem is that you're copying data from the XML into the CSV in more or less the same state you found it. The elements you describe as "empty" are not empty; they are absent from the XML altogether.

I can think of two approaches you might use to make this work better. The first would be to change your XML such that every <Customer> element contains all the elements in the same order, even if the elements are completely empty. In other words your XML might look like this:

<Customer>
    <Id_Customer>xyz1</Id_Customer>
    <Segment>abc1</Segment>
    <Event>
      <Nature>info1</Nature>
      <Extrainfo>info2</Extrainfo>
    </Event>
    <Adress>
       <zipcode></zipcode>
       <street></street>
       <number></number>
    </Adress>
</Customer>
<Customer>
    <Id_Customer>zzwy</Id_Customer>
    <Segment>c2</Segment>
    <Event>
      <Nature></Nature>
      <Extrainfo></Extrainfo>
    </Event>
    <Adress>
       <zipcode>77098</zipcode>
       <street>belaire drive</street>
       <number>5</number>
    </Adress>
</Customer>

If you want you could add a condition in your Python code that would replace the empty string ("") with the word "empty" since you indicated that's what you wanted it to say.
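
As a rough illustration of that condition, here is a minimal sketch using only the standard library. The sample XML and the helper name text_or_empty are made up for this example; the point is that findtext returns an empty string for an element that is present but has no text, and None for one that is missing.

```python
from xml.etree import ElementTree

# Hypothetical sample: the address elements are present but empty.
xml_text = """
<Customer>
    <Id_Customer>xyz1</Id_Customer>
    <Adress>
        <zipcode></zipcode>
        <street></street>
    </Adress>
</Customer>
"""

customer = ElementTree.fromstring(xml_text)


def text_or_empty(parent, path):
    """Return the element's text, or "empty" if it is missing or blank."""
    value = parent.findtext(path)
    # findtext gives None for a missing element and "" for an empty one;
    # both are falsy, so a single truthiness check covers the two cases.
    return value.strip() if value and value.strip() else "empty"


paths = ("Id_Customer", "Adress/zipcode", "Adress/street")
row = [text_or_empty(customer, path) for path in paths]
print(row)
```

The same helper would also cope with the original XML, where the elements are absent rather than empty.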

The other approach requires more involved Python code but is probably the better one: use either a class or a dict to organize the data, with one dict or object per <Customer> element. For what you're doing, a class might be overkill, so a dict should be enough. (A defaultdict whose factory returns "empty", e.g. defaultdict(lambda: "empty"), would automatically supply the word "empty" when no value was found, so I'd look into that.)

Basically the flow of the program would go like this:

  1. Create an empty list to store your dicts: customers = []
  2. Loop through the <Customer> elements in the XML tree. For each customer:
    1. Create a new dict and add it to the list: customer = {} or customer = defaultdict(lambda: "empty"), then customers.append(customer)
    2. Loop through that element's child elements, and for each one populate the dict with its info. Something like customer[elem.tag] = elem.text may be what you're looking for. (Values such as Extrainfo and zipcode sit one level further down, inside <Event> and <Adress>, so you'll need to visit those nested elements too, for example with the element's iter() method.)
  3. Create a list of all the dict keys you want to grab, in the same order as the headers in your CSV. For example keys = ["Id_Customer", "Segment", etc...]
  4. Loop through the list you built in Steps 1 and 2, e.g. for customer in customers: For each iteration:
    1. Loop through the list you created in Step 3, e.g. for key in keys:
    2. For each key, get the corresponding value from the dict and add it to your CSV output. Assuming you have an open file object called csv_file, something like csv_file.write(customer[key]) would work; with a plain dict, use customer.get(key, "empty") so a missing key doesn't raise a KeyError. (You'll also want to write a comma after each value, unless it's the last iteration of the keys loop, in which case write a newline instead. You can test for that with key == keys[-1].)
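
The steps above can be sketched roughly as follows. This is only one way to do it, under a few assumptions: the XML is a trimmed copy of the sample from the question, a plain dict with dict.get stands in for the defaultdict, and ",".join replaces the manual comma-or-newline bookkeeping of step 4.

```python
from xml.etree import ElementTree

# Trimmed copy of the XML from the question.
xml_text = """
<CAT>
  <Customer>
    <Id_Customer>xyz1</Id_Customer>
    <Segment>abc1</Segment>
    <Event>
      <Nature>info1</Nature>
      <Extrainfo>info2</Extrainfo>
    </Event>
  </Customer>
  <Customer>
    <Id_Customer>zzwy</Id_Customer>
    <Segment>c2</Segment>
    <Adress>
      <zipcode>77098</zipcode>
      <street>belaire drive</street>
      <number>5</number>
    </Adress>
  </Customer>
</CAT>
"""

root = ElementTree.fromstring(xml_text)

# Step 3: the columns wanted, in CSV header order.
keys = ["Id_Customer", "Segment", "Extrainfo", "zipcode", "street"]

# Steps 1 and 2: one dict per <Customer>.
customers = []
for customer_elem in root.iter("Customer"):
    customer = {}
    customers.append(customer)
    # iter() visits every descendant, so nested tags like Extrainfo are seen.
    for elem in customer_elem.iter():
        if elem.tag in keys and elem.text and elem.text.strip():
            customer[elem.tag] = elem.text.strip()

# Step 4: one line per customer; dict.get supplies the placeholder.
lines = [
    ",".join(customer.get(key, "empty") for key in keys)
    for customer in customers
]
print("\n".join(lines))
```

Joining with a bare comma does not quote values that themselves contain commas, so for real data the csv module's writer is the safer choice.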

0

Have a look at the findtext method of xml.etree.ElementTree (https://docs.python.org/3.6/library/xml.etree.elementtree.html), and in particular its default argument.

I guess something like the following might work (not tested), with each customer in a separate list (one line in the csv file) that then gets appended to the general list_prm list. Of course, you would have to iterate over the lists when building the csv file.

If you really wanted all the element values in one list, you could skip the creation of the cust list and append the values directly to list_prm.

It all supposes that each subelement of Customer appears only once.

from xml.etree import ElementTree
import csv

list_prm = []

tree = ElementTree.parse('file.xml')
root = tree.getroot()

for elem in tree.iter('Customer'):
    # find() returns only the first matching Id_Customer child
    customer_id = elem.find('Id_Customer')
    if customer_id is not None:
        # Create a separate list for each Customer,
        # only if there's a Customer Id; skip creation otherwise
        cust = []

        # .text is an attribute, not a method
        cust.append(customer_id.text)
        cust.append(elem.findtext('Segment', default='empty'))
        # './/' searches all descendants: these tags are nested inside
        # <Event> and <Adress>, not placed directly under <Customer>
        cust.append(elem.findtext('.//Extrainfo', default='empty'))
        cust.append(elem.findtext('.//street', default='empty'))
        cust.append(elem.findtext('.//zipcode', default='empty'))

        list_prm.append(cust)


print(list_prm)
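
For the export step, a minimal sketch with the csv module might look like this. The rows are hard-coded in the shape the loop above is meant to produce, and the header names and output path are assumptions made for this example.

```python
import csv
import os
import tempfile

# Hypothetical rows: one list per customer, as built by the loop above.
list_prm = [
    ["xyz1", "abc1", "info2", "empty", "empty"],
    ["zzwy", "c2", "empty", "belaire drive", "77098"],
]
header = ["Id_Customer", "Segment", "Extrainfo", "street", "zipcode"]

# Assumed output location; replace with your own path.
out_path = os.path.join(tempfile.gettempdir(), "customers.csv")

# newline="" is what the csv module docs recommend for files given to csv.writer.
with open(out_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(list_prm)

# Read the file back to check what was written.
with open(out_path, newline="") as f:
    written = list(csv.reader(f))
print(written)
```

csv.writer takes care of the separators and of quoting any value that contains a comma.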

3 Comments

This looked like a quick win, but unfortunately it does not work. I replaced cust.append(customer_id.text()) with cust.append(customer_id.text) because the former raised an error. But the rest of the code only captures the customer ID; the other values are always set to the default value 'empty'.
I added .// to the path, as in list_prm.append(elem.findtext('.//xxxxxx', default='empty')), and it works! Cheers!
Sorry for the mistake. Sure, if you want everything in one list (no sublists per Customer), that should work. I'm happy you got it to work.
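
For anyone wondering why the .// prefix matters, here is a small sketch on a made-up one-customer document. A bare tag name only matches direct children, while a full path or .// also reaches nested elements.

```python
from xml.etree import ElementTree

# Hypothetical one-customer document, modelled on the question's XML.
customer = ElementTree.fromstring("""
<Customer>
    <Id_Customer>xyz1</Id_Customer>
    <Event>
        <Extrainfo>info2</Extrainfo>
    </Event>
</Customer>
""")

# A bare tag name matches direct children only, so the default is returned.
direct = customer.findtext("Extrainfo", default="empty")
# Spelling out the parent reaches the nested element.
nested = customer.findtext("Event/Extrainfo", default="empty")
# './/' matches the tag at any depth below the current element.
anywhere = customer.findtext(".//Extrainfo", default="empty")

print(direct, nested, anywhere)
```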
