Using the following code, I load in an XML file that contains email data.

from xml.etree import ElementTree

with open(xmlFile, "r") as f:
    xml = ElementTree.parse(f)

I then initialize all of my variables (subsetted here for brevity):

index = []
sender = []
subject = []
date = []

And then lastly try to loop through the emails:

for node in xml.findall(".//header"): 
  index.append(node.attrib.get('index')) 
  sender.append(node.attrib.get('from')) 
  subject.append(node.attrib.get('subject')) 
  date.append(node.attrib.get('date'))

The problem is that when I do that, I get the wrong output. Now, I can't provide the data because it's confidential, but I can give what I believe should be enough to get me looking in the right direction for what's going wrong.

In [127]: nodes = xml.findall(".//header")
In [128]: len(nodes)
Out[128]: 12018

In [129]: len(index)
Out[129]: 48072

In [130]: nodes[0].attrib.viewkeys()
Out[130]: dict_keys(['index', 'from', 'read', 'headerLink', 'messageType', 'contentLink', 'state', 'messageId', 'date', 'folder', 'folderId', 'rawLink', 'subject'])

In [131]: index[0:3]
Out[131]: 
['0',
 '(NYTimes.com News Alert) [email protected]',
 'Breaking News:  At Florida State, Football Eclipses Justice: Records Show Police Often Go Easy on Players']

In [132]: for node in xml.findall(".//header")[0:3]: print(node.attrib.get("index"))
0
1
2

Any thoughts on what I'm missing? I'm pretty new to Python, but not coding, and I can't see where I'm going wrong. Thanks in advance!

  • Did you run your for loop 4 times in the interactive Python interpreter, without reinitializing index and the other lists? Commented Jul 20, 2015 at 17:45
  • @AnandSKumar No, I didn't. If that were the problem the first three values in index would still be [0, 1, 2], but then the whole value set would repeat at position 12018. As you can see above, there are values in index that should not be there, which implies it's something other than rerunning the loop without reinitializing. Commented Jul 20, 2015 at 17:48
  • Is that your exact code? Are you sure you did not append everything to index by mistake? Commented Jul 20, 2015 at 17:50
  • Yes, that is my exact code. Commented Jul 20, 2015 at 17:52
  • Are you sure you did not write index = sender = subject = date = [] for convenience? Commented Jul 20, 2015 at 17:54

1 Answer

Based on the comments, it appears you initialized your lists like this:

index = sender = subject = date = []

That statement creates only one list, and all four names (index, sender, subject, date) point to that same list. You can confirm this with id(), which returns the same value for every name:

>>> index = sender = subject = date = []
>>> id(index)
8237464
>>> id(sender)
8237464
>>> id(subject)
8237464
>>> id(date)
8237464

Then, when you run this loop:

for node in xml.findall(".//header"): 
  index.append(node.attrib.get('index')) 
  sender.append(node.attrib.get('from')) 
  subject.append(node.attrib.get('subject')) 
  date.append(node.attrib.get('date'))

All four values from each header are appended to that single list, which every one of the names refers to. That is why index ends up with 48072 entries (4 × 12018 headers) and why its first elements are an index, then a sender, then a subject, instead of three indices.
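The effect can be reproduced without the XML file. The sketch below is only an illustration: the `headers` list and its values are made up and stand in for the `node.attrib` dictionaries from your data.

```python
# Hypothetical stand-in for the attrib dicts of two <header> nodes.
headers = [
    {"index": "0", "from": "a@example.com", "subject": "first", "date": "2015-07-01"},
    {"index": "1", "from": "b@example.com", "subject": "second", "date": "2015-07-02"},
]

# Chained assignment: one list object bound to four names.
index = sender = subject = date = []

for attrib in headers:
    index.append(attrib.get("index"))
    sender.append(attrib.get("from"))
    subject.append(attrib.get("subject"))
    date.append(attrib.get("date"))

print(index is sender is subject is date)  # True: all names share one object
print(len(index))                          # 8, i.e. 4 appends x 2 headers
print(index[0:3])                          # ['0', 'a@example.com', 'first']
```

The output has the same shape as what you posted: four times as many entries as headers, with the fields interleaved.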

Define each list separately, as in the example in your question, rather than with the chained assignment:

index = []
sender = []
subject = []
date = []
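If you would rather keep the initialization on one line, tuple unpacking is one possible alternative, because each `[]` on the right-hand side is evaluated separately. This is a sketch with made-up values, not something taken from your code:

```python
# Each [] literal creates its own list, so the four names are independent.
index, sender, subject, date = [], [], [], []

index.append("0")
sender.append("a@example.com")  # hypothetical value

print(index)            # ['0']
print(sender)           # ['a@example.com']
print(index is sender)  # False: two distinct list objects

# The same idea for an arbitrary number of lists, using a generator expression.
a, b, c, d = ([] for _ in range(4))
print(a is b)           # False
```

Either form behaves the same as the four separate assignments above; which one reads best is a matter of taste.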

1 Comment

That fixed it. Great explanation of what's actually happening, too. It helped me learn as well as fixing my problem. Thanks, Anand!
