Using the following code, I load in an XML file that contains email data.
from xml.etree import ElementTree
with open(xmlFile, "r") as f:
    xml = ElementTree.parse(f)
I then initialize all of my variables (subsetted here for brevity):
index = []
sender = []
subject = []
date = []
And then lastly try to loop through the emails:
for node in xml.findall(".//header"):
    index.append(node.attrib.get('index'))
    sender.append(node.attrib.get('from'))
    subject.append(node.attrib.get('subject'))
    date.append(node.attrib.get('date'))
The problem is that when I do that, I get the wrong output. Now, I can't provide the data because it's confidential, but I can give what I believe should be enough to get me looking in the right direction for what's going wrong.
In [127]: nodes = xml.findall(".//header")
In [128]: len(nodes)
Out[128]: 12018
In [129]: len(index)
Out[129]: 48072
In [130]: nodes[0].attrib.viewkeys()
Out[130]: dict_keys(['index', 'from', 'read', 'headerLink', 'messageType', 'contentLink', 'state', 'messageId', 'date', 'folder', 'folderId', 'rawLink', 'subject'])
In [131]: index[0:3]
Out[131]:
['0',
'(NYTimes.com News Alert) [email protected]',
'Breaking News: At Florida State, Football Eclipses Justice: Records Show Police Often Go Easy on Players']
In [132]: for node in xml.findall(".//header")[0:3]: print(node.attrib.get("index"))
0
1
2
Any thoughts on what I'm missing? I'm pretty new to Python, but not coding, and I can't see where I'm going wrong. Thanks in advance!
Did you run the for loop 4 times in the interactive Python interpreter, without reinitializing index and the other lists?
index would still be [0, 1, 2], but then the whole value set would repeat at position 12018. As you can see above, there are values in index that should not be there, which implies it's something other than rerunning the loop without reinitializing.
Are you sure you don't append everything to index?
Did you write index = sender = subject = date = [], for ease?
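A minimal sketch of what that last question is getting at, assuming the lists were created with a chained assignment (the post does not show the actual initialization, so this is a guess). The SAMPLE string and its example.com addresses are hypothetical stand-ins for the confidential file. With a chained assignment all four names point at the same list, so every pass through the loop appends four values to it. That would match both symptoms above: 48072 is exactly 4 x 12018, and index[0:3] holds an index, a sender and a subject.

```python
from xml.etree import ElementTree

# Hypothetical stand-in for the confidential XML file.
SAMPLE = """<mail>
  <header index="0" from="a@example.com" subject="Hello" date="2014-10-10"/>
  <header index="1" from="b@example.com" subject="World" date="2014-10-11"/>
</mail>"""

nodes = ElementTree.fromstring(SAMPLE).findall(".//header")

# Assumed initialization: one list object bound to four names.
index_shared = sender_shared = subject_shared = date_shared = []
for node in nodes:
    index_shared.append(node.attrib.get("index"))
    sender_shared.append(node.attrib.get("from"))
    subject_shared.append(node.attrib.get("subject"))
    date_shared.append(node.attrib.get("date"))

print(index_shared is sender_shared)   # same object under two names
print(len(nodes), len(index_shared))   # four appends per node
print(index_shared[0:3])

# Four separate list objects behave as intended.
index, sender, subject, date = [], [], [], []
for node in nodes:
    index.append(node.attrib.get("index"))
    sender.append(node.attrib.get("from"))
    subject.append(node.attrib.get("subject"))
    date.append(node.attrib.get("date"))

print(index)
```

If the lists in the real session were instead created on separate lines, as in the snippet in the question, this explanation would not apply and the cause would have to be something else.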