I'm trying to extract tags from an XML file using regular expressions (the re module) in Python. I need to extract the nodes that start with the tag "<PE" together with their corresponding unit IDs, which are in the nodes directly above each "<PE" tag. The file can be seen here.
When I use the code below, the "<Unit ID" tags in my output are not the ones that correspond to each "<PE" tag. For example, the content that my output pairs with "<Unit ID=250" actually belongs to "<Unit ID=149" in the original file. The code also skips some "<Unit ID" tags. Can anyone see where the error in my code is?
import re
t=open('ALICE.per1_replaced.txt','r')
t=t.read()
unitid=re.findall('<unit.*?"pe">', t, re.DOTALL)
PE=re.findall("<PE.*?</PE>", t, re.DOTALL)
a=zip(unitid,PE)
tp=tuple(a)
w=open('Tags.txt','w')
for x, j in tp:
    a=x + '\n'+j + '\n'
    w.write(a)
w.close()
I've tried this version as well but I had the same problems:
with open('ALICE.per1_replaced.txt','r') as t:
    contents = t.read()
unitid=re.findall('<unit.*?"pe">', contents, re.DOTALL)
PE=re.findall('<PE.*?</PE>', contents, re.DOTALL)
with open('PEtagsper1.txt','w') as fi:
    for i, p in zip(unitid, PE):
        fi.write("{}\n{}\n".format(i, p))
My desired output is a file in which each "<Unit ID=" tag is followed by the content of the tag that starts with "<PE" and ends with "</PE>", as below:
<unit id="16" status="FINISHED" type="pe">
<PE producer="A1.ALICE_GG"><html>
<head>
</head>
<body>
Eu vou me atrasar!' (quando ela voltou a pensar sobre isso mais trade,
ocorreu-lhe que deveria ter achado isso curioso, mas na hora tudo pareceu
bastante natural); mas quando o Coelho de fato tirou um relógio do bolso
do colete e olhou-o, e então se apressou, Alice pôs-se de pé, pois lhe
ocorreu que nunca antes vira um coelho com um colete, ou com um relógio de
bolso pra tirar, e queimando de curiosidade, ela atravessou o campo atrás
dele correndo e, felizmente, chegou justo a tempo de vê-lo entrar dentro
de uma grande toca de coelho sob a cerca.
</body>
</html></PE>
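For reference, here is a minimal sketch of one way the pairing could be kept intact. It rests on assumptions I have not verified against the whole file: that every "<PE" block sits after the opening tag of the unit it belongs to and before the next "<unit" opening tag. The two findall calls in my code build independent lists, so zip pairs them by position only. Also, with re.DOTALL the pattern '<unit.*?"pe">' can start at a unit of another type and run on until the next "pe">, swallowing several unit tags in one match. The names pair_units_with_pe, UNIT_TAG, PE_BLOCK and the sample string are made up for illustration.

```python
import re

# Matches a single opening <unit ...> tag; [^>]* cannot run past the end of the tag.
UNIT_TAG = re.compile(r'<unit\b[^>]*>')
# Matches one <PE ...>...</PE> block, including line breaks inside it.
PE_BLOCK = re.compile(r'<PE\b.*?</PE>', re.DOTALL)


def pair_units_with_pe(text):
    """Return (unit_tag, pe_block) pairs, keeping each PE with the unit above it.

    Assumption: a PE block belongs to the nearest <unit ...> tag before it.
    """
    units = list(UNIT_TAG.finditer(text))
    pairs = []
    for idx, unit in enumerate(units):
        # This unit's region ends where the next unit's opening tag starts.
        end = units[idx + 1].start() if idx + 1 < len(units) else len(text)
        # Search for PE blocks only inside this unit's region.
        for pe in PE_BLOCK.finditer(text, unit.end(), end):
            pairs.append((unit.group(0), pe.group(0)))
    return pairs


# Hypothetical miniature input: unit 2 has no PE block and should be skipped.
sample = (
    '<unit id="1" status="FINISHED" type="pe">\n'
    '<PE producer="A">first</PE>\n'
    '</unit>\n'
    '<unit id="2" status="FINISHED" type="ht">\n'
    '</unit>\n'
    '<unit id="3" status="FINISHED" type="pe">\n'
    '<PE producer="A">third</PE>\n'
    '</unit>\n'
)

pairs = pair_units_with_pe(sample)
for unit_tag, pe_block in pairs:
    print(unit_tag)
    print(pe_block)
```

Because each PE block is searched for only between its own unit tag and the next one, a unit without a PE block yields no pair and cannot shift the later pairs out of alignment. For anything beyond this simple layout, an XML parser such as xml.etree.ElementTree would probably be more reliable than regular expressions, provided the file is well-formed.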