I'm trying to extract tags from an XML file using regular expressions (the re module) in Python. I need to extract the nodes that start with the tag "<PE", together with their corresponding unit IDs, which are the "<unit" nodes above each "<PE" tag. The file can be seen here

When I use the code below, I don't get the correct "<unit id" tags, that is, the ones that correspond to each "<PE" tag. For example, in my output, the content paired with "<unit id=250" actually belongs to "<unit id=149" in the original file. The code also skips some "<unit id" tags. Can anyone see where the error in my code is?

import re

t = open('ALICE.per1_replaced.txt', 'r')
t = t.read()

unitid = re.findall('<unit.*?"pe">', t, re.DOTALL)
PE = re.findall("<PE.*?</PE>", t, re.DOTALL)

a = zip(unitid, PE)
tp = tuple(a)

w = open('Tags.txt', 'w')
for x, j in tp:
    a = x + '\n' + j + '\n'
    w.write(a)
w.close()

I've tried this version as well but I had the same problems:

with open('ALICE.per1_replaced.txt', 'r') as t:
    contents = t.read()

unitid = re.findall('<unit.*?"pe">', contents, re.DOTALL)
PE = re.findall('<PE.*?</PE>', contents, re.DOTALL)
with open('PEtagsper1.txt', 'w') as fi:
    for i, p in zip(unitid, PE):
        fi.write("{}\n{}\n".format(i, p))

My desired output is a file in which each "<unit id=" tag is followed by the content of the tag that starts with "<PE" and ends with "</PE>", as below:

<unit id="16" status="FINISHED" type="pe">
<PE producer="A1.ALICE_GG"><html>
  <head>

  </head>
  <body>
    Eu vou me atrasar!' (quando ela voltou a pensar sobre isso mais trade, 
    ocorreu-lhe que deveria ter achado isso curioso, mas na hora tudo pareceu 
    bastante natural); mas quando o Coelho de fato tirou um relógio do bolso 
    do colete e olhou-o, e então se apressou, Alice pôs-se de pé, pois lhe 
    ocorreu que nunca antes vira um coelho com um colete, ou com um relógio de 
    bolso pra tirar, e queimando de curiosidade, ela atravessou o campo atrás 
    dele correndo e, felizmente, chegou justo a tempo de vê-lo entrar dentro 
    de uma grande toca de coelho sob a cerca.
  </body>
</html></PE>
Regex by itself probably isn't the tool for the job here. Why not use the xml or BeautifulSoup package? Commented Feb 2, 2021 at 18:13

1 Answer

You seem to have multiple <PE> tags under some <unit> tags (e.g., for unit 3), so the two lists have different lengths and zip pairs them incorrectly. As @Error_2646 noted in the comments, an XML parser or Beautiful Soup would work better for this job.
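As a rough illustration of the parser route, here is a minimal sketch using the standard library's xml.etree.ElementTree. It assumes the file is well-formed XML with <PE> elements nested inside <unit> elements; the SAMPLE string, the <root> wrapper, and the helper name pe_by_unit are made up for the example and are not taken from the actual file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mimics the structure described in the question;
# the real file may use a different root element or extra attributes.
SAMPLE = """<root>
<unit id="1" status="FINISHED" type="pe">
<PE producer="A1"><html><body>first</body></html></PE>
</unit>
<unit id="2" status="FINISHED" type="pe">
<PE producer="A1"><html><body>second</body></html></PE>
<PE producer="A2"><html><body>third</body></html></PE>
</unit>
<unit id="3" status="FINISHED" type="source">
</unit>
</root>"""


def pe_by_unit(xml_text):
    """Return a list of (unit id, [serialized PE elements]) pairs."""
    root = ET.fromstring(xml_text)
    result = []
    for unit in root.iter("unit"):
        # Collect every PE element inside this unit, at any depth.
        pes = [
            ET.tostring(pe, encoding="unicode").strip()
            for pe in unit.iter("PE")
        ]
        if pes:  # skip units that contain no PE element
            result.append((unit.get("id"), pes))
    return result


for unit_id, pes in pe_by_unit(SAMPLE):
    print(f'<unit id="{unit_id}">')
    for pe in pes:
        print(pe)
```

Because each <PE> is looked up inside its own <unit>, a unit with two or three <PE> elements cannot shift the pairing for the units that follow. If the real file is not well-formed XML, ET.fromstring raises a ParseError, and a more lenient parser such as Beautiful Soup may be a better fit.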

But if for whatever reason you want to stick to regex, you can fix this by running a regex on the list of strings returned by the first regex. Example code that worked on the small part of the input I downloaded:

# t holds the file contents, as in the question's code
units = re.findall('<unit.*?</unit>', t, re.DOTALL)
unitList = []
for unit in units:
    # first get the opening unit tag
    unitid = re.findall('<unit.*?"pe">', unit, re.DOTALL)  # same as the one you use
    # there should be exactly one within each unit
    assert len(unitid) == 1
    # now find all PEs for this unit
    PE = re.findall("<PE.*?</PE>", unit, re.DOTALL)  # same as the one you use
    # combine results
    output = unitid[0] + "\n"
    for pe in PE:
        output += pe + "\n"
    unitList.append(output)

for x in unitList:
    print(x)
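To see why the original zip pairing drifts while the per-unit search does not, here is a small self-contained comparison. The SAMPLE string is invented for the demonstration (unit 1 has two <PE> tags, unit 2 has one); it is only a sketch of the situation described above, not an excerpt from the real file.

```python
import re

# Hypothetical input: the first unit contains two PE tags, the second one.
SAMPLE = (
    '<unit id="1" status="FINISHED" type="pe">\n<PE>a</PE>\n<PE>b</PE>\n</unit>\n'
    '<unit id="2" status="FINISHED" type="pe">\n<PE>c</PE>\n</unit>\n'
)

# Approach from the question: two independent lists, paired by position.
unit_tags = re.findall(r'<unit.*?"pe">', SAMPLE, re.DOTALL)
pe_blocks = re.findall(r'<PE.*?</PE>', SAMPLE, re.DOTALL)
zipped = list(zip(unit_tags, pe_blocks))
# zip stops at the shorter list, so unit 2 is paired with the second PE
# of unit 1, and the last PE is dropped.

# Approach from the answer: search for PE tags inside each unit separately.
grouped = []
for unit in re.findall(r'<unit.*?</unit>', SAMPLE, re.DOTALL):
    tag = re.search(r'<unit.*?>', unit).group(0)
    grouped.append((tag, re.findall(r'<PE.*?</PE>', unit, re.DOTALL)))

print(zipped)
print(grouped)
```

In the zipped result, the tag for unit 2 ends up next to `<PE>b</PE>`, which belongs to unit 1. That is the same kind of mismatch as the 250/149 example in the question.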

2 Comments

This code works well. What does "assert (len(unitid) == 1)" do?
That is a sanity check I added to make sure there is exactly one unit tag in each loop iteration. If the check fails, Python raises an AssertionError, which stops the program with a traceback unless the exception is caught.
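A tiny sketch of that behavior, using a made-up helper called single_unit_tag (not part of the answer's code):

```python
def single_unit_tag(tags):
    # The text after the comma becomes the message of the AssertionError.
    assert len(tags) == 1, f"expected exactly one unit tag, found {len(tags)}"
    return tags[0]


# Passing case: the only tag is returned unchanged.
print(single_unit_tag(['<unit id="1">']))

# Failing case: an empty list triggers AssertionError.
try:
    single_unit_tag([])
except AssertionError as err:
    message = str(err)
    print("assertion failed:", message)
```

Note that assert statements are skipped entirely when Python runs with the -O option, so for validation that must always happen, an explicit if check that raises an exception is the safer choice.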
