Parsing an XML-like file in Python

Question

I have this file which contains several math tags like so:

<Math 
   <Unique 262963>
   <BRect  1.02176" 0.09096" 1.86024" 0.40658">
   <MathFullForm `equal[therefore[char[tau]],plus[indexes[0,1,char[tau],char[c]],minus[times[indexes[
0,1,char[tau],char[s]],string[" and  "],over[times[char[d],char[omega]],times[char[
d],char[t]]]]]],over[char[tau],char[I]]]'
   > # end of MathFullForm
   <MathLineBreak  138.88883">
   <MathOrigin  1.95188" 0.32125">
   <MathAlignment Center>
   <MathSize MathMedium>
> # end of Math

And like so:

<Math 
   <Unique 87795>
   <Separation 0>
   <ObColor `Black'>
   <RunaroundGap  0.0 pt>
   <BRect  0.01389" 0.01389" 0.17519" 0.22013">
   <MathFullForm `indexes[0,1,char[m,0,0,1,0,0],char[i]]'
> # end of MathFullForm

I want to extract the contents of the Unique tag and the MathFullForm tag, but I am at a loss as to how to do so. Note that Unique tags also exist elsewhere in the file, outside of Math tags.

I've tried using regex, but that doesn't work too well and misses many of the tags. I then thought about using an XML parser, but that wouldn't work because the file isn't valid XML.

Can anyone steer me in the right direction to do this in Python? (A regex solution is acceptable.)

Is your XML-like format an understood standard? I've not come across it before.
– chocksaway, Commented Jul 26, 2017 at 12:43
@chocksaway It is for Adobe Framemaker: help.adobe.com/en_US/framemaker/mifreference/mifref.pdf
– Beta Decay, Commented Jul 26, 2017 at 12:45

Rohan Amrute · Accepted Answer · 2017-07-26 12:40:42Z

1

You could use a loop to extract the tags: re.finditer() iterates over every match of a pattern in the text.

Check the below code and see if it works for you.

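import re

# text is assumed to hold the contents of the file.
# Flatten line breaks so a tag that spans several lines can match.
text = re.sub(r'\r|\n', ' ', text)
for m in re.finditer(r'(<Unique\s).*?>', text):
    print(m.group())
for m in re.finditer(r'(<MathFullForm\s).*?>', text):
    print(m.group())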



Marco Luzzara · Accepted Answer · 2017-07-26 12:27:28Z

0

You can use this regex with the DOTALL flag set (otherwise . would not match \n):

<(Unique|MathFullForm)(.*?)>

The first capturing group tells you whether the match is a Unique or a MathFullForm tag; the second holds the content of the tag.
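
As a rough illustration of how that pattern might be applied in Python, here is a minimal sketch. The sample string is a shortened copy of the second snippet from the question, and the names sample, TAG_RE and matches are made up for this example. It assumes the tag contents never contain a > character, since the lazy .*? stops at the first one it finds.

```python
import re

# Shortened sample modelled on the question's second snippet (illustrative only).
sample = """<Math 
   <Unique 87795>
   <BRect  0.01389" 0.01389" 0.17519" 0.22013">
   <MathFullForm `indexes[0,1,char[m,0,0,1,0,0],char[i]]'
   > # end of MathFullForm
> # end of Math
"""

# re.DOTALL lets "." match newlines, so a tag may span several lines.
TAG_RE = re.compile(r"<(Unique|MathFullForm)(.*?)>", re.DOTALL)

# findall returns one (tag name, raw content) tuple per match;
# strip() drops the whitespace around the content.
matches = [(name, body.strip()) for name, body in TAG_RE.findall(sample)]

for name, body in matches:
    print(name, body)
```

Note that this collects every Unique tag in the text, including any that sit outside a Math tag, so on its own it does not pair a Unique number with its MathFullForm.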


3 Comments

Beta Decay Over a year ago

Sorry, I should have mentioned that Unique tags exist outside of Math tags as well

Marco Luzzara Over a year ago

Why should this one be a problem for this regex?

Beta Decay Over a year ago

I want the Unique number which corresponds to the MathFullForm

Beta Decay · Accepted Answer · 2017-07-26 12:43:40Z

0

I have found the solution by using the following regex:

<Math\s*<Unique[^>]*>\s*(?:<Separation[^>]*>)*\s*(?:<ObColor[^>]*>)*\s*(?:<RunaroundGap[^>]*>)*\s*<BRect[^>]*>\s*<MathFullForm `[^']*'

This matches the whole tag, so I can use two more regexes to extract the necessary information.
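
The two follow-up regexes could possibly be avoided by adding capture groups to the same pattern. The sketch below is one way that might look, not the code actually used. The sample text is a shortened, partly made-up version of the snippets in the question, and the names MATH_RE, sample and pairs are hypothetical. It assumes the Unique value is always an integer and that line breaks inside a MathFullForm string are only wrapping and can be dropped.

```python
import re

# Same structure as the regex above, with capture groups added for
# the Unique number and the MathFullForm string.
MATH_RE = re.compile(
    r"<Math\s*<Unique\s+(\d+)>\s*"
    r"(?:<Separation[^>]*>)*\s*(?:<ObColor[^>]*>)*\s*(?:<RunaroundGap[^>]*>)*\s*"
    r"<BRect[^>]*>\s*<MathFullForm `([^']*)'"
)

# Illustrative sample: a stray Unique tag outside Math, then two Math tags.
sample = """<Unique 111>
<Math 
   <Unique 262963>
   <BRect  1.02176" 0.09096" 1.86024" 0.40658">
   <MathFullForm `over[char[
tau],char[I]]'
   > # end of MathFullForm
> # end of Math
<Math 
   <Unique 87795>
   <Separation 0>
   <ObColor `Black'>
   <RunaroundGap  0.0 pt>
   <BRect  0.01389" 0.01389" 0.17519" 0.22013">
   <MathFullForm `indexes[0,1,char[m,0,0,1,0,0],char[i]]'
   > # end of MathFullForm
> # end of Math
"""

# Pair each Unique number with its formula; removing "\n" rejoins wrapped
# formula lines (an assumption about how the file wraps long strings).
pairs = [(int(unique), formula.replace("\n", ""))
         for unique, formula in MATH_RE.findall(sample)]

for unique, formula in pairs:
    print(unique, formula)
```

Because the pattern has to start at <Math, the Unique tag outside any Math tag is skipped. A Math tag containing a child tag that the pattern does not list between Unique and BRect would still be missed.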

