0

I have this file which contains several math tags like so:

<Math 
   <Unique 262963>
   <BRect  1.02176" 0.09096" 1.86024" 0.40658">
   <MathFullForm `equal[therefore[char[tau]],plus[indexes[0,1,char[tau],char[c]],minus[times[indexes[
0,1,char[tau],char[s]],string[" and  "],over[times[char[d],char[omega]],times[char[
d],char[t]]]]]],over[char[tau],char[I]]]'
   > # end of MathFullForm
   <MathLineBreak  138.88883">
   <MathOrigin  1.95188" 0.32125">
   <MathAlignment Center>
   <MathSize MathMedium>
> # end of Math

And like so:

<Math 
   <Unique 87795>
   <Separation 0>
   <ObColor `Black'>
   <RunaroundGap  0.0 pt>
   <BRect  0.01389" 0.01389" 0.17519" 0.22013">
   <MathFullForm `indexes[0,1,char[m,0,0,1,0,0],char[i]]'
> # end of MathFullForm

I want to extract the contents of the Unique tag and the MathFullForm tag from each Math tag, but I am at a loss as to how to do so. Note that Unique tags also exist elsewhere in the file, outside of Math tags.

I've tried using regex, but my attempts miss many of the tags. I then thought about using an XML parser, but that wouldn't work because the file isn't valid XML.

Can anyone steer me in the right direction to do this in Python? (A regex solution is acceptable.)

  • Is your XML-like format an understood standard? I've not come across it before. Commented Jul 26, 2017 at 12:43
  • @chocksaway It is for Adobe Framemaker: help.adobe.com/en_US/framemaker/mifreference/mifref.pdf Commented Jul 26, 2017 at 12:45
  • Excellent - so a standard format. Commented Jul 26, 2017 at 13:21

3 Answers

1

You could use a loop to extract the tags: re.finditer() iterates over every match in the text.

Check the below code and see if it works for you.

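import re

# text holds the contents of the MIF file
text = re.sub(r'\r|\n', ' ', text)  # flatten line breaks so .*? can span wrapped tags
for m in re.finditer(r'<Unique\s.*?>', text):
    print(m.group())
for m in re.finditer(r'<MathFullForm\s.*?>', text):
    print(m.group())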

0

You can use this regex with the DOTALL flag (otherwise . would not match \n):

<(Unique|MathFullForm)(.*?)>

The first capturing group tells you whether the match belongs to a Unique or a MathFullForm tag, and the second contains the content of the tag.
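As a rough illustration, the regex above could be applied in Python like this. The sample string is copied from the question, and the names TAG_RE, sample, and matches are only illustrative. Note that this sketch matches every Unique tag in the input, including any outside Math tags, and would cut a tag short if its content contained a > character.

```python
import re

# The regex from this answer; re.DOTALL lets . match newlines too.
TAG_RE = re.compile(r"<(Unique|MathFullForm)(.*?)>", re.DOTALL)

# Sample taken from the question (second Math tag, shortened).
sample = """<Math 
   <Unique 87795>
   <BRect  0.01389" 0.01389" 0.17519" 0.22013">
   <MathFullForm `indexes[0,1,char[m,0,0,1,0,0],char[i]]'
> # end of MathFullForm
"""

# group(1) is the tag name, group(2) is its content (whitespace trimmed here).
matches = [(m.group(1), m.group(2).strip()) for m in TAG_RE.finditer(sample)]

for name, content in matches:
    print(name, content)
```

The MathFullForm content still carries its surrounding backtick and quote, so a further strip("`'") may be wanted.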

3 Comments

Sorry, I should have mentioned that Unique tags exist outside of Math tags as well
Why should this one be a problem for this regex?
I want the Unique number which corresponds to the MathFullForm

0

I have found the solution by using the following regex:

<Math\s*<Unique[^>]*>\s*(?:<Separation[^>]*>)*\s*(?:<ObColor[^>]*>)*\s*(?:<RunaroundGap[^>]*>)*\s*<BRect[^>]*>\s*<MathFullForm `[^']*'

This matches the whole tag, so I can use two more regexes to extract the necessary information.
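A possible variation, sketched here rather than taken from the answer, is to put capture groups into a single pattern so the Unique number and the MathFullForm content come out of the same match. This sketch assumes that Unique is the first tag inside Math, that the tags between Unique and MathFullForm contain no nested < or >, and that wrapped MathFullForm lines can simply be joined. The names MATH_RE and extract_math and the Frame tag in the sample are made up for illustration.

```python
import re

MATH_RE = re.compile(
    r"<Math\s*"                              # start of a Math tag
    r"<Unique\s+(?P<unique>\d+)>"            # its Unique number
    r"(?:\s*<\w+[^<>]*>)*?"                  # skip simple tags (BRect, ObColor, ...)
    r"\s*<MathFullForm\s+`(?P<form>[^']*)'"  # content between ` and '
)

def extract_math(text):
    """Return (unique, fullform) pairs for each Math tag found in text."""
    pairs = []
    for m in MATH_RE.finditer(text):
        # Assumption: line breaks inside MathFullForm are just wrapping.
        form = "".join(m.group("form").splitlines())
        pairs.append((m.group("unique"), form))
    return pairs

# Shortened sample; the Frame tag is a hypothetical non-Math Unique.
sample = """<Frame
   <Unique 11111>
> # end of Frame
<Math 
   <Unique 262963>
   <BRect  1.02176" 0.09096" 1.86024" 0.40658">
   <MathFullForm `equal[char[tau],over[char[
d],char[t]]]'
   > # end of MathFullForm
> # end of Math
<Math 
   <Unique 87795>
   <Separation 0>
   <ObColor `Black'>
   <RunaroundGap  0.0 pt>
   <BRect  0.01389" 0.01389" 0.17519" 0.22013">
   <MathFullForm `indexes[0,1,char[m,0,0,1,0,0],char[i]]'
> # end of MathFullForm
"""

for unique, form in extract_math(sample):
    print(unique, form)
```

Because the pattern is anchored on <Math, the Unique tag inside the made-up Frame tag is ignored, which keeps each Unique number paired with its own MathFullForm.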
