I am trying to extract the display text for each hyperlink in a giant string. (The string is obtained by opening and reading an .rtf file, and the file has many hyperlinks.) The hyperlinks are generally in the format {\field{\*\fldinst HYPERLINK "http://www.mywebsite.com/"}{\fldrslt Click Here}} (I want Click Here), but often contain a lot of nested formatting with newlines:
Example 1 (I want to extract Leonard T. Strand): text I don't want {\\field {\\*\\fldinst HYPERLINK "http://www.westlaw.com/Link/Document/FullText?findType=h&pubNum=176284&cite=0226771601&originatingDoc=I2e197170e0a011eaa13ca2bed92d37fc&refType=RQ&originationContext=document&vr=3.0&rs=cblt1.0&transitionType=DocumentItem&contextData=(sc.Search)" }{\\fldrslt \n{\\b0 \\cf5 \\f2 \\ul0 \\strike0 \\i0 \\fs20 \\sa0 \\sb0 \nLeonard T. Strand\n}}} text I don't want
Example 2 (I want to extract Morgan v. Robinson and 920 F.3d 521, 523 (8th Cir. 2019)): text I don't want {\\field {\\*\\fldinst HYPERLINK "http://www.westlaw.com/Link/Document/FullText?findType=Y&serNum=2047938005&pubNum=0000506&originatingDoc=I2e197170e0a011eaa13ca2bed92d37fc&refType=RP&fi=co_pp_sp_506_523&originationContext=document&vr=3.0&rs=cblt1.0&transitionType=DocumentItem&contextData=(sc.Search)#co_pp_sp_506_523" }{\\fldrslt \n{\\b0 \\cf5 \\f2 \\i1 \\fs20 \n{\\b0 \\cf5 \\f2 \\ul0 \\strike0 \\i1 \\fs20 \\sa0 \\sb0 \nMorgan v. Robinson\n}\n}\n{\\b0 \\cf5 \\f2 \\ul0 \\strike0 \\i0 \\fs20 \\sa0 \\sb0 \n, 920 F.3d 521, 523 (8th Cir. 2019)\n}}} text I don't want
This works for the first example but not for the second: regex = re.compile('\n?\}?\n\{\\\\field.*\\\\fldrslt \n.*\n(.*)\n'). The pattern captures exactly one line of text after \fldrslt, so it cannot pick up text that is split across several nested groups. Ideally, I'd like something more generalizable that fits the broad structure of the hyperlink, but the multiple text locations in Example 2 are giving me problems.
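For reference, here is a rough sketch of one non-regex-only approach I could imagine, in case it helps frame an answer: locate each {\fldrslt group, walk forward counting braces to find where that group closes, then strip RTF control words, braces, and raw newlines from what is inside. The function name extract_link_texts and both pattern names are my own placeholders. It assumes the string holds single backslashes and real newlines (the examples above are shown in repr form), it joins all text runs of one link into a single string, and it does not decode escapes such as \'e9 or escaped braces.

```python
import re

# Start of a field-result group: "{\fldrslt" plus its optional delimiter space.
FLDRSLT = re.compile(r"\{\\fldrslt\b ?")
# An RTF control word: backslash, letters, optional numeric parameter,
# optional single delimiter space (for example "\fs20 " or "\b0 ").
CONTROL_WORD = re.compile(r"\\[a-zA-Z]+(?:-?\d+)? ?")


def extract_link_texts(rtf):
    """Return the display text of each {\\fldrslt ...} group found in rtf."""
    results = []
    pos = 0
    while True:
        m = FLDRSLT.search(rtf, pos)
        if m is None:
            break
        # Walk forward from the opening brace until the group is balanced.
        depth = 0
        end = None
        i = m.start()
        while i < len(rtf):
            ch = rtf[i]
            if ch == "\\":
                i += 2  # skip the backslash and the character it escapes
                continue
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    end = i
                    break
            i += 1
        if end is None:
            break  # unbalanced braces: stop rather than guess
        group = rtf[m.end():end]
        text = CONTROL_WORD.sub("", group)
        # Raw newlines carry no meaning in RTF, so drop them with the braces.
        text = re.sub(r"[{}\r\n]", "", text).strip()
        results.append(text)
        pos = end + 1
    return results


sample = r'{\field{\*\fldinst HYPERLINK "http://www.mywebsite.com/"}{\fldrslt Click Here}}'
print(extract_link_texts(sample))  # ['Click Here']
```

On Example 2 this would return the two runs joined as one string (Morgan v. Robinson, 920 F.3d 521, 523 (8th Cir. 2019)); if the runs are needed separately, the brace walk could collect text per nested group instead.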