I am trying to extract the display text for each hyperlink in a giant string. (The string is obtained by opening and reading an .rtf file, and the file has many hyperlinks.) The hyperlinks are generally in the format {\field{\*\fldinst HYPERLINK "http://www.mywebsite.com/"}{\fldrslt Click Here}} (I want Click Here), but often contain a lot of nested formatting with newlines:

Example 1 (I want to extract Leonard T. Strand): text I don't want {\\field {\\*\\fldinst HYPERLINK "http://www.westlaw.com/Link/Document/FullText?findType=h&pubNum=176284&cite=0226771601&originatingDoc=I2e197170e0a011eaa13ca2bed92d37fc&refType=RQ&originationContext=document&vr=3.0&rs=cblt1.0&transitionType=DocumentItem&contextData=(sc.Search)" }{\\fldrslt \n{\\b0 \\cf5 \\f2 \\ul0 \\strike0 \\i0 \\fs20 \\sa0 \\sb0 \nLeonard T. Strand\n}}} text I don't want

Example 2 (I want to extract Morgan v. Robinson and 920 F.3d 521, 523 (8th Cir. 2019)): text I don't want {\\field {\\*\\fldinst HYPERLINK "http://www.westlaw.com/Link/Document/FullText?findType=Y&serNum=2047938005&pubNum=0000506&originatingDoc=I2e197170e0a011eaa13ca2bed92d37fc&refType=RP&fi=co_pp_sp_506_523&originationContext=document&vr=3.0&rs=cblt1.0&transitionType=DocumentItem&contextData=(sc.Search)#co_pp_sp_506_523" }{\\fldrslt \n{\\b0 \\cf5 \\f2 \\i1 \\fs20 \n{\\b0 \\cf5 \\f2 \\ul0 \\strike0 \\i1 \\fs20 \\sa0 \\sb0 \nMorgan v. Robinson\n}\n}\n{\\b0 \\cf5 \\f2 \\ul0 \\strike0 \\i0 \\fs20 \\sa0 \\sb0 \n, 920 F.3d 521, 523 (8th Cir. 2019)\n}}} text I don't want

This works for the first type but not for the second:

regex = re.compile('\n?\}?\n\{\\\\field.*\\\\fldrslt \n.*\n(.*)\n')

Ideally, I'd like something more generalizable that fits the broad structure of the hyperlink, but the multiple text locations in Example 2 are giving me problems.

  • Why don’t you match 'HYPERLINK\s+"(http.*?)"' and then use group 1? Commented Nov 20, 2020 at 22:52

1 Answer

Looking at the example data, you can use a specific match for the field and fldinst parts, and then match the rest of the line that contains fldrslt.

Then capture all of the following lines in group 1 until you reach a line that starts with }}}

Then, from capture group 1, remove every line that starts with { or }, and strip a leading comma (plus any spaces after it) from the lines that remain.

Note that this is based on the example data, and does not take balanced curly brackets into account.
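
If balanced brackets do matter for your data, one alternative is to count braces instead of relying on }}}. The following is only a minimal sketch, not part of the patterns in this answer: the function name fldrslt_texts and the shortened sample are made up for illustration, it assumes the doubled backslashes shown in the example data, and it ignores RTF escape sequences such as escaped braces.

```python
import re

# Literal text {\\fldrslt (two backslashes, as in the example data).
MARKER = "{\\\\fldrslt"

def fldrslt_texts(s):
    """Yield the display text of every fldrslt group found by brace counting."""
    pos = s.find(MARKER)
    while pos != -1:
        depth = 0
        for i in range(pos, len(s)):
            if s[i] == "{":
                depth += 1
            elif s[i] == "}":
                depth -= 1
                if depth == 0:
                    break  # i is the brace that closes the fldrslt group
        else:
            return  # braces never balanced: stop scanning
        body = s[pos + len(MARKER):i]
        # Drop control words such as \\b0 or \\strike0 (hypothetical simplification).
        body = re.sub(r"\\\\[A-Za-z]+-?\d* ?", "", body)
        body = body.replace("{", "").replace("}", "")
        text = " ".join(body.split())          # collapse newlines and spaces
        yield re.sub(r"\s+,", ",", text)       # no space before a comma
        pos = s.find(MARKER, i)

sample = (
    "skip {\\\\field {\\\\*\\\\fldinst HYPERLINK \"http://www.mywebsite.com/\" }{\\\\fldrslt \n"
    "{\\\\b0 \\\\i1 \n"
    "{\\\\b0 \\\\ul0 \nMorgan v. Robinson\n}\n}\n"
    "{\\\\b0 \\\\i0 \n, 920 F.3d 521, 523 (8th Cir. 2019)\n}}} skip"
)
print(list(fldrslt_texts(sample)))
```

Unlike the two-step regex approach, this returns one string per hyperlink and does not depend on where the line breaks fall.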

Pattern to get group 1

{\\\\field\s*{\\\\\*\\\\fldinst HYPERLINK\s+"https?://[^"]+"\s+}{\\\\fldrslt.*\r?\n((?:(?!}}}).*\r?\n)*)}}}

About the pattern

  • {\\\\field\s*{\\\\\*\\\\fldinst HYPERLINK\s+"https?://[^"]+"\s+} Match the field and the HYPERLINK part
  • {\\\\fldrslt.*\r?\n Match the fldrslt part
  • ( Capture group 1
    • (?:(?!}}}).*\r?\n)* Repeat matching all lines that do not start with }}}
  • ) Close group 1
  • }}} Match ending }}}
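
To see the parts listed above side by side, the same pattern can be written with re.VERBOSE, which allows comments inside the regex. This is a sketch for illustration only: the name FIELD and the shortened sample are assumptions, and the literal space before HYPERLINK is written as [ ] because verbose mode ignores plain whitespace.

```python
import re

FIELD = re.compile(r"""
    {\\\\field\s*                       # opening of the field group
    {\\\\\*\\\\fldinst[ ]HYPERLINK\s+   # the fldinst HYPERLINK instruction
    "https?://[^"]+"\s+}                # quoted URL, then the closing brace
    {\\\\fldrslt.*\r?\n                 # rest of the line containing fldrslt
    ((?:(?!}}}).*\r?\n)*)               # group 1: lines that do not start with }}}
    }}}                                 # closing braces of the field
""", re.VERBOSE)

sample = (
    "text {\\\\field {\\\\*\\\\fldinst HYPERLINK \"http://www.mywebsite.com/\" }{\\\\fldrslt \n"
    "{\\\\b0 \\\\i0 \n"
    "Click Here\n"
    "}}} text"
)
m = FIELD.search(sample)
print(repr(m.group(1)))  # the formatting line and the text line, still uncleaned
```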

Pattern to remove all unwanted lines from group 1

^(?:[{}].*[\r\n]*|,[^\S\r\n]*)
  • ^ Start of a line (the pattern is used with re.MULTILINE)
  • (?: Non capture group
    • [{}].*[\r\n]* Match a line that starts with { or }, including the trailing newlines
    • | Or
    • ,[^\S\r\n]* Match a , followed by optional whitespace chars without a newline
  • ) Close group
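
As a small illustration of the cleanup step on its own, the sketch below applies the pattern to a hand-written stand-in for group 1. The names CLEANUP and group1 are made up for this example.

```python
import re

CLEANUP = re.compile(r"^(?:[{}].*[\r\n]*|,[^\S\r\n]*)", re.MULTILINE)

# Stand-in for what group 1 holds after the first pattern has matched.
group1 = (
    "{\\\\b0 \\\\i1 \n"
    "Morgan v. Robinson\n"
    "}\n"
    "{\\\\b0 \\\\i0 \n"
    ", 920 F.3d 521, 523 (8th Cir. 2019)\n"
)

# Lines starting with { or } disappear; the leading ", " is stripped.
lines = CLEANUP.sub("", group1).splitlines()
print(lines)
```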

Example code

import re
 
regex = r"{\\\\field\s*{\\\\\*\\\\fldinst HYPERLINK\s+\"https?://[^\"]+\"\s+}{\\\\fldrslt.*\r?\n((?:(?!}}}).*\r?\n)*)}}}"
 
test_str = ("text I don't want {\\\\field {\\\\*\\\\fldinst HYPERLINK \"http://w...content-available-to-author-only...w.com/Link/Document/FullText?findType=Y&serNum=2047938005&pubNum=0000506&originatingDoc=I2e197170e0a011eaa13ca2bed92d37fc&refType=RP&fi=co_pp_sp_506_523&originationContext=document&vr=3.0&rs=cblt1.0&transitionType=DocumentItem&contextData=(sc.Search)#co_pp_sp_506_523\" }{\\\\fldrslt \n"
            "{\\\\b0 \\\\cf5 \\\\f2 \\\\i1 \\\\fs20 \n"
            "{\\\\b0 \\\\cf5 \\\\f2 \\\\ul0 \\\\strike0 \\\\i1 \\\\fs20 \\\\sa0 \\\\sb0 \n"
            "Morgan v. Robinson\n"
            "}\n"
            "}\n"
            "{\\\\b0 \\\\cf5 \\\\f2 \\\\ul0 \\\\strike0 \\\\i0 \\\\fs20 \\\\sa0 \\\\sb0 \n"
            ", 920 F.3d 521, 523 (8th Cir. 2019)\n"
            "}}} text I don't want\n\n"
            "text I don't want {\\\\field {\\\\*\\\\fldinst HYPERLINK \"http://w...content-available-to-author-only...w.com/Link/Document/FullText?findType=h&pubNum=176284&cite=0226771601&originatingDoc=I2e197170e0a011eaa13ca2bed92d37fc&refType=RQ&originationContext=document&vr=3.0&rs=cblt1.0&transitionType=DocumentItem&contextData=(sc.Search)\" }{\\\\fldrslt \n"
            "{\\\\b0 \\\\cf5 \\\\f2 \\\\ul0 \\\\strike0 \\\\i0 \\\\fs20 \\\\sa0 \\\\sb0 \n"
            "Leonard T. Strand\n"
            "}}} text I don't want")
 
for g in re.findall(regex, test_str):
    # Drop the formatting lines and leading commas from each captured block
    print(re.sub(r"^(?:[{}].*[\r\n]*|,[^\S\r\n]*)", "", g, flags=re.MULTILINE))

Output

Morgan v. Robinson
920 F.3d 521, 523 (8th Cir. 2019)

Leonard T. Strand
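
One caveat about the backslashes: the patterns above expect two literal backslashes before each control word, as in the test string. If the doubled backslashes in the question are only an artifact of printing the string's repr, the text read from the file has a single backslash per control word, and each \\\\ in the pattern becomes \\. The sketch below shows that variant; it assumes the file uses the same line layout as the examples, and the names FIELD_SINGLE, CLEANUP and display_texts are hypothetical.

```python
import re

# Assumption: one backslash per control word, as in text read straight from an .rtf file.
FIELD_SINGLE = re.compile(
    r'{\\field\s*{\\\*\\fldinst HYPERLINK\s+"https?://[^"]+"\s+}'
    r'{\\fldrslt.*\r?\n((?:(?!}}}).*\r?\n)*)}}}'
)
CLEANUP = re.compile(r"^(?:[{}].*[\r\n]*|,[^\S\r\n]*)", re.MULTILINE)

def display_texts(rtf):
    """Return one list of display-text lines per hyperlink field."""
    return [CLEANUP.sub("", body).splitlines() for body in FIELD_SINGLE.findall(rtf)]

sample = (
    'skip {\\field {\\*\\fldinst HYPERLINK "http://www.mywebsite.com/" }{\\fldrslt \n'
    "{\\b0 \\i0 \n"
    "Click Here\n"
    "}}} skip"
)
print(display_texts(sample))
```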
