Using regex to extract hyperlink text from Python string

Question

I am trying to extract the display text for each hyperlink in a giant string. (The string is obtained by opening and reading an .rtf file, and the file has many hyperlinks.) The hyperlinks are generally in the format {\field{\*\fldinst HYPERLINK "http://www.mywebsite.com/"}{\fldrslt Click Here}} (I want Click Here), but often contain a lot of nested formatting with newlines:

Example 1 (I want to extract Leonard T. Strand): text I don't want {\\field {\\*\\fldinst HYPERLINK "http://www.westlaw.com/Link/Document/FullText?findType=h&pubNum=176284&cite=0226771601&originatingDoc=I2e197170e0a011eaa13ca2bed92d37fc&refType=RQ&originationContext=document&vr=3.0&rs=cblt1.0&transitionType=DocumentItem&contextData=(sc.Search)" }{\\fldrslt \n{\\b0 \\cf5 \\f2 \\ul0 \\strike0 \\i0 \\fs20 \\sa0 \\sb0 \nLeonard T. Strand\n}}} text I don't want

Example 2 (I want to extract Morgan v. Robinson and 920 F.3d 521, 523 (8th Cir. 2019)): text I don't want {\\field {\\*\\fldinst HYPERLINK "http://www.westlaw.com/Link/Document/FullText?findType=Y&serNum=2047938005&pubNum=0000506&originatingDoc=I2e197170e0a011eaa13ca2bed92d37fc&refType=RP&fi=co_pp_sp_506_523&originationContext=document&vr=3.0&rs=cblt1.0&transitionType=DocumentItem&contextData=(sc.Search)#co_pp_sp_506_523" }{\\fldrslt \n{\\b0 \\cf5 \\f2 \\i1 \\fs20 \n{\\b0 \\cf5 \\f2 \\ul0 \\strike0 \\i1 \\fs20 \\sa0 \\sb0 \nMorgan v. Robinson\n}\n}\n{\\b0 \\cf5 \\f2 \\ul0 \\strike0 \\i0 \\fs20 \\sa0 \\sb0 \n, 920 F.3d 521, 523 (8th Cir. 2019)\n}}} text I don't want

This works for the first type but not for the second:

regex = re.compile('\n?\}?\n\{\\\\field.*\\\\fldrslt \n.*\n(.*)\n')

Ideally, I'd like something more generalizable that fits the broad structure of the hyperlink, but the multiple text locations in example 2 are giving me problems.

Why don't you match 'HYPERLINK\s+"(http.*?)"' and then use group 1?
– DisappointedByUnaccountableMod, Commented Nov 20, 2020 at 22:52

The fourth bird · Accepted Answer · 2020-11-20 23:33:33Z

Looking at the example data, you could use a specific match for the field and fldinst parts. After that, match the rest of the fldrslt line.

Then capture all the following lines in group 1 until you reach a line that starts with }}}

Then, from capture group 1, remove every line that starts with { or }, and strip a leading comma (with the spaces after it) from any line that starts with one.

Note that this is based on the example data, and does not take balanced curly brackets into account.
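
If the nesting depth varies, one alternative is to count braces instead of relying on a literal }}}. The following is only a minimal sketch and not part of the approach above. The names balanced_group and link_texts, and the example.com URL, are hypothetical. It assumes that line breaks inside the group carry no meaning, that the display text contains no escaped braces (\{ or \}), and that control words look like backslashes followed by letters and optional digits.

```python
import re

# Start of a result group; \\+ tolerates one or two literal backslashes.
FLDRSLT_RE = re.compile(r"\{\\+fldrslt")
# A control word such as \b0 or \fs20, plus the single space that delimits it.
CONTROL_WORD_RE = re.compile(r"\\+[a-zA-Z]+-?\d* ?")


def balanced_group(text, start):
    """Return the brace-balanced group that opens at text[start], or None."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None  # ran out of text before the group closed


def link_texts(rtf):
    """Collect the display text of every fldrslt group (hypothetical helper)."""
    results = []
    for m in FLDRSLT_RE.finditer(rtf):
        group = balanced_group(rtf, m.start())
        if group is None:
            continue
        body = CONTROL_WORD_RE.sub("", group)   # drop formatting control words
        body = re.sub(r"[{}\r\n]", "", body)    # drop braces and line breaks
        results.append(body.strip())
    return results


# Hypothetical sample with doubled backslashes, as in the example data.
sample = (
    'x {\\\\field {\\\\*\\\\fldinst HYPERLINK "http://example.com/" }{\\\\fldrslt \n'
    '{\\\\b0 \\\\i1 \n'
    '{\\\\b0 \\\\i1 \n'
    'Morgan v. Robinson\n'
    '}\n'
    '}\n'
    '{\\\\b0 \\\\i0 \n'
    ', 920 F.3d 521, 523 (8th Cir. 2019)\n'
    '}}} y'
)

print(link_texts(sample))
```

Because the scan stops where the depth returns to zero, it does not depend on how many closing braces end the hyperlink, and each link comes back as one string.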

Pattern to get group 1

{\\\\field\s*{\\\\\*\\\\fldinst HYPERLINK\s+"https?://[^"]+"\s+}{\\\\fldrslt.*\r?\n((?:(?!}}}).*\r?\n)*)}}}

About the pattern

{\\\\field\s*{\\\\\*\\\\fldinst HYPERLINK\s+"https?://[^"]+"\s+} Match the field and the HYPERLINK part
{\\\\fldrslt.*\r?\n Match the fldrslt part
( Capture group 1
- (?:(?!}}}).*\r?\n)* Repeat matching all lines that do not start with }}}
) Close group 1
}}} Match ending }}}
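
To see what group 1 holds before any cleanup, here is a small sketch that runs the same pattern on a shortened, made-up sample. The name LINK_RE and the example.com URL are placeholders, and the data uses doubled backslashes, as in the example data.

```python
import re

# Same pattern as above, split across two raw strings for readability.
LINK_RE = re.compile(
    r'{\\\\field\s*{\\\\\*\\\\fldinst HYPERLINK\s+"https?://[^"]+"\s+}'
    r'{\\\\fldrslt.*\r?\n((?:(?!}}}).*\r?\n)*)}}}'
)

# Hypothetical sample; each "\\\\" is two literal backslashes in the data.
sample = (
    'before {\\\\field {\\\\*\\\\fldinst HYPERLINK "http://example.com/" }{\\\\fldrslt \n'
    '{\\\\b0 \\\\i0 \n'
    'Click Here\n'
    '}}} after'
)

match = LINK_RE.search(sample)
group1 = match.group(1)

# Group 1 still contains the formatting line; only the cleanup step removes it.
print(repr(group1))
```

The captured text stops just before the line that starts with }}}, so it includes both the formatting line and the display text.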


Pattern to remove all unwanted lines from group 1

^(?:[{}].*[\r\n]*|,[^\S\r\n]*)

^ Start of a line (the substitution is run with re.MULTILINE)
(?: Non capture group
- [{}].*[\r\n]* Match a line that starts with { or }, including its trailing newlines
- | Or
- ,[^\S\r\n]* Match a , followed by optional whitespace chars without a newline
) Close group
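
As a quick check of the cleanup pattern on its own, the sketch below applies it to a hand-written string shaped like group 1. The name CLEANUP_RE and the shortened formatting lines are made up for illustration.

```python
import re

# re.MULTILINE makes ^ match at the start of every line, not only the string.
CLEANUP_RE = re.compile(r"^(?:[{}].*[\r\n]*|,[^\S\r\n]*)", re.MULTILINE)

# Hypothetical group-1 text with shortened formatting lines.
group1 = (
    "{\\\\b0 \\\\i1 \n"
    "Morgan v. Robinson\n"
    "}\n"
    "{\\\\b0 \\\\i0 \n"
    ", 920 F.3d 521, 523 (8th Cir. 2019)\n"
)

cleaned = CLEANUP_RE.sub("", group1)
lines = cleaned.splitlines()

# Brace lines are gone; the comma line keeps its text but loses the ", " prefix.
print(lines)
```

Only the comma at the start of a line is removed, so the commas inside the citation are left alone.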


Example code

import re
 
regex = r"{\\\\field\s*{\\\\\*\\\\fldinst HYPERLINK\s+\"https?://[^\"]+\"\s+}{\\\\fldrslt.*\r?\n((?:(?!}}}).*\r?\n)*)}}}"
 
test_str = ("text I don't want {\\\\field {\\\\*\\\\fldinst HYPERLINK \"http://w...content-available-to-author-only...w.com/Link/Document/FullText?findType=Y&serNum=2047938005&pubNum=0000506&originatingDoc=I2e197170e0a011eaa13ca2bed92d37fc&refType=RP&fi=co_pp_sp_506_523&originationContext=document&vr=3.0&rs=cblt1.0&transitionType=DocumentItem&contextData=(sc.Search)#co_pp_sp_506_523\" }{\\\\fldrslt \n"
            "{\\\\b0 \\\\cf5 \\\\f2 \\\\i1 \\\\fs20 \n"
            "{\\\\b0 \\\\cf5 \\\\f2 \\\\ul0 \\\\strike0 \\\\i1 \\\\fs20 \\\\sa0 \\\\sb0 \n"
            "Morgan v. Robinson\n"
            "}\n"
            "}\n"
            "{\\\\b0 \\\\cf5 \\\\f2 \\\\ul0 \\\\strike0 \\\\i0 \\\\fs20 \\\\sa0 \\\\sb0 \n"
            ", 920 F.3d 521, 523 (8th Cir. 2019)\n"
            "}}} text I don't want\n\n"
            "text I don't want {\\\\field {\\\\*\\\\fldinst HYPERLINK \"http://w...content-available-to-author-only...w.com/Link/Document/FullText?findType=h&pubNum=176284&cite=0226771601&originatingDoc=I2e197170e0a011eaa13ca2bed92d37fc&refType=RQ&originationContext=document&vr=3.0&rs=cblt1.0&transitionType=DocumentItem&contextData=(sc.Search)\" }{\\\\fldrslt \n"
            "{\\\\b0 \\\\cf5 \\\\f2 \\\\ul0 \\\\strike0 \\\\i0 \\\\fs20 \\\\sa0 \\\\sb0 \n"
            "Leonard T. Strand\n"
            "}}} text I don't want")
 
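# Each g is the group-1 text captured for one hyperlink.
for g in re.findall(regex, test_str):
    # Drop brace lines and a leading comma; flags is passed by keyword because
    # positional count/flags arguments to re.sub are deprecated since Python 3.13.
    print(re.sub(r"^(?:[{}].*[\r\n]*|,[^\S\r\n]*)", "", g, flags=re.MULTILINE))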

Output

Morgan v. Robinson
920 F.3d 521, 523 (8th Cir. 2019)

Leonard T. Strand

