python regex to split string at <a> elements and extract link + text

Question

Let's say I have several <a> elements in a string:

s = 'Hello world. <a href="https://stackoverflow.com/">StackOverflow</a> is a great website. <a href="https://www.espn.com/">ESPN</a> is another great website.'

The goal is to split the string so I get a list similar to the one below:

l = [
    "Hello world. ",
    {"link": "https://stackoverflow.com/", "title": "StackOverflow"},
    " is a great website. ",
    {"link": "https://www.espn.com/", "title": "ESPN"},
    " is another great website.",
]

The dictionaries can be any object I can extract the link and title from. Is there a regex I can use to accomplish this? Or is there a better way to do this?

You are not supposed to use regex on HTML. Use an HTML parser.
– Tomalak, Commented Jul 11, 2019 at 6:40

Andrej Kesely · Accepted Answer · 2019-07-11 06:41:21Z

6

BeautifulSoup is a better tool than regex for parsing this string. As a general rule, don't use regex to parse HTML:

from pprint import pprint

from bs4 import BeautifulSoup, NavigableString, Tag

s = 'Hello world. <a href="https://stackoverflow.com/">StackOverflow</a> is a great website. <a href="https://www.espn.com/">ESPN</a> is another great website.'

soup = BeautifulSoup(s, 'html.parser')

out = []

# Walk the top-level nodes of the parsed string in document order.
for c in soup.contents:
    if isinstance(c, NavigableString):
        # Plain text between the links
        out.append(c)
    elif isinstance(c, Tag) and c.name == 'a' and 'href' in c.attrs:
        # An <a> element that has an href attribute
        out.append({"link": c['href'], "title": c.text})

pprint(out)

Prints:

['Hello world. ',
 {'link': 'https://stackoverflow.com/', 'title': 'StackOverflow'},
 ' is a great website. ',
 {'link': 'https://www.espn.com/', 'title': 'ESPN'},
 ' is another great website.']
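
If installing BeautifulSoup is not an option, roughly the same result can be sketched with the standard library's html.parser module. This is a minimal sketch, not part of the original answer: the names LinkSplitter and split_links are hypothetical, and it assumes that <a> elements are not nested inside each other.

```python
from html.parser import HTMLParser


class LinkSplitter(HTMLParser):
    """Collect text runs and <a href> links in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []     # resulting list of strings and dicts
        self._href = None   # href of the <a> element currently open, if any
        self._title = []    # text pieces seen inside that <a> element

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href is not None:
            self._href = href
            self._title = []

    def handle_data(self, data):
        if self._href is not None:
            self._title.append(data)  # text inside the link
        else:
            self.parts.append(data)   # plain text between links

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.parts.append({"link": self._href, "title": "".join(self._title)})
            self._href = None


def split_links(html):
    parser = LinkSplitter()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.parts
```

Calling split_links(s) on the string from the question should give the same five-element list as the BeautifulSoup version, with plain str objects instead of NavigableString.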

answered Jul 11, 2019 at 6:41

Andrej Kesely


3 Comments

itajackass Over a year ago

Is it also possible to recognize links both with and without an <a href> tag? Like this one:

s = 'Hello world. <a href="https://stackoverflow.com/">StackOverflow</a> is a great website and https://www.espn.com is another great website'

Andrej Kesely Over a year ago

@itajackass An HTML parser only parses tags. If there aren't any tags (as in your example), other tools have to be used (for example, the re module).

itajackass Over a year ago

Thanks. I need to find a good regular expression for URLs. So far I haven't found one that works well.
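
For the bare-URL case raised in these comments, one possible approach is to run a regular expression over the plain-text pieces only (the strings left over after the tags have been parsed). The sketch below is an assumption-laden illustration, not something from the thread: split_bare_urls and URL_RE are hypothetical names, and the pattern is deliberately simple rather than a full URL grammar.

```python
import re

# Deliberately simple pattern (an assumption, not a complete URL grammar):
# "http://" or "https://" followed by characters that are not whitespace,
# quotes, or angle brackets.
URL_RE = re.compile(r'https?://[^\s<>"\']+')


def split_bare_urls(text):
    """Split plain text into strings and {"link", "title"} dicts for bare URLs."""
    parts = []
    pos = 0
    for m in URL_RE.finditer(text):
        if m.start() > pos:
            parts.append(text[pos:m.start()])  # text before this URL
        url = m.group(0)
        parts.append({"link": url, "title": url})  # a bare URL has no separate title
        pos = m.end()
    if pos < len(text):
        parts.append(text[pos:])  # text after the last URL
    return parts
```

Note that this pattern keeps trailing punctuation, so a URL that ends a sentence would include the final period. Stricter patterns trade that problem for others, which is probably why no single URL regex works well everywhere.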

ruohola · Accepted Answer · 2019-07-11 06:43:08Z

1

If you insist on using regex for this:

import re

s = 'Hello world. <a href="https://stackoverflow.com/">StackOverflow</a> is a great website. <a href="https://www.espn.com/">ESPN</a> is another great website.'
# Capture the link and title in a single pattern so each pair comes from the same <a> element.
sites = [{"link": link, "title": title} for link, title in re.findall(r'<a href="(.*?)">(.*?)</a>', s)]
print(sites)

Output:

[{'link': 'https://stackoverflow.com/', 'title': 'StackOverflow'}, {'link': 'https://www.espn.com/', 'title': 'ESPN'}]
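
The output above contains only the link dictionaries, while the question also asks for the text between the links. A regex-only way to get both is re.split with capturing groups, which keeps the captured link and title in the result list. This is a sketch under strong assumptions: every anchor must look exactly like <a href="...">text</a> (double-quoted href, no other attributes, no nested tags), and split_at_links is a hypothetical name.

```python
import re

# Assumes anchors of the exact form <a href="...">text</a>.
A_TAG_RE = re.compile(r'<a href="(.*?)">(.*?)</a>')


def split_at_links(s):
    # With two capturing groups, re.split returns
    # [text, link, title, text, link, title, ..., text].
    pieces = A_TAG_RE.split(s)
    parts = []
    for i in range(0, len(pieces), 3):
        if pieces[i]:
            parts.append(pieces[i])  # plain text (skipped when empty)
        if i + 2 < len(pieces):
            parts.append({"link": pieces[i + 1], "title": pieces[i + 2]})
    return parts
```

Anything that deviates from the assumed anchor format (single quotes, extra attributes, different attribute order) is silently treated as plain text, which is the usual argument for preferring an HTML parser.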

answered Jul 11, 2019 at 6:43

ruohola
