Let's say I have several <a> elements in a string:

s = 'Hello world. <a href="https://stackoverflow.com/">StackOverflow</a> is a great website. <a href="https://www.espn.com/">ESPN</a> is another great website.'

The goal is to split the string so I get a list similar to the one below:

l = [
    "Hello world. ",
    {"link": "https://stackoverflow.com/", "title": "StackOverflow"},
    " is a great website. ",
    {"link": "https://www.espn.com/", "title": "ESPN"},
    " is another great website.",
]

The dictionaries can be any object I can extract the link and title from. Is there a regex I can use to accomplish this? Or is there a better way to do this?

    You are not supposed to use regex on HTML. Use an HTML parser. Commented Jul 11, 2019 at 6:40

2 Answers

BeautifulSoup is a better tool than regex for parsing this string. As a general rule, don't use regex to parse HTML:

from pprint import pprint

from bs4 import BeautifulSoup, NavigableString, Tag

s = 'Hello world. <a href="https://stackoverflow.com/">StackOverflow</a> is a great website. <a href="https://www.espn.com/">ESPN</a> is another great website.'

soup = BeautifulSoup(s, 'html.parser')

out = []

# Walk the top-level nodes: plain text stays a string, <a href="..."> tags become dicts.
for c in soup.contents:
    if isinstance(c, NavigableString):
        out.append(str(c))
    elif isinstance(c, Tag) and c.name == 'a' and 'href' in c.attrs:
        out.append({"link": c['href'], "title": c.text})

pprint(out)

Prints:

['Hello world. ',
 {'link': 'https://stackoverflow.com/', 'title': 'StackOverflow'},
 ' is a great website. ',
 {'link': 'https://www.espn.com/', 'title': 'ESPN'},
 ' is another great website.']
3 Comments

Is it possible to also recognize links both with and without an <a href>? Like this one: s = 'Hello world. <a href="https://stackoverflow.com/">StackOverflow</a> is a great website and https://www.espn.com is another great website'
@itajackass An HTML parser only parses tags. If there aren't any tags (as in your example), other tools have to be used (for example, the re module).
Thanks. I need to find a good regex for URLs. So far I haven't found any regular expression that works well.
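
For the bare-URL case raised in these comments, one option is to run each plain-text piece through re.split with a capturing group. This is only a rough sketch: the function name split_bare_urls and the URL pattern are my own assumptions, and the pattern is deliberately simple (it will, for example, swallow a trailing period that directly follows a URL).

```python
import re

# Assumed, intentionally simple pattern: http:// or https:// followed by
# characters that are not whitespace, angle brackets, or double quotes.
URL_RE = re.compile(r'(https?://[^\s<>"]+)')


def split_bare_urls(text):
    """Split plain text into strings and {"link", "title"} dicts for bare URLs."""
    parts = []
    # With a capturing group, re.split keeps the matched URLs at odd indices.
    for i, piece in enumerate(URL_RE.split(text)):
        if i % 2 == 1:
            # A bare URL has no separate title, so the URL is reused as the title.
            parts.append({"link": piece, "title": piece})
        elif piece:
            parts.append(piece)
    return parts


print(split_bare_urls(' is a great website and https://www.espn.com is another great website'))
```

In the BeautifulSoup loop, this could be applied to each text node (for example with out.extend(...)) instead of appending the node unchanged.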
If you insist on using regex for this:

import re

s = 'Hello world. <a href="https://stackoverflow.com/">StackOverflow</a> is a great website. <a href="https://www.espn.com/">ESPN</a> is another great website.'
sites = [{"link": link, "title": title} for link, title in zip(re.findall(r'<a href="(.*?)">', s), re.findall(r'>(.*?)</a>', s))]
print(sites)

Output:

[{'link': 'https://stackoverflow.com/', 'title': 'StackOverflow'}, {'link': 'https://www.espn.com/', 'title': 'ESPN'}]
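
Note that the snippet above returns only the links, not the text between them. If you also want the interleaved list from the question, a sketch along these lines with re.finditer might work. It assumes every link looks exactly like <a href="...">text</a> (double quotes, href as the only attribute, no nested tags); the names A_RE and split_links are just illustrative.

```python
import re

s = 'Hello world. <a href="https://stackoverflow.com/">StackOverflow</a> is a great website. <a href="https://www.espn.com/">ESPN</a> is another great website.'

# Assumption: links always have the exact form <a href="...">text</a>.
A_RE = re.compile(r'<a href="(.*?)">(.*?)</a>')


def split_links(html):
    """Return text pieces and {"link", "title"} dicts in document order."""
    parts = []
    pos = 0
    for m in A_RE.finditer(html):
        # Keep any text between the previous match and this one.
        if m.start() > pos:
            parts.append(html[pos:m.start()])
        parts.append({"link": m.group(1), "title": m.group(2)})
        pos = m.end()
    # Keep whatever follows the last link.
    if pos < len(html):
        parts.append(html[pos:])
    return parts


result = split_links(s)
print(result)
```

Anything that deviates from that exact tag shape (single quotes, extra attributes, nested markup) will be missed, which is the usual argument for a parser.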
