How to find multiple strings between tag/sub-strings?

Question

I have a string which has defined tags around specific words or sub-strings. For example:

text = 'Bring me to <xxx>ibis and the</xxx> in <ccc>NW</ccc> and the <sss>Jan</sss> <hhh>10</hhh>'

How can I get the strings <xxx>ibis and the</xxx>, <ccc>NW</ccc>, <sss>Jan</sss> and <hhh>10</hhh>? The tag names can be anything, but the start and end tags around a word or a few words will match each other. Also, if a start or end tag is missing, I don't want that string to be returned. For example:

text = 'Bring me to <xxx>ibis and the in NW</ccc> and the <sss>Jan</sss> <hhh>10</hhh>'

In this case, only <sss>Jan</sss> and <hhh>10</hhh> should be returned.

Why is this tagged nsregularexpression? Are you running Python on iOS or something?
– Mast, Commented Jul 29, 2019 at 12:01

Andrej Kesely · Accepted Answer · 2019-07-31 08:23:31Z

Generally, you don't want to use regex to parse (X)HTML (more info in this answer). A better option is to use a parser. This example uses BeautifulSoup:

data = '''Bring me to <xxx>ibis and the</xxx> in <ccc>NW</ccc> and the <sss>Jan</sss>
<hhh>10</hhh>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

for tag in soup.select('xxx, ccc, sss, hhh'):
    print(tag.get_text(strip=True))

Prints:

ibis and the
NW
Jan
10

EDIT: To get the whole tag string:

for tag in soup.select('xxx, ccc, sss, hhh'):
    print(tag)

Prints:

<xxx>ibis and the</xxx>
<ccc>NW</ccc>
<sss>Jan</sss>
<hhh>10</hhh>

EDIT II: If you have a list of tags to find:

list_of_tags = ['xxx', 'ccc', 'sss', 'hhh']
for tag in soup.find_all(list_of_tags):
    print(tag)
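Since the question says the tag names can be anything, here is a minimal sketch that does not need the names up front. It relies on `find_all(True)`, which matches every tag regardless of name. The variable `found` is only an illustrative name, and the input is assumed to be well formed, as in the first example.

```python
from bs4 import BeautifulSoup

data = '''Bring me to <xxx>ibis and the</xxx> in <ccc>NW</ccc> and the <sss>Jan</sss>
<hhh>10</hhh>'''

soup = BeautifulSoup(data, 'html.parser')

# Passing True matches every tag in the document, whatever its name
found = [str(tag) for tag in soup.find_all(True)]

for item in found:
    print(item)
```

This should print the same four tag strings as above, without hard-coding `xxx`, `ccc`, `sss` or `hhh`.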

EDIT III: In case of malformed HTML, it's necessary to change the parser (here to lxml, which has to be installed separately):

data = '''Bring me to <xxx>ibis and the in NW</ccc> and the <sss>Jan</sss>
<hhh>10</hhh>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

list_of_tags = ['xxx', 'ccc', 'sss', 'hhh']
for tag in soup.find_all(list_of_tags):
    # Skip a tag that contains other listed tags: an unclosed tag swallows the ones after it
    if tag.find_all(list_of_tags):
        continue
    print(tag)

Prints:

<sss>Jan</sss>
<hhh>10</hhh>
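If installing a parser is not an option, a regex with a backreference may be enough for flat input like this. This is a different technique from the parser approach above, not a replacement for it, and it is only a sketch under two assumptions: tag names consist of word characters, and the tagged text never contains another tag. The names `TAG_PAIR` and `matched_tags` are made up for illustration.

```python
import re

# <name>, then text without '<' or '>', then the closing tag with the SAME name (\1).
# Assumption: no nested tags and no attributes inside the opening tag.
TAG_PAIR = re.compile(r'<(\w+)>[^<>]*</\1>')

def matched_tags(text):
    """Return each '<name>...</name>' span whose closing tag matches its opening tag."""
    return [m.group(0) for m in TAG_PAIR.finditer(text)]

good = ('Bring me to <xxx>ibis and the</xxx> in <ccc>NW</ccc> '
        'and the <sss>Jan</sss> <hhh>10</hhh>')
bad = ('Bring me to <xxx>ibis and the in NW</ccc> '
       'and the <sss>Jan</sss> <hhh>10</hhh>')

print(matched_tags(good))
print(matched_tags(bad))
```

For `bad`, the unclosed `<xxx>` and the stray `</ccc>` find no partner, so only the `sss` and `hhh` spans are returned. For anything closer to real HTML (nesting, attributes, comments), the parser approach is still the safer choice.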

edited Jul 31, 2019 at 8:23

answered Jul 26, 2019 at 11:48

5 Comments

idkman Over a year ago

Your answer is correct. But I am sorry I should have asked my question in another way. I edited it now, could you please check it?

idkman Over a year ago

How do I give the names of the tags? When I pass them as a list, I get TypeError: unhashable type: 'list'.

Andrej Kesely Over a year ago

@Dennis.M Use the find_all() method; see my answer. The documentation for BeautifulSoup can be found at crummy.com/software/BeautifulSoup/bs4/doc

idkman Over a year ago

How can I avoid parsing the string if one of the tags is missing?

Andrej Kesely Over a year ago

@Dennis.M What do you mean? Do you want to return all tags or nothing if just one is missing?
