
I have a string which has defined tags around specific words or sub-strings. For example:

text = 'Bring me to <xxx>ibis and the</xxx> in <ccc>NW</ccc> and the <sss>Jan</sss> <hhh>10</hhh>'

How can I get the strings <xxx>ibis and the</xxx>, <ccc>NW</ccc>, <sss>Jan</sss> and <hhh>10</hhh>? The tag names can be anything, but the start and end tags around a word or a few words will have the same name. Also, if a start or end tag is missing, I don't want that string to be returned. For example:

text = 'Bring me to <xxx>ibis and the in NW</ccc> and the <sss>Jan</sss> <hhh>10</hhh>'

In this case, only <sss>Jan</sss> and <hhh>10</hhh> should be returned.

  • Why is this tagged nsregularexpression? Are you running Python on iOS or something? Commented Jul 29, 2019 at 12:01
  • @Mast Corrected! Commented Jul 29, 2019 at 12:09

1 Answer


Generally, you don't want to use regex to parse (X)HTML (more info in this answer). A better option is to use a parser. This example uses BeautifulSoup:

data = '''Bring me to <xxx>ibis and the</xxx> in <ccc>NW</ccc> and the <sss>Jan</sss>
<hhh>10</hhh>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

for tag in soup.select('xxx, ccc, sss, hhh'):
    print(tag.get_text(strip=True))

Prints:

ibis and the
NW
Jan
10

EDIT: To get the whole tag string:

for tag in soup.select('xxx, ccc, sss, hhh'):
    print(tag)

Prints:

<xxx>ibis and the</xxx>
<ccc>NW</ccc>
<sss>Jan</sss>
<hhh>10</hhh>

EDIT II: If you have a list of tags to find:

list_of_tags = ['xxx', 'ccc', 'sss', 'hhh']
for tag in soup.find_all(list_of_tags):
    print(tag)

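EDIT III: In the case of malformed HTML (a missing start or end tag), it's necessary to change the parser, and then skip any tag that contains another tag from the list: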

data = '''Bring me to <xxx>ibis and the in NW</ccc> and the <sss>Jan</sss>
<hhh>10</hhh>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

list_of_tags = ['xxx', 'ccc', 'sss', 'hhh']
for tag in soup.find_all(list_of_tags):
    if tag.find_all(list_of_tags):
        continue
    print(tag)

Prints:

<sss>Jan</sss>
<hhh>10</hhh>
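
If installing bs4 and lxml is not an option, the same idea (let a parser find the tags, then keep only the clean pairs) can be roughly sketched with the standard library's html.parser module. This is only a sketch under some assumptions: the tags carry no attributes, tags are never nested, and a start tag only counts if the very next tag is its own end tag. The names PairedTagCollector and paired_tags are made up for this example; they are not part of any library.

```python
from html.parser import HTMLParser


class PairedTagCollector(HTMLParser):
    """Collect '<tag>text</tag>' spans whose start and end tags match
    and that contain only text (hypothetical helper for this sketch)."""

    def __init__(self):
        super().__init__()
        self.open_tag = None  # name of the last start tag, while still a candidate
        self.text = []        # text pieces seen since that start tag
        self.found = []       # completed '<tag>text</tag>' strings

    def handle_starttag(self, tag, attrs):
        # A new start tag discards any unfinished candidate before it.
        # Attributes are ignored (assumption: the tags have none).
        self.open_tag = tag
        self.text = []

    def handle_data(self, data):
        # Only keep text that sits inside a candidate tag.
        if self.open_tag is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        # Keep the span only if this end tag closes the last start tag.
        if tag == self.open_tag:
            self.found.append('<{0}>{1}</{0}>'.format(tag, ''.join(self.text)))
        # Matched or not, the current candidate is finished.
        self.open_tag = None
        self.text = []


def paired_tags(markup):
    """Return all well-paired tag strings found in markup, in order."""
    parser = PairedTagCollector()
    parser.feed(markup)
    parser.close()
    return parser.found


data = '''Bring me to <xxx>ibis and the in NW</ccc> and the <sss>Jan</sss>
<hhh>10</hhh>'''

for tag_string in paired_tags(data):
    print(tag_string)
```

Unlike the find_all() versions, this does not need a list of tag names up front, which may help if the tags really can be anything. Note that html.parser lowercases tag names, so a mixed-case tag would come back in lowercase.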

5 Comments

Your answer is correct, but I'm sorry, I should have asked my question differently. I have edited it now; could you please check it?
How do I pass the tag names? When I give them as a list, I get TypeError: unhashable type: 'list'.
@Dennis.M Use the find_all() method; see my answer. The documentation for BeautifulSoup can be found here: crummy.com/software/BeautifulSoup/bs4/doc
How can I avoid parsing the string if one of the tags is missing?
@Dennis.M What do you mean? Do you want to return all tags or nothing if just one is missing?
