Strip HTML tags to get strings in Python

Question

I'm trying to extract strings from an HTML file with BeautifulSoup, but every time I work with it I only get partial results.

I want the strings inside every li element. So far I've been able to get everything in the ul like this:

#!/usr/bin/python
from bs4 import BeautifulSoup
with open("page.html") as page:
    soup = BeautifulSoup(page, "html.parser")
source = soup.select(".sidebar li")

And what I get is this:

[<li class="first">
        Def Leppard -  Make Love Like A Man<span>Live</span> </li>, <li>
        Inxs - Never Tear Us Apart        </li>, <li>
        Gary Moore - Over The Hills And Far Away        </li>, <li>
        Linkin Park -  Numb        </li>, <li>
        Vita De Vie -  Basul Si Cu Toba Mare        </li>, <li>
        Nazareth - Love Hurts        </li>, <li>
        U2 - I Still Haven't Found What I'm L        </li>, <li>
        Blink 182 -  All The Small Things        </li>, <li>
        Scorpions -  Wind Of Change        </li>, <li>
        Iggy Pop - The Passenger        </li>]

I want to get only the strings from this.

Have you solved the issue? Did any of the answers help? If so, choose one and accept it. Thanks.
– alecxe, commented Sep 20, 2014 at 1:48

Craicerjack · Accepted Answer · 2014-04-07 13:53:35Z

2

Use Beautiful Soup's .strings generator, or .stripped_strings to drop the surrounding whitespace:

for string in soup.stripped_strings:
    print(repr(string))

From the docs:

If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:
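A minimal sketch of what .strings yields, assuming a small made-up snippet shaped like one item from the question (the html variable and the choice of the built-in html.parser are illustrative assumptions, not the asker's actual page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the first <li> in the question.
html = """
<ul class="sidebar">
    <li class="first">
        Def Leppard - Make Love Like A Man<span>Live</span> </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# .strings yields every text node under the tag, whitespace included.
raw_strings = list(soup.li.strings)
for s in raw_strings:
    print(repr(s))
```

Here the li contributes three text nodes: the title with its leading newline and indentation, "Live" from the nested span, and a trailing space.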

or

These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead:
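Applied to the question, this could look roughly like the sketch below. It assumes the items sit under an element with class sidebar, as in the asker's selector; the markup itself is a shortened, hypothetical stand-in for page.html.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the .sidebar class and <li> layout follow the question.
html = """
<ul class="sidebar">
    <li class="first">
        Def Leppard - Make Love Like A Man<span>Live</span> </li>
    <li>
        Inxs - Never Tear Us Apart        </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# One cleaned string per <li>: strip each text node, then join them with a space.
titles = [" ".join(li.stripped_strings) for li in soup.select(".sidebar li")]
print(titles)
```

Joining with a space keeps the text of a nested tag (here the span) from running into the text before it.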

answered Apr 7, 2014 at 13:53

Craicerjack


alecxe · Accepted Answer · 2014-04-07 14:56:22Z

1

Iterate over the results and read each element's text attribute:

for element in soup.select(".sidebar li"):
    print(element.text)

Example:

from bs4 import BeautifulSoup


data = """
<body>
    <ul>
        <li class="first">Def Leppard -  Make Love Like A Man<span>Live</span> </li>
        <li>Inxs - Never Tear Us Apart        </li>
    </ul>
</body>
"""

soup = BeautifulSoup(data, "html.parser")
for element in soup.select('li'):
    print(element.text)

prints:

Def Leppard -  Make Love Like A ManLive 
Inxs - Never Tear Us Apart
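The output above still carries trailing whitespace, and "Live" runs straight into the preceding word. If that matters, get_text() accepts a separator and a strip flag. A small sketch under the same assumptions as the example (made-up data, built-in html.parser):

```python
from bs4 import BeautifulSoup

data = """
<ul>
    <li class="first">Def Leppard - Make Love Like A Man<span>Live</span> </li>
    <li>Inxs - Never Tear Us Apart        </li>
</ul>
"""
soup = BeautifulSoup(data, "html.parser")

# The separator " " goes between text nodes; strip=True trims each node first.
cleaned = [li.get_text(" ", strip=True) for li in soup.select("li")]
print(cleaned)
```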

edited Apr 7, 2014 at 14:56

answered Apr 7, 2014 at 13:50

alecxe


2 Comments

cbomb Over a year ago

This works pretty fine, but on the first line I also have <span>Live</span> which I'd like to get rid of.

alecxe Over a year ago

@cbomb .text handles nested tags: it extracts the text from all of them. See the example I've provided. Hope it helps.
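Note that .text keeps the span's text ("Live") in the result. If the goal is to leave it out entirely, one possible approach is to keep only the text nodes that are direct children of each li. This is a sketch of that idea, not part of the answer above; own_text is a hypothetical helper name.

```python
from bs4 import BeautifulSoup, NavigableString

data = '<ul><li class="first">Def Leppard - Make Love Like A Man<span>Live</span> </li></ul>'
soup = BeautifulSoup(data, "html.parser")


def own_text(tag):
    """Join only the tag's direct text nodes, skipping nested tags such as <span>."""
    parts = [child for child in tag.children if isinstance(child, NavigableString)]
    return "".join(parts).strip()


print(own_text(soup.li))
```

An alternative with the same effect would be to remove the span from the tree before reading the text, at the cost of modifying the parsed document.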

Andy · Accepted Answer · 2014-04-07 13:55:08Z

0

This example from the documentation gives a very nice one-liner:

''.join(BeautifulSoup(source, "html.parser").find_all(string=True))
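To see what the one-liner produces, here is a small sketch. It assumes source is an HTML string (not the list returned by select() in the question) and uses made-up markup with the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical input: `source` is raw HTML text here.
source = "<ul><li>Nazareth - Love Hurts</li><li>Iggy Pop - The Passenger</li></ul>"
soup = BeautifulSoup(source, "html.parser")

# Collect every text node in the document and concatenate them.
joined = "".join(soup.find_all(string=True))
print(joined)
```

Because nothing is inserted between the text nodes, adjacent items run together, so this fits best when a single blob of text is acceptable.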

answered Apr 7, 2014 at 13:55

Andy♦

