I am trying to extract some strings from an HTML file with BeautifulSoup, but every time I work with it I get only partial results.

I want to get the string inside every li element. So far I have been able to get all the li tags inside the ul like this:

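#!/usr/bin/python
from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed
with open("page.html") as page:
    soup = BeautifulSoup(page, "html.parser")
# Every <li> inside the element with class "sidebar"
source = soup.select(".sidebar li")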

And what I get is this:

[<li class="first">
        Def Leppard -  Make Love Like A Man<span>Live</span> </li>, <li>
        Inxs - Never Tear Us Apart        </li>, <li>
        Gary Moore - Over The Hills And Far Away        </li>, <li>
        Linkin Park -  Numb        </li>, <li>
        Vita De Vie -  Basul Si Cu Toba Mare        </li>, <li>
        Nazareth - Love Hurts        </li>, <li>
        U2 - I Still Haven't Found What I'm L        </li>, <li>
        Blink 182 -  All The Small Things        </li>, <li>
        Scorpions -  Wind Of Change        </li>, <li>
        Iggy Pop - The Passenger        </li>]

I want to get only the strings from this.

1
  • Have you solved the issue? Did any of the answer help? If yes, choose one and accept. Thanks. Commented Sep 20, 2014 at 1:48

3 Answers 3

2

Use Beautiful Soup's .strings generator, or .stripped_strings to drop the surrounding whitespace:

for string in soup.stripped_strings:
    print(repr(string))

from the docs:

If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:

or

These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead:
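
Note that soup.stripped_strings walks the whole document, not just the list items. A minimal sketch of limiting it to the li elements from the question follows; the markup in html is a shortened stand-in for page.html, and html.parser is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the asker's page.html (assumption)
html = """
<div class="sidebar">
  <ul>
    <li class="first">
        Def Leppard -  Make Love Like A Man<span>Live</span> </li>
    <li>
        Inxs - Never Tear Us Apart        </li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

titles = []
for li in soup.select(".sidebar li"):
    # stripped_strings yields each text fragment inside this <li>,
    # trimmed, and skips fragments that are only whitespace
    parts = list(li.stripped_strings)
    titles.append(" ".join(parts))

for title in titles:
    print(title)
```

Each entry in titles is one list item with its fragments joined by a single space, so the nested span text is still included.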


1

Iterate over the results and read the text attribute of each element:

for element in soup.select(".sidebar li"):
    print(element.text)

Example:

from bs4 import BeautifulSoup


data = """
<body>
    <ul>
        <li class="first">Def Leppard -  Make Love Like A Man<span>Live</span> </li>
        <li>Inxs - Never Tear Us Apart        </li>
    </ul>
</body>
"""

soup = BeautifulSoup(data, "html.parser")
for element in soup.select('li'):
    print(element.text)

prints:

Def Leppard -  Make Love Like A ManLive 
Inxs - Never Tear Us Apart        
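
The text attribute includes the text of nested tags, which is why "Live" appears in the first line. If that is not wanted, one possible approach (a sketch, not the only way) is to keep only the strings that are direct children of each li. The variable name own_text is made up for this example.

```python
from bs4 import BeautifulSoup, NavigableString

data = """
<body>
    <ul>
        <li class="first">Def Leppard -  Make Love Like A Man<span>Live</span> </li>
        <li>Inxs - Never Tear Us Apart        </li>
    </ul>
</body>
"""

soup = BeautifulSoup(data, "html.parser")

own_text = []
for li in soup.select("li"):
    # li.children yields direct children only: plain strings and nested tags.
    # Keeping just the strings drops the text inside <span>.
    direct = [child for child in li.children if isinstance(child, NavigableString)]
    own_text.append("".join(direct).strip())

for line in own_text:
    print(line)
```

An alternative would be to call decompose() on each unwanted span before reading text, which modifies the parsed tree instead of filtering it.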

2 Comments

This works pretty well, but the first line also contains <span>Live</span>, which I'd like to get rid of.
@cbomb .text handles this by extracting the text from all nested tags; see the example I've provided. Hope it helps.
0

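This example from the documentation gives a very nice one-liner. Here source has to be the HTML markup itself (a string or an open file), not the list of tags returned by select() in the question.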

''.join(BeautifulSoup(source).findAll(text=True))
