The page I'm scraping contains many titles, and I need to identify them to set a value in my database. The problem is that those titles don't have a specific ID or class.

They follow this pattern:

<p ALIGN="CENTER"><font face="Arial" SIZE="2">
<a name="tituloivcapituloisecaoii"></a><b>
<span style="text-transform: uppercase">Seção II<br>
DAS ATRIBUIÇÕES DO CONGRESSO NACIONAL</span></b></font></p>


<p ALIGN="CENTER"><font face="Arial" SIZE="2"><a name="tituloivcapituloisecaoiii"></a>
<b><span style="text-transform: uppercase">Seção III<br>
DA CÂMARA DOS DEPUTADOS</span></b></font></p>

One attribute that identifies them is the style text-transform: uppercase on the <span>.

How can I check whether a <p> contains a title?

This is my current code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(f, 'html.parser')
# Remove every <a> and <strike> tag before processing
for tag in soup.find_all(['a', 'strike']):
    tag.decompose()

for p in soup.find_all('p'):
    print(p)

1 Answer

Once you have collected the tags by type, you can filter them by any distinguishing attribute. In this case the style text-transform: uppercase on the <span> can be used.

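soup = BeautifulSoup(f, 'html.parser')
for p in soup.find_all("p"):
    span = p.span  # first <span> inside the <p>, or None if there is none
    # .get() avoids a KeyError when the <span> has no style attribute
    if span is not None and span.get("style") == "text-transform: uppercase":
        title = p.text
        print(title)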

>>>Seção IIDAS ATRIBUIÇÕES DO CONGRESSO NACIONAL

This finds every <p> tag whose first <span> has a style exactly equal to "text-transform: uppercase" and prints the paragraph's text.
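The exact string comparison above breaks if the style is written with different spacing or carries extra declarations. As a minimal sketch of a more tolerant check, assuming the markup looks like the sample in the question, a regular expression can be passed as the style filter. The names HTML, UPPERCASE_STYLE, is_title and extract_titles are hypothetical helpers introduced for illustration, and the last two <p> elements in the sample are made up to show non-matching cases.

```python
import re
from bs4 import BeautifulSoup

# First <p> is copied from the question; the other two are invented
# counterexamples (no <span>, and a <span> with an unrelated style).
HTML = """
<p ALIGN="CENTER"><font face="Arial" SIZE="2">
<a name="tituloivcapituloisecaoii"></a><b>
<span style="text-transform: uppercase">Seção II<br>
DAS ATRIBUIÇÕES DO CONGRESSO NACIONAL</span></b></font></p>
<p>Plain paragraph without a span.</p>
<p><span style="color: red">A span with an unrelated style.</span></p>
"""

# Tolerates any spacing around the colon and other declarations
# in the same style attribute.
UPPERCASE_STYLE = re.compile(r"text-transform\s*:\s*uppercase", re.IGNORECASE)


def is_title(p):
    """Return True if the <p> contains any <span> styled as uppercase."""
    return p.find("span", style=UPPERCASE_STYLE) is not None


def extract_titles(html):
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for p in soup.find_all("p"):
        if is_title(p):
            # Join the text fragments with single spaces, dropping the
            # line breaks around the <br>.
            titles.append(p.get_text(" ", strip=True))
    return titles


titles = extract_titles(HTML)
print(titles)
```

Because find() returns None when nothing matches, paragraphs without a <span>, or with a <span> that has no style attribute, are skipped without raising an error.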

6 Comments

It didn't work. I edited my question to include my current code so you can check whether there's a problem. When I followed your suggestion, nothing was returned.
Maybe that's happening because text-transform is an attribute of the span.
OK, you are right. Change that to if p.span["style"]=="text-transform: uppercase":. I'll update the answer as well.
I just ran a test on the strings you provided and it works. This is a different problem. If you are getting that error, it means nothing in the rest of your code handles <p> tags that don't contain a <span> tag. The code above fixes your current problem, but when you incorporate it into code that searches an actual page, you need to account for the fact that not every <p> will have a <span>. If you add if p.span is not None: as the first line inside your for loop, it will filter out the None values.
Awesome! Good luck on the scraper!