Python BeautifulSoup: Extracting attribute data from an HTML file

Question

I am trying to extract some attribute information from an HTML file using BeautifulSoup. Below are the sample HTML and the code I have tried.

<div id="rp_NaNnetSales" class="finsummary_nlptext add2Margin account nlpmain" style="display: inline;">
  <div class="add2Margin account nlpremark"><br><br>
    <div>Segment revenue and results</div>
    <div></div>
  </div>
  <div class="add2Margin account nlpremark">This is my my revenue&nbsp;</div>
  <div class="add2Margin account nlpremark">As a result, the Group turned in a respectable revenue of S$3,484.6 million for the financial year ended 31 December 2018 (' FY 2018'). &nbsp; Although FY 2018 revenue was 13.0% lower year- on- year, Venture attained a compounded annual growth rate
    of 8.4% over the period from FY 2013 to FY 2018. ---- P11
  </div>
</div>
<div id="rp_grossProfit" class="add2Margin account rationmain"><span class="ratio_name "><b>Gross Profit</b> increased by 191.3% to  SGD 2,625,295.0 mil in FY18 (FY17: SGD 901,244.0 mil)</span>
</div>
<div id="rp_NaNgrossProfit" class="finsummary_nlptext add2Margin account nlpmain" style="display: inline;"></div>
<div id="rp_grossProfitMarginPercentage" class="add2Margin account rationmain"><span class="ratio_name "><b>GP margin</b> was stable at  100.0%  in FY18 (FY17:  100.0% )</span>
</div>

I want to extract all the text along with its "id" value (preferably as a DataFrame), but I am unable to capture the "id". Below is the code I tried:

with open(html_file_location, 'r') as f:
    contents = f.read()
    soup1 = BeautifulSoup(contents, features='lxml')

for child1 in soup1.recursiveChildGenerator():
    if child1.name == "div":
        for tag in child1.find_all("div"):
            print(f'{tag.name}: {tag.text}')
            print(f'{tag.name}: {tag.id}')

"tag.id" is incorrect but I am not sure how to correct it.

readyplayer77 · Accepted Answer · 2021-04-05 16:22:18Z

Two issues in your code:

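The div elements you're querying do not have any id attributes. The inner loop, child1.find_all("div"), only visits divs nested inside another div (here, the children of the first div element), and none of those carries an id.
Use .get("id") to read the id attribute. tag.id is interpreted as tag.find('id'), which searches for a child <id> tag and therefore returns None.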
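
To illustrate the second point, here is a minimal sketch. The one-line snippet is made up for illustration, and the built-in "html.parser" is used instead of lxml only to avoid the extra dependency:

```python
from bs4 import BeautifulSoup

# Hypothetical one-line snippet, for illustration only.
snippet = '<div id="rp_grossProfit" class="add2Margin account"><span>Gross Profit</span></div>'
tag = BeautifulSoup(snippet, "html.parser").div

attr_shortcut = tag.id        # same as tag.find("id"): looks for a child <id> tag
attr_get = tag.get("id")      # attribute lookup; None if the attribute is missing
attr_index = tag["id"]        # attribute lookup; raises KeyError if missing
missing = tag.span.get("id")  # the <span> has no id attribute

print(attr_shortcut)  # None
print(attr_get)       # rp_grossProfit
print(attr_index)     # rp_grossProfit
print(missing)        # None
```

.get("id") is usually the safer choice when some tags may lack the attribute, since tag["id"] raises KeyError in that case.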

Here's a working example:

from bs4 import BeautifulSoup

html = '''
<div id="rp_NaNnetSales" class="finsummary_nlptext add2Margin account nlpmain" style="display: inline;">
  <div class="add2Margin account nlpremark"><br><br>
    <div>Segment revenue and results</div>
    <div></div>
  </div>
  <div class="add2Margin account nlpremark">This is my my revenue&nbsp;</div>
  <div class="add2Margin account nlpremark">As a result, the Group turned in a respectable revenue of S$3,484.6 million for the financial year ended 31 December 2018 (' FY 2018'). &nbsp; Although FY 2018 revenue was 13.0% lower year- on- year, Venture attained a compounded annual growth rate
    of 8.4% over the period from FY 2013 to FY 2018. ---- P11
  </div>
</div>
<div id="rp_grossProfit" class="add2Margin account rationmain"><span class="ratio_name "><b>Gross Profit</b> increased by 191.3% to  SGD 2,625,295.0 mil in FY18 (FY17: SGD 901,244.0 mil)</span>
</div>
<div id="rp_NaNgrossProfit" class="finsummary_nlptext add2Margin account nlpmain" style="display: inline;"></div>
<div id="rp_grossProfitMarginPercentage" class="add2Margin account rationmain"><span class="ratio_name "><b>GP margin</b> was stable at  100.0%  in FY18 (FY17:  100.0% )</span>
</div>
'''

soup1 = BeautifulSoup(html, 'lxml')

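# Walk every node in the tree and print the id attribute of each <div>.
# .descendants is the current name for recursiveChildGenerator().
for child1 in soup1.descendants:
    if child1.name == "div":
        # .get("id") returns None when the div has no id attribute
        print(f'{child1.name}: {child1.get("id")}')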

Output:

div: rp_NaNnetSales
div: None
div: None
div: None
div: None
div: None
div: rp_grossProfit
div: rp_NaNgrossProfit
div: rp_grossProfitMarginPercentage
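
Since the question asks for the text together with its id in a DataFrame, one possible way to get there is sketched below. This is not part of the code above: it assumes pandas is installed, uses a shortened copy of the sample HTML, and the names rows and df are illustrative. Passing id=True to find_all keeps only the divs that actually have an id attribute.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Shortened copy of the sample HTML from the question.
html = '''
<div id="rp_NaNnetSales" class="finsummary_nlptext add2Margin account nlpmain">
  <div class="add2Margin account nlpremark">This is my my revenue&nbsp;</div>
</div>
<div id="rp_grossProfit" class="add2Margin account rationmain"><span class="ratio_name "><b>Gross Profit</b> increased by 191.3% to  SGD 2,625,295.0 mil in FY18 (FY17: SGD 901,244.0 mil)</span>
</div>
<div id="rp_NaNgrossProfit" class="finsummary_nlptext add2Margin account nlpmain"></div>
'''

# "html.parser" is used here only to avoid the lxml dependency.
soup = BeautifulSoup(html, "html.parser")

rows = []
for div in soup.find_all("div", id=True):  # only divs that have an id attribute
    rows.append({
        "id": div.get("id"),
        # Join all nested text fragments with a space and strip whitespace.
        "text": div.get_text(" ", strip=True),
    })

df = pd.DataFrame(rows, columns=["id", "text"])
print(df)
```

Each row holds one id-bearing div and the combined text of everything nested inside it. A div with no content, such as rp_NaNgrossProfit, yields an empty string.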
