I am trying to extract some attribute information from an HTML file using BeautifulSoup. Below are a sample of the HTML and the code I have tried.

<div id="rp_NaNnetSales" class="finsummary_nlptext add2Margin account nlpmain" style="display: inline;">
  <div class="add2Margin account nlpremark"><br><br>
    <div>Segment revenue and results</div>
    <div></div>
  </div>
  <div class="add2Margin account nlpremark">This is my my revenue&nbsp;</div>
  <div class="add2Margin account nlpremark">As a result, the Group turned in a respectable revenue of S$3,484.6 million for the financial year ended 31 December 2018 (' FY 2018'). &nbsp; Although FY 2018 revenue was 13.0% lower year- on- year, Venture attained a compounded annual growth rate
    of 8.4% over the period from FY 2013 to FY 2018. ---- P11
  </div>
</div>
<div id="rp_grossProfit" class="add2Margin account rationmain"><span class="ratio_name "><b>Gross Profit</b> increased by 191.3% to  SGD 2,625,295.0 mil in FY18 (FY17: SGD 901,244.0 mil)</span>
</div>
<div id="rp_NaNgrossProfit" class="finsummary_nlptext add2Margin account nlpmain" style="display: inline;"></div>
<div id="rp_grossProfitMarginPercentage" class="add2Margin account rationmain"><span class="ratio_name "><b>GP margin</b> was stable at  100.0%  in FY18 (FY17:  100.0% )</span>
</div>
I want to extract all the text along with its "id" (preferably as a DataFrame), but I am unable to capture the "id". Below is the code I tried:

from bs4 import BeautifulSoup

with open(html_file_location, 'r') as f:
    contents = f.read()
soup1 = BeautifulSoup(contents, features='lxml')

for child1 in soup1.recursiveChildGenerator():
    if child1.name == "div":
        for tag in child1.find_all("div"):
            print(f'{tag.name}: {tag.text}')
            print(f'{tag.name}: {tag.id}')

"tag.id" is incorrect but I am not sure how to correct it.

1 Answer

Two issues in your code:

  1. The inner loop, child1.find_all("div"), only visits div elements nested inside another div. In your sample those are the descendants of the first div, and none of them has an id attribute.
  2. Use .get("id") to read the id attribute. tag.id is shorthand for tag.find("id"), which searches for a child <id> tag and returns None here.
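
The difference between the two access styles can be seen on a tiny document. This is only a minimal sketch: the two-div snippet and the variable names are made up for illustration, and it uses the built-in html.parser rather than lxml so that it has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical two-div document, not the HTML from the question.
soup = BeautifulSoup('<div id="outer"><div>inner</div></div>', "html.parser")
outer = soup.find("div")
inner = outer.find("div")

shorthand = outer.id            # same as outer.find("id"): looks for an <id> tag
outer_id = outer.get("id")      # reads the id attribute
inner_id = inner.get("id")      # attribute is missing, so .get() returns None
has_id = inner.has_attr("id")   # explicit check for the attribute

print(shorthand, outer_id, inner_id, has_id)
```

Note that outer["id"] also works, but it raises KeyError when the attribute is missing, so .get("id") is the safer choice while iterating over mixed elements.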

Here's a working example:

from bs4 import BeautifulSoup

html = '''
<div id="rp_NaNnetSales" class="finsummary_nlptext add2Margin account nlpmain" style="display: inline;">
  <div class="add2Margin account nlpremark"><br><br>
    <div>Segment revenue and results</div>
    <div></div>
  </div>
  <div class="add2Margin account nlpremark">This is my my revenue&nbsp;</div>
  <div class="add2Margin account nlpremark">As a result, the Group turned in a respectable revenue of S$3,484.6 million for the financial year ended 31 December 2018 (' FY 2018'). &nbsp; Although FY 2018 revenue was 13.0% lower year- on- year, Venture attained a compounded annual growth rate
    of 8.4% over the period from FY 2013 to FY 2018. ---- P11
  </div>
</div>
<div id="rp_grossProfit" class="add2Margin account rationmain"><span class="ratio_name "><b>Gross Profit</b> increased by 191.3% to  SGD 2,625,295.0 mil in FY18 (FY17: SGD 901,244.0 mil)</span>
</div>
<div id="rp_NaNgrossProfit" class="finsummary_nlptext add2Margin account nlpmain" style="display: inline;"></div>
<div id="rp_grossProfitMarginPercentage" class="add2Margin account rationmain"><span class="ratio_name "><b>GP margin</b> was stable at  100.0%  in FY18 (FY17:  100.0% )</span>
</div>
'''

soup1 = BeautifulSoup(html, 'lxml')

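# Walk every node in the tree and report the id of each div.
for child1 in soup1.recursiveChildGenerator():
    if child1.name == "div":
        # The inner find_all("div") loop from the question is dropped so that
        # the top-level divs, which carry the ids, are inspected too.
        print(f'{child1.name}: {child1.get("id")}')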

Output:

div: rp_NaNnetSales
div: None
div: None
div: None
div: None
div: None
div: rp_grossProfit
div: rp_NaNgrossProfit
div: rp_grossProfitMarginPercentage
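
Since the question asks for a DataFrame, one possible way to collect each id together with its text is sketched below. Several things here are assumptions rather than part of the answer above: the HTML is a shortened stand-in for the sample, the column names "id" and "text" are arbitrary, html.parser is used instead of lxml, and only div elements that actually have an id attribute are kept (their nested divs contribute to the parent's text).

```python
import pandas as pd
from bs4 import BeautifulSoup

# Shortened stand-in for the HTML in the question (assumption).
html = '''
<div id="rp_NaNnetSales" class="nlpmain">
  <div class="nlpremark">This is my revenue&nbsp;</div>
</div>
<div id="rp_grossProfit" class="rationmain"><span><b>Gross Profit</b> increased by 191.3%</span></div>
<div id="rp_NaNgrossProfit" class="nlpmain"></div>
'''

soup = BeautifulSoup(html, "html.parser")

# id=True matches only divs that carry an id attribute, so div["id"] is safe.
# get_text(" ", strip=True) strips each text fragment and joins them with a space.
rows = [
    {"id": div["id"], "text": div.get_text(" ", strip=True)}
    for div in soup.find_all("div", id=True)
]

df = pd.DataFrame(rows, columns=["id", "text"])
print(df)
```

Divs with an id but no content, such as rp_NaNgrossProfit, end up with an empty string in the text column; they can be filtered out afterwards with df[df["text"] != ""] if they are not wanted.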