Web scraping using nested for loops with BeautifulSoup in Python 3

Question

I am scraping an HTML file. I wrote the following code:

from bs4 import BeautifulSoup

with open('Basic Materials.htm') as fp:
    soup = BeautifulSoup(fp, 'lxml')
    # The container <div> that holds the table of interest
    table = soup.find('div', {'class': 'sfe-break-bottom'})
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        print(cells)

The output of print(cells) is given below:

[<td colspan="2" style="text-align:left"><b>Gainers (% price change)</b>
</td>, <td width="15%">Last Trade
</td>, <td width="20%">Change
</td>, <td width="15%">
Mkt Cap
</td>]
[<td style="text-align:left;">
<a href="/finance?q=NYSE:GFI&amp;ei=H7pKWbBtgoabAZ7Kv7gI">Gold Fields Limited (ADR)</a>
</td>, <td style="text-align:left;">
<a href="/finance?q=NYSE:GFI&amp;ei=H7pKWbBtgoabAZ7Kv7gI">GFI</a>
</td>, <td>3.53
</td>, <td width="20%">
<span class="chg">+0.11</span>
<span class="chg">(3.22%)</span>
</td>, <td>2.84B
</td>]
[<td style="text-align:left;">
<a href="/finance?q=NYSE:VALE&amp;ei=H7pKWbBtgoabAZ7Kv7gI">Vale SA (ADR)</a>
</td>, <td style="text-align:left;">
<a href="/finance?q=NYSE:VALE&amp;ei=H7pKWbBtgoabAZ7Kv7gI">VALE</a>
</td>, <td>7.94
</td>, <td width="20%">
<span class="chg">+0.17</span>
<span class="chg">(2.19%)</span>
</td>, <td>39.61B
</td>]
[<td style="text-align:left;">
<a href="/finance?q=NYSE:CLF&amp;ei=H7pKWbBtgoabAZ7Kv7gI">Cliffs Natural Resources</a>
</td>, <td style="text-align:left;">
<a href="/finance?q=NYSE:CLF&amp;ei=H7pKWbBtgoabAZ7Kv7gI">CLF</a>
</td>, <td>5.97
</td>, <td width="20%">
<span class="chg">+0.12</span>
<span class="chg">(2.14%)</span>
</td>, <td>1.69B
</td>]
[<td style="text-align:left;">
<a href="/finance?q=NYSE:AUY&amp;ei=H7pKWbBtgoabAZ7Kv7gI">Yamana Gold Inc. (USA)</a>
</td>, <td style="text-align:left;">
<a href="/finance?q=NYSE:AUY&amp;ei=H7pKWbBtgoabAZ7Kv7gI">AUY</a>
</td>, <td>2.40
</td>, <td width="20%">
<span class="chg">+0.05</span>
<span class="chg">(1.91%)</span>
</td>, <td>2.27B
</td>]
[<td style="text-align:left;">
<a href="/finance?q=NYSE:HL&amp;ei=H7pKWbBtgoabAZ7Kv7gI">Hecla Mining Company</a>
</td>, <td style="text-align:left;">
<a href="/finance?q=NYSE:HL&amp;ei=H7pKWbBtgoabAZ7Kv7gI">HL</a>
</td>, <td>5.20
</td>, <td width="20%">
<span class="chg">+0.09</span>
<span class="chg">(1.86%)</span>
</td>, <td>2.03B
</td>]
[<td colspan="2" style="text-align:left"><b>Losers (% price change)</b>
</td>, <td colspan="3">
</td>]
[<td style="text-align:left;">
<a href="/finance?cid=717954&amp;ei=H7pKWbBtgoabAZ7Kv7gI">Jaguar Mining Inc (USA)</a>
</td>, <td style="text-align:left;">
<a href="/finance?cid=717954&amp;ei=H7pKWbBtgoabAZ7Kv7gI"></a>
</td>, <td>11.92
</td>, <td width="20%">
<span class="chr">-0.74</span>
<span class="chr">(-5.85%)</span>
</td>, <td>2.52B
</td>]
[<td style="text-align:left;">
<a href="/finance?q=NYSE:OLN&amp;ei=H7pKWbBtgoabAZ7Kv7gI">Olin Corporation</a>
</td>, <td style="text-align:left;">
<a href="/finance?q=NYSE:OLN&amp;ei=H7pKWbBtgoabAZ7Kv7gI">OLN</a>
</td>, <td>28.64
</td>, <td width="20%">
<span class="chr">-1.52</span>
<span class="chr">(-5.04%)</span>
</td>, <td>4.81B
</td>]
[<td style="text-align:left;">
<a href="/finance?q=NASDAQ:GPRE&amp;ei=H7pKWbBtgoabAZ7Kv7gI">Green Plains Inc</a>
</td>, <td style="text-align:left;">
<a href="/finance?q=NASDAQ:GPRE&amp;ei=H7pKWbBtgoabAZ7Kv7gI">GPRE</a>
</td>, <td>19.12
</td>, <td width="20%">
<span class="chr">-0.98</span>
<span class="chr">(-4.85%)</span>
</td>, <td>708.77M
</td>]
[<td style="text-align:left;">
<a href="/finance?q=NYSE:IPI&amp;ei=H7pKWbBtgoabAZ7Kv7gI">Intrepid Potash, Inc.</a>
</td>, <td style="text-align:left;">
<a href="/finance?q=NYSE:IPI&amp;ei=H7pKWbBtgoabAZ7Kv7gI">IPI</a>
</td>, <td>2.09
</td>, <td width="20%">
<span class="chr">-0.09</span>
<span class="chr">(-4.13%)</span>
</td>, <td>261.35M
</td>]
[<td style="text-align:left;">
<a href="/finance?q=NASDAQ:CENX&amp;ei=H7pKWbBtgoabAZ7Kv7gI">Century Aluminum Co</a>
</td>, <td style="text-align:left;">
<a href="/finance?q=NASDAQ:CENX&amp;ei=H7pKWbBtgoabAZ7Kv7gI">CENX</a>
</td>, <td>13.62
</td>, <td width="20%">
<span class="chr">-0.56</span>
<span class="chr">(-3.95%)</span>
</td>, <td>1.17B
</td>]
[<td colspan="2" style="text-align:left"><b>Most Actives (dollar volume)</b>
</td>, <td colspan="3">
</td>]
[<td style="text-align:left;">
<a href="/finance?q=NYSE:X&amp;ei=H7pKWbBtgoabAZ7Kv7gI">United States Steel Corp.</a>
</td>, <td style="text-align:left;">
<a href="/finance?q=NYSE:X&amp;ei=H7pKWbBtgoabAZ7Kv7gI">X</a>
</td>, <td>21.27
</td>, <td width="20%">
<span class="chg">+0.20</span>
<span class="chg">(0.95%)</span>
</td>, <td>3.77B
</td>]
[<td style="text-align:left;">
<a href="/finance?q=NYSE:DOW&amp;ei=H7pKWbBtgoabAZ7Kv7gI">Dow Chemical Co</a>
</td>, <td style="text-align:left;">
<a href="/finance?q=NYSE:DOW&amp;ei=H7pKWbBtgoabAZ7Kv7gI">DOW</a>
</td>, <td>64.01
</td>, <td width="20%">
<span class="chr">-1.09</span>
<span class="chr">(-1.67%)</span>
</td>, <td>78.06B
</td>]
[<td style="text-align:left;">
<a href="/finance?q=NYSE:NUE&amp;ei=H7pKWbBtgoabAZ7Kv7gI">Nucor Corporation</a>
</td>, <td style="text-align:left;">
<a href="/finance?q=NYSE:NUE&amp;ei=H7pKWbBtgoabAZ7Kv7gI">NUE</a>
</td>, <td>56.15
</td>, <td width="20%">
<span class="chg">+0.02</span>
<span class="chg">(0.04%)</span>
</td>, <td>18.02B
</td>]
[<td style="text-align:left;">
<a href="/finance?q=NYSE:VALE&amp;ei=H7pKWbBtgoabAZ7Kv7gI">Vale SA (ADR)</a>
</td>, <td style="text-align:left;">
<a href="/finance?q=NYSE:VALE&amp;ei=H7pKWbBtgoabAZ7Kv7gI">VALE</a>
</td>, <td>7.94
</td>, <td width="20%">
<span class="chg">+0.17</span>
<span class="chg">(2.19%)</span>
</td>, <td>39.61B
</td>]
[<td style="text-align:left;">
<a href="/finance?q=NYSE:MT&amp;ei=H7pKWbBtgoabAZ7Kv7gI">ArcelorMittal SA (ADR)</a>
</td>, <td style="text-align:left;">
<a href="/finance?q=NYSE:MT&amp;ei=H7pKWbBtgoabAZ7Kv7gI">MT</a>
</td>, <td>20.16
</td>, <td width="20%">
<span class="chg">+0.28</span>
<span class="chg">(1.38%)</span>
</td>, <td>20.06B
</td>]

Now I want to find the first 3 'a' tags and their text. So I removed the print(cells) statement from the code above and rewrote the code as follows:

with open('Basic Materials.htm') as fp:
    soup=BeautifulSoup(fp,'lxml')
    table=soup.find('div',{'class':'sfe-break-bottom'})
    for row in table.find_all('tr'):
        cells=row.find_all('td')
        for link in cells.find_all('a', limit=3):
            print(link.get_text()) # gets the name 
            print(link.get('href')) # gets the links

But I am getting the following error:

AttributeError                            Traceback (most recent call last)
 in ()
      4     for row in table.find_all('tr'):
      5         cells=row.find_all('td')
----> 6         for link in cells.find_all('a', limit=3):
      7             print(link.get_text()) # gets the name
      8             print(link.get('href')) # gets the links

~\Anaconda3\envs\practice\lib\site-packages\bs4\element.py in __getattr__(self, key)
   1805     def __getattr__(self, key):
   1806         raise AttributeError(
-> 1807             "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
   1808         )

AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

Can you please tell me why I am getting this error? How can I get the first 3 'a' tags and their text? Thanks.

If Mr Aguiar provided the answer you needed, then please mark his answer 'accepted'.
– Bill Bell, Commented Aug 13, 2017 at 21:03
Does this answer your question? Beautiful Soup: 'ResultSet' object has no attribute 'find_all'?
– AMC, Commented Mar 22, 2020 at 22:35

Vinícius Figueiredo · Accepted Answer · 2017-08-13 21:58:17Z

cells is a ResultSet, which is a list of Tag objects, so you can't call .find_all directly on it; the method exists only on the individual elements. To replace what you meant by cells.find_all('a', limit=3), loop over the cells and search each one. You can do something like:

for cell in cells:
    # Each cell is a single Tag, so find_all() is available on it
    atags = cell.find_all('a', limit=3)
    for link in atags:
        print(link.text)     # link text
        print(link['href'])  # link target

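or using a list comprehension:

# Flatten the per-cell results into one list of <a> tags;
# cells without a link contribute nothing, so no indexing is needed
atags = [link for cell in cells for link in cell.find_all('a', limit=3)]
for link in atags:
    print(link.text)
    print(link['href'])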
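As an alternative that is not part of the answer above, find_all can also be called on the row itself, since each row is a single Tag and the search covers all of its descendants. The sketch below is only illustrative: the HTML string is a hypothetical, trimmed-down stand-in for 'Basic Materials.htm', the helper name first_links is made up, and the built-in 'html.parser' is used instead of 'lxml' so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for 'Basic Materials.htm'
HTML = """
<div class="sfe-break-bottom"><table>
<tr><td><b>Gainers (% price change)</b></td><td>Last Trade</td></tr>
<tr>
  <td><a href="/finance?q=NYSE:GFI">Gold Fields Limited (ADR)</a></td>
  <td><a href="/finance?q=NYSE:GFI">GFI</a></td>
  <td>3.53</td>
</tr>
</table></div>
"""


def first_links(html, limit=3):
    """Return, for each table row, up to `limit` (text, href) pairs."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("div", {"class": "sfe-break-bottom"})
    rows = []
    for row in table.find_all("tr"):
        # row is a single Tag, so find_all() searches all of its descendants
        links = row.find_all("a", limit=limit)
        rows.append([(a.get_text(strip=True), a.get("href")) for a in links])
    return rows


result = first_links(HTML)
for row_links in result:
    for text, href in row_links:
        print(text, href)
```

With this approach, limit=3 applies to the whole row rather than to each cell, which seems closer to "the first 3 'a' tags" asked for in the question. Header rows without links simply yield an empty list.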

edited Aug 13, 2017 at 21:58

answered Aug 13, 2017 at 20:17

8 Comments

Vinícius Figueiredo Over a year ago

@Jonelya Glad to help, please let me know if that worked!

Jonelya Over a year ago

I changed the code as you guided me:

with open('Basic Materials.htm') as fp:
    soup=BeautifulSoup(fp,'lxml')
    table=soup.find('div',{'class':'sfe-break-bottom'})
    for row in table.find_all('tr'):
        cells=row.find_all('td')
        atags = [cell.find_all('a',limit=3) for cell in cells]
        for links in atags:
            print(links.get_text())
            print(links.get('href'))

But again I am getting an error message: AttributeError: ResultSet object has no attribute 'get_text'.

Vinícius Figueiredo Over a year ago

@Jonelya You are correct, check my edit, try .text instead.

Jonelya Over a year ago

Not working. I also tried print(link.text()) and print(link['href']) separately, but that is not working either.

Jonelya Over a year ago

Thank you, brother... it's working... thank you so much.
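The 'get_text' error reported in the comments appears to have the same cause as the original one: a comprehension of the form [cell.find_all(...) for cell in cells] produces a list of ResultSets (one per cell), not a list of tags. A minimal sketch that reproduces this and shows a flattened alternative follows; the HTML fragment is a hypothetical single row modeled on the output in the question, and 'html.parser' is assumed in place of 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical single row modeled on the question's output
ROW = """
<table><tr>
  <td><a href="/finance?q=NYSE:OLN">Olin Corporation</a></td>
  <td><a href="/finance?q=NYSE:OLN">OLN</a></td>
  <td>28.64</td>
</tr></table>
"""

soup = BeautifulSoup(ROW, "html.parser")
cells = soup.find("tr").find_all("td")

# One find_all() result per cell: a list of ResultSets, some of them empty
nested = [cell.find_all("a", limit=3) for cell in cells]
print([len(links) for links in nested])

raised = False
try:
    # A ResultSet is a list of tags, not a tag, so this call fails
    nested[0].get_text()
except AttributeError as exc:
    raised = True
    print("AttributeError:", exc)

# Two for-clauses flatten the per-cell results; the slice keeps the first 3
flat = [a for cell in cells for a in cell.find_all("a")][:3]
for a in flat:
    print(a.get_text(), a["href"])
```

The third cell holds only a price, so its ResultSet is empty; indexing it with [0] would raise an IndexError, which is why the flattened form avoids indexing altogether.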
