Unable to parse html in Python

Question

Why am I not able to parse this HTML web page into a CSV?

url = 'c:/x/x/x/xyz.html'  # HTML of the www.cloudtango.org home page, saved on a local drive



with open(url, 'r',encoding='utf-8') as f:
    html_string = f.read()

soup= bs4.BeautifulSoup('html_string.parser')
data1= html_string.find_all('td',{'class':'company'})
full=[]
for each in data1:
    comp= each.find('img')['alt']
    desc= each.find_next('td').text
    dd={'company':comp,'description':desc}
    full.append(dd)

Error:

AttributeError: 'str' object has no attribute 'find_all'

Andrej Kesely · Accepted Answer · 2021-07-14 19:24:10Z

1

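html_string is a plain str, so it has no .find_all() method; that method belongs to the BeautifulSoup object. Note also that BeautifulSoup('html_string.parser') parses the literal text 'html_string.parser' rather than the file contents. Pass the markup and the parser name as two separate arguments, then call .find_all() on the resulting soup.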
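As a minimal sketch of that fix applied to the question's own code (the inline HTML below is a made-up stand-in for the saved page, not its real markup):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of the saved page.
html_string = """
<table>
  <tr>
    <td class="company"><img src="logo.png" alt="Example Co"></td>
    <td>Example description</td>
  </tr>
</table>
"""

# Wrong: this parses the literal text 'html_string.parser', and
# html_string itself is still a plain str with no .find_all().
# soup = BeautifulSoup('html_string.parser')
# data1 = html_string.find_all('td', {'class': 'company'})

# Right: markup first, parser name second, then search the soup object.
soup = BeautifulSoup(html_string, "html.parser")
data1 = soup.find_all("td", {"class": "company"})

companies = [td.find("img")["alt"] for td in data1]
print(companies)
```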

To get the information from the specified URL, you can use the following example:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.cloudtango.org/"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

data1 = soup.find_all("td", {"class": "company"})

full = []
for each in data1:
    comp = each.find("img")["alt"]
    desc = each.find_next("td").text
    dd = {"company": comp, "description": desc}
    full.append(dd)

print(pd.DataFrame(full))

Prints:

                                company                                                                                                                                                                                                description
0                BlackPoint IT Services       BlackPoint’s comprehensive range of Managed IT Services is designed to help you improve IT quality, efficiency and reliability -and save you up to 50% on IT cost. Providing IT solutions for more …
1                  ICC Managed Services        The ICC Group is a global and independent IT solutions company, providing a comprehensive, customer focused service to the SME, enterprise and public sector markets.  \r\n\r\nICC deliver a full …
2                           First Focus      First Focus is Australia’s best managed service provider for medium sized organisations. With tens of thousands of end users supported across hundreds of customers, First Focus has the experience …

...and so on.

EDIT: To read from a local file:

import pandas as pd
from bs4 import BeautifulSoup

with open('your_file.html', 'r') as f_in:
    soup = BeautifulSoup(f_in.read(), "html.parser")

data1 = soup.find_all("td", {"class": "company"})

full = []
for each in data1:
    comp = each.find("img")["alt"]
    desc = each.find_next("td").text
    dd = {"company": comp, "description": desc}
    full.append(dd)

print(pd.DataFrame(full))
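Since the question asks for a CSV, here is a rough sketch of the same local-file approach that writes the rows out instead of printing them. It assumes the saved page is UTF-8 encoded (as the comments below suggest). The function name html_file_to_csv and the output file name are hypothetical, and the standard-library csv module is swapped in for pandas here (pd.DataFrame(full).to_csv(...) should work as well):

```python
import csv
from bs4 import BeautifulSoup


def html_file_to_csv(html_path, csv_path):
    """Parse a saved copy of the page and write company/description rows to CSV."""
    # An explicit encoding avoids falling back to the platform default
    # (for example cp1252 on Windows), which can raise UnicodeDecodeError.
    with open(html_path, "r", encoding="utf-8") as f_in:
        soup = BeautifulSoup(f_in.read(), "html.parser")

    full = []
    for each in soup.find_all("td", {"class": "company"}):
        comp = each.find("img")["alt"]
        desc = each.find_next("td").text.strip()
        full.append({"company": comp, "description": desc})

    # newline="" is what the csv module expects for files it writes to.
    with open(csv_path, "w", newline="", encoding="utf-8") as f_out:
        writer = csv.DictWriter(f_out, fieldnames=["company", "description"])
        writer.writeheader()
        writer.writerows(full)
    return full
```

It could then be called as html_file_to_csv('your_file.html', 'companies.csv').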

edited Jul 14, 2021 at 19:24

answered Jul 14, 2021 at 19:18

Andrej Kesely


3 Comments

Byte Over a year ago

The URL is right, but I've saved the data into an HTML page on my local drive and I'm retrieving the page from there.

Byte Over a year ago

Error "UnicodeDecodeError: 'charmap' codec can't decode byte". I saved the web page to my local drive and I'll be reading the data from there.

Byte Over a year ago

I added encoding='utf-8' inside open() and then it worked, but it stopped at the comp = each.find('img')['alt'] step with this error: "TypeError: 'NoneType' object is not subscriptable". Do you have any idea why we see this error?
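That TypeError means each.find('img') returned None for at least one cell. One likely cause, not verified against the actual saved file, is that some td with class "company" contains no img tag. A minimal sketch of a guarded loop, using made-up markup in which the second cell has no image:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second "company" cell has no <img>.
html_string = """
<table>
  <tr><td class="company"><img alt="Example Co"></td><td>First description</td></tr>
  <tr><td class="company">Text Only Ltd</td><td>Second description</td></tr>
</table>
"""

soup = BeautifulSoup(html_string, "html.parser")

full = []
for each in soup.find_all("td", {"class": "company"}):
    img = each.find("img")
    if img is not None and img.has_attr("alt"):
        comp = img["alt"]
    else:
        # No usable <img>: fall back to the cell's own text.
        comp = each.get_text(strip=True)

    nxt = each.find_next("td")
    desc = nxt.get_text(strip=True) if nxt is not None else ""
    full.append({"company": comp, "description": desc})

print(full)
```

Whether to fall back to the cell text or skip such rows entirely depends on what the saved page actually contains.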
