Why am I not able to parse this HTML web page into a CSV?

url='c:/x/x/x/xyz.html' # HTML of the www.cloudtango.org home page, saved on a local drive



with open(url, 'r',encoding='utf-8') as f:
    html_string = f.read()

soup= bs4.BeautifulSoup('html_string.parser')
data1= html_string.find_all('td',{'class':'company'})
full=[]
for each in data1:
    comp= each.find('img')['alt']
    desc= each.find_next('td').text
    dd={'company':comp,'description':desc}
    full.append(dd)

Error:

AttributeError: 'str' object has no attribute 'find_all'

1 Answer

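html_string is a plain Python string, so it has no .find_all() method. Call .find_all() on the BeautifulSoup object (soup) instead, and build that object from the string and the parser name as two separate arguments: bs4.BeautifulSoup(html_string, 'html.parser'), not bs4.BeautifulSoup('html_string.parser').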

To get the information directly from the URL, you can use the following example:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.cloudtango.org/"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

data1 = soup.find_all("td", {"class": "company"})

full = []
for each in data1:
    comp = each.find("img")["alt"]
    desc = each.find_next("td").text
    dd = {"company": comp, "description": desc}
    full.append(dd)

print(pd.DataFrame(full))

Prints:

                                company                                                                                                                                                                                                description
0                BlackPoint IT Services       BlackPoint’s comprehensive range of Managed IT Services is designed to help you improve IT quality, efficiency and reliability -and save you up to 50% on IT cost. Providing IT solutions for more …
1                  ICC Managed Services        The ICC Group is a global and independent IT solutions company, providing a comprehensive, customer focused service to the SME, enterprise and public sector markets.  \r\n\r\nICC deliver a full …
2                           First Focus      First Focus is Australia’s best managed service provider for medium sized organisations. With tens of thousands of end users supported across hundreds of customers, First Focus has the experience …

...and so on.

EDIT: To read from local file:

import pandas as pd
from bs4 import BeautifulSoup

with open('your_file.html', 'r', encoding='utf-8') as f_in:
    soup = BeautifulSoup(f_in.read(), "html.parser")

data1 = soup.find_all("td", {"class": "company"})

full = []
for each in data1:
    comp = each.find("img")["alt"]
    desc = each.find_next("td").text
    dd = {"company": comp, "description": desc}
    full.append(dd)

print(pd.DataFrame(full))
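
Since the question asks for a CSV, here is a minimal sketch of writing the collected rows to one with the standard library. The helper name write_rows, the sample row and the file name companies.csv are made up for illustration; the column names match the dicts built in the loop above.

```python
import csv
import io


def write_rows(rows, fh):
    """Write a list of {'company': ..., 'description': ...} dicts as CSV."""
    writer = csv.DictWriter(fh, fieldnames=["company", "description"])
    writer.writeheader()
    writer.writerows(rows)


# Sample row standing in for the `full` list built above (made-up values).
sample = [
    {"company": "Example Co", "description": "Managed IT, cloud and support"},
]

# Write to an in-memory buffer so the result can be inspected.
buf = io.StringIO()
write_rows(sample, buf)
csv_text = buf.getvalue()
print(csv_text)

# To write a real file instead (the file name is a placeholder):
# with open("companies.csv", "w", newline="", encoding="utf-8") as f_out:
#     write_rows(full, f_out)
```

If you already have the DataFrame, pd.DataFrame(full).to_csv("companies.csv", index=False) should give an equivalent file without the extra helper.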

3 Comments

The URL is right, but I've saved the page as an HTML file on my local drive and I'm retrieving it from there.
I get the error "UnicodeDecodeError: 'charmap' codec can't decode byte". I saved the web page to my local drive and I'll be reading the data from there.
I added encoding='utf-8' inside open() and then it worked, but it stopped at the comp = each.find('img')['alt'] step with the error "TypeError: 'NoneType' object is not subscriptable". Do you have any idea why we see this error?
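
On the last error: .find() returns None when nothing matches, so each.find('img')['alt'] raises that TypeError for any "company" cell that has no <img> inside it. I can't see your saved file, so this is only a guess at the cause, but a defensive version of the loop might look like the sketch below. The HTML snippet is made up for illustration, and falling back to the cell's text is just one possible choice.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the second "company" cell has no <img> tag.
html = """
<table>
  <tr><td class="company"><img alt="Example Co"></td><td>First description</td></tr>
  <tr><td class="company">No logo here</td><td>Second description</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

full = []
for cell in soup.find_all("td", {"class": "company"}):
    img = cell.find("img")  # None when the cell contains no <img>
    if img is None or not img.get("alt"):
        # Fall back to the cell's own text instead of crashing.
        comp = cell.get_text(strip=True)
    else:
        comp = img["alt"]

    nxt = cell.find_next("td")  # may also be None on the last cell of a page
    desc = nxt.get_text(strip=True) if nxt is not None else ""

    full.append({"company": comp, "description": desc})

print(full)
```

If you would rather skip such rows entirely, replace the fallback with continue. Printing the offending cell is also a quick way to see what the saved page actually contains at that point.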
