
I have been trying to scrape some data from this URL. However, I wasn't able to scrape the "See if you can identify the pest" section: there is a class named "collapsefaq-content" which BeautifulSoup wasn't able to find.

I want to scrape all the tag data under this class.

Here is my code:

import urllib.request
import csv
import pandas as pd
from bs4 import BeautifulSoup

page_url = 'http://www.agriculture.gov.au/pests-diseases-weeds/plant#identify-pests-diseases'
page = urllib.request.urlopen(page_url)
soup = BeautifulSoup(page, 'html.parser')

file_name = "alpit.csv"
main_url = []
see_if_you_can = []
see_if_you_can.append("Indetify")
legal =[]
legal.append('Legal Stuff')
specimen =[]
specimen.append("Specimen")
insect_name = []
insect_name.append("Name of insect")
disease_name = []
disease_name.append("Name")
disease_list = []
disease_list.append("URL")
origin = []
origin.append('Origin')

for insectName in soup.find_all('li', attrs={'class': 'flex-item'}):
    # Keep only relative links and build the absolute URL for each pest page.
    if insectName.a.attrs['href'].startswith('/'):
        main_url.append('http://www.agriculture.gov.au' + insectName.a.attrs['href'])
        print(insectName.text.strip())  # disease name

        for name in insectName.find_all('img'):
            print('http://www.agriculture.gov.au' + name.attrs['src'])  # disease image link
            disease_list.append('http://www.agriculture.gov.au' + name.attrs['src'])

for disease in main_url:
    inner_page = urllib.request.urlopen(disease)
    soup_list = BeautifulSoup(inner_page, 'lxml')

    for detail in soup_list.find_all('strong'):
        if detail.text == 'Origin: ':
            origin.append(detail.next_sibling.strip())
            print(detail.next_sibling.strip())

    for name in soup_list.find_all('div', class_='pest-header-content'):
        print(name.h2.text)
        insect_name.append(name.h2.text)

    # This loop never matches anything, which is the problem described above.
    for textin in soup_list.find_all('div', class_='collapsefaq-content'):
        print("*******")
        print(textin.text)
            



# print('alpit')
# print(len(disease_list))
# print(len(origin))



# One row per list; transpose so each list becomes a column in the CSV.
df = pd.DataFrame([insect_name, disease_list, origin, see_if_you_can, legal, specimen])
df = df.transpose()
df.to_csv(file_name, index=False, header=None)

# with open('alpit.csv','w') as myfile:
#   wr =  csv.writer(myfile)
#   for val in disease_list:
#       wr.writerow([val])
#   for val in origin:
#       wr.writerow([val])

Even the "*******" markers are not being printed. Can anyone tell me what I am doing wrong here?

  • Possible duplicate of Reading dynamically generated web pages using python Commented May 29, 2018 at 19:46
  • That class is not present in the HTML; it's probably added later with JS logic (see the quick check after these comments). Commented May 29, 2018 at 19:47
  • What do you want to parse from there? The content you are looking for is not dynamically generated. Commented May 29, 2018 at 19:56
  • @Alpit Anand, do not post the complete script, just the relevant part. Commented May 29, 2018 at 20:19
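
A quick way to test the JS claim from the comments (a minimal sketch, assuming requests is installed; it only checks whether the class name appears anywhere in the raw HTML the server returns):

import requests

url = 'http://www.agriculture.gov.au/pests-diseases-weeds/plant/khapra-beetle'
html = requests.get(url).text

# 0 here means the class is added client-side by JavaScript,
# so BeautifulSoup will never find it in the downloaded document.
print(html.count('collapsefaq-content'))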

1 Answer

This is how you can get the content of the desired portion. I suppose you can sort out the rest to suit your requirements.

import requests
from bs4 import BeautifulSoup

URL = 'http://www.agriculture.gov.au/pests-diseases-weeds/plant/khapra-beetle#see-if-you-can-identify-the-pest'

res = requests.get(URL)
soup = BeautifulSoup(res.text, "lxml")
# The section heading lives inside the #collapsefaq container.
container = soup.select_one("#collapsefaq h3[title='expand section']")
print(container.get_text(strip=True))

Output:

See if you can identify the pest

You can access the rest using:

container = soup.select_one("#collapsefaq h3[title='expand section']").find_next_sibling()
print(container.get_text(strip=True))
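
If you need every collapsible section rather than just the first heading, the same idea extends naturally. A sketch along those lines (iterating over all matching h3 elements and collecting the results into a dict are my additions, not part of the original answer):

import requests
from bs4 import BeautifulSoup

URL = 'http://www.agriculture.gov.au/pests-diseases-weeds/plant/khapra-beetle'

res = requests.get(URL)
soup = BeautifulSoup(res.text, "lxml")

sections = {}
# Each collapsible block is an h3 heading followed by its content sibling
# (the same relationship the find_next_sibling() call above relies on).
for heading in soup.select("#collapsefaq h3[title='expand section']"):
    body = heading.find_next_sibling()
    if body is not None:
        sections[heading.get_text(strip=True)] = body.get_text(" ", strip=True)

for title, text in sections.items():
    print(title, '->', text[:80])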