
I have been trying to scrape some data from this URL. However, I wasn't able to scrape the "See if you can identify the pest" section: there is a class named "collapsefaq-content" which BeautifulSoup wasn't able to find.

I want to scrape all the tag data under this class.

Here is my code:

import urllib.request
import csv
import pandas as pd
from bs4 import BeautifulSoup

page_url = 'http://www.agriculture.gov.au/pests-diseases-weeds/plant#identify-pests-diseases'
page = urllib.request.urlopen(page_url)
soup = BeautifulSoup(page, 'html.parser')

file_name = "alpit.csv"
main_url = []
see_if_you_can = []
see_if_you_can.append("Indetify")
legal =[]
legal.append('Legal Stuff')
specimen =[]
specimen.append("Specimen")
insect_name = []
insect_name.append("Name of insect")
disease_name = []
disease_name.append("Name")
disease_list = []
disease_list.append("URL")
origin = []
origin.append('Origin')

for insectName in soup.find_all('li', attrs={'class': 'flex-item'}):
    # Keep only relative links and build the absolute URL for each pest page.
    if insectName.a.attrs['href'].startswith('/'):
        main_url.append('http://www.agriculture.gov.au' + insectName.a.attrs['href'])
        print(insectName.text.strip())  # disease name

        for name in insectName.find_all('img'):
            print('http://www.agriculture.gov.au' + name.attrs['src'])  # disease image link
            disease_list.append('http://www.agriculture.gov.au' + name.attrs['src'])

for disease in main_url:
    inner_page = urllib.request.urlopen(disease)
    soup_list = BeautifulSoup(inner_page, 'lxml')

    for detail in soup_list.find_all('strong'):
        if detail.text == 'Origin: ':
            origin.append(detail.next_sibling.strip())
            print(detail.next_sibling.strip())

    for name in soup_list.find_all('div', class_='pest-header-content'):
        print(name.h2.text)
        insect_name.append(name.h2.text)

    # This loop never matches anything, which is the problem described above.
    for textin in soup_list.find_all('div', class_='collapsefaq-content'):
        print("*******")
        print(textin.text)
            



# print('alpit')
# print(len(disease_list))
# print(len(origin))



# One row per list; transpose so each list becomes a column in the CSV.
df = pd.DataFrame([insect_name, disease_list, origin, see_if_you_can, legal, specimen])
df = df.transpose()
df.to_csv(file_name, index=False, header=None)

# with open('alpit.csv','w') as myfile:
#   wr =  csv.writer(myfile)
#   for val in disease_list:
#       wr.writerow([val])
#   for val in origin:
#       wr.writerow([val])

Even the "*******" markers are not being printed. Can anyone tell me what I am doing wrong here?

  • Possible duplicate of Reading dynamically generated web pages using python Commented May 29, 2018 at 19:46
  • That class is not present in the HTML; it's probably added later with JS logic (see the quick check after these comments). Commented May 29, 2018 at 19:47
  • What do you want to parse from there? The content you are looking for is not dynamically generated. Commented May 29, 2018 at 19:56
  • @Alpit Anand, do not post the complete script, just the relevant part. Commented May 29, 2018 at 20:19
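
A quick way to test the JS claim from the comments (a minimal sketch, assuming requests is installed; it only checks whether the class name appears anywhere in the raw HTML the server returns):

import requests

url = 'http://www.agriculture.gov.au/pests-diseases-weeds/plant/khapra-beetle'
html = requests.get(url).text

# 0 here means the class is added client-side by JavaScript,
# so BeautifulSoup will never find it in the downloaded document.
print(html.count('collapsefaq-content'))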

1 Answer

This is how you can get the content of the desired portion. I suppose you can sort out the rest to suit your requirements.

import requests
from bs4 import BeautifulSoup

URL = 'http://www.agriculture.gov.au/pests-diseases-weeds/plant/khapra-beetle#see-if-you-can-identify-the-pest'

res = requests.get(URL)
soup = BeautifulSoup(res.text, "lxml")
# The section heading lives inside the #collapsefaq container.
container = soup.select_one("#collapsefaq h3[title='expand section']")
print(container.get_text(strip=True))

Output:

See if you can identify the pest

You can access the rest using:

container = soup.select_one("#collapsefaq h3[title='expand section']").find_next_sibling()
print(container.get_text(strip=True))
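
If you need every collapsible section rather than just the first heading, the same idea extends naturally. A sketch along those lines (iterating over all matching h3 elements and collecting the results into a dict are my additions, not part of the original answer):

import requests
from bs4 import BeautifulSoup

URL = 'http://www.agriculture.gov.au/pests-diseases-weeds/plant/khapra-beetle'

res = requests.get(URL)
soup = BeautifulSoup(res.text, "lxml")

sections = {}
# Each collapsible block is an h3 heading followed by its content sibling
# (the same relationship the find_next_sibling() call above relies on).
for heading in soup.select("#collapsefaq h3[title='expand section']"):
    body = heading.find_next_sibling()
    if body is not None:
        sections[heading.get_text(strip=True)] = body.get_text(" ", strip=True)

for title, text in sections.items():
    print(title, '->', text[:80])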