Web scraping using Python

Question

I'm trying to scrape all the comments from https://www.consumeraffairs.com/privacy/transunion.html. I build the list of page URLs like this:

    page_list = []
    def pagination(soup):
        for i in range(0,32):
            domain = "https://www.consumeraffairs.com/privacy/transunion.html?page="+str(i)                        
            page_list.append(domain)
        return page_list
    pages = pagination(soup)

    print(pages)

How do I capture the comments on each of these pages? This is what I tried:

    import time
    comment_list = []
    def get_comments(urls):
        for url in urls:
            try:
                print(url)
                #comment = soup.find_all('div',{'class':'rvw-bd'})
                comment = soup.find_all('div',{'class':'rvw-bd'})             
                print(len(comment))
                for x in range(len(comment)):
                    comment_list.append(comment[x].p.text.strip())            
            except:
                continue
                time.sleep(30)
        return comment_list
    comments = get_comments(pages)

I used this code, but it only scrapes the first 10 comments from the first page. How can I fix this?

Mica Horton · Accepted Answer · 2020-04-23 01:14:00Z

I think you were on the right track changing the "page=" value in the URL, but in the code you posted, the soup object is never updated to represent the content of each new page, so every iteration parses the same page. I rewrote some of your code so that it fetches and parses each page:

    from bs4 import BeautifulSoup
    import requests
    import time

    # Build the URL of every page to scrape.
    page_list = []
    for i in range(0, 32):
        domain = "https://www.consumeraffairs.com/privacy/transunion.html?page=" + str(i)
        page_list.append(domain)

    comment_list = []
    for page in page_list:
        try:
            # Download and parse this page, so soup reflects the current URL.
            content = requests.get(page).content
            soup = BeautifulSoup(content, 'html.parser')

            comments = soup.find_all('div', {'class': 'rvw-bd'})
            print(len(comments))

            for comment in comments:
                comment_list.append(comment.p.text.strip())
        except Exception:
            # If a page fails to download or parse, wait before moving on.
            # (The original sleep came after continue, so it never ran.)
            time.sleep(30)
            continue

    print(comment_list)
    print(len(comment_list))
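
If you don't want to hard-code 32 pages, one option is to keep requesting pages until one comes back with no comments. Below is a minimal sketch of that idea. The names `collect_comments`, `fetch_page` and `parse_comments` are my own, not part of requests or BeautifulSoup, and it assumes that a page past the end returns no comment blocks, which I haven't verified for this site.

```python
import time
from typing import Callable, List


def collect_comments(
    base_url: str,
    fetch_page: Callable[[str], str],
    parse_comments: Callable[[str], List[str]],
    max_pages: int = 32,
    delay_seconds: float = 0.0,
) -> List[str]:
    """Collect comments page by page, stopping at the first empty page.

    fetch_page and parse_comments are hypothetical callables supplied by
    the caller: one downloads a URL and returns its HTML, the other pulls
    the comment strings out of that HTML.
    """
    all_comments: List[str] = []
    for page_number in range(max_pages):
        url = f"{base_url}?page={page_number}"
        html = fetch_page(url)           # download this specific page
        comments = parse_comments(html)  # parse this page's HTML, not an old soup
        if not comments:
            # Assumption: an empty page means we are past the last page.
            break
        all_comments.extend(comments)
        time.sleep(delay_seconds)        # optional pause between requests
    return all_comments
```

With the libraries used above, `fetch_page` could be something like `lambda url: requests.get(url).text`, and `parse_comments` could wrap the same `soup.find_all('div', {'class': 'rvw-bd'})` lookup. Passing them in as arguments also lets you try the loop on canned HTML before hitting the real site.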

Let me know if this does/doesn't help!
