I'm trying to scrape all the comments from https://www.consumeraffairs.com/privacy/transunion.html. First I build the list of page URLs:

    page_list = []
    def pagination(soup):
        for i in range(0,32):
            domain = "https://www.consumeraffairs.com/privacy/transunion.html?page="+str(i)                        
            page_list.append(domain)
        return page_list
    pages = pagination(soup)

    print(pages)

How can I capture the comments on each of these pages? This is what I tried:

    import time
    comment_list = []
    def get_comments(urls):
        for url in urls:
            try:
                print(url)
                #comment = soup.find_all('div',{'class':'rvw-bd'})
                comment = soup.find_all('div',{'class':'rvw-bd'})             
                print(len(comment))
                for x in range(len(comment)):
                    comment_list.append(comment[x].p.text.strip())            
            except:
                continue
                time.sleep(30)
        return comment_list
    comments = get_comments(pages)

I used this code, but it scrapes only the first 10 comments from the first page. How can I fix this?

1 Answer

I think you were on the right track changing the "page=" value in the URL, but from the code you posted, it doesn't look like you rebuilt the soup object from the content of each new page, so every iteration parses the same first page. I rewrote some of your code to do this:

    from bs4 import BeautifulSoup
    import requests
    import time

    # Build the list of page URLs (same range as in the question).
    page_list = []
    for i in range(0, 32):
        domain = "https://www.consumeraffairs.com/privacy/transunion.html?page=" + str(i)
        page_list.append(domain)

    comment_list = []
    for page in page_list:
        try:
            # Download each page so the soup reflects that page's content.
            content = requests.get(page, timeout=30).content
        except requests.RequestException:
            # Pause, then skip pages that fail to download.
            time.sleep(30)
            continue

        soup = BeautifulSoup(content, 'html.parser')
        comment = soup.find_all('div', {'class': 'rvw-bd'})
        print(len(comment))

        for div in comment:
            # Guard against review blocks that have no <p> tag.
            if div.p is not None:
                comment_list.append(div.p.text.strip())

    print(comment_list)
    print(len(comment_list))
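If you want to check the loop without hitting the site, one option is to separate downloading from parsing. The following is only a sketch under a few assumptions: `build_page_urls`, `parse_comments`, and `collect_comments` are names I made up for illustration, the `rvw-bd` class is taken from your code and may not match the live markup, and I haven't verified which page numbers the site actually serves.

```python
import time
from typing import Callable, Iterable, List

BASE_URL = "https://www.consumeraffairs.com/privacy/transunion.html"


def build_page_urls(first: int, last: int) -> List[str]:
    """Return the paginated URLs for pages first..last (inclusive)."""
    return [f"{BASE_URL}?page={i}" for i in range(first, last + 1)]


def parse_comments(html: str) -> List[str]:
    """Extract comment text from one page of HTML.

    Assumes each comment sits in a <p> inside <div class="rvw-bd">,
    as in the question's code.
    """
    # Imported here so the fetch loop can be exercised without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return [
        div.p.get_text(strip=True)
        for div in soup.find_all("div", class_="rvw-bd")
        if div.p is not None
    ]


def collect_comments(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    parse: Callable[[str], List[str]] = parse_comments,
    delay: float = 0.0,
) -> List[str]:
    """Fetch every URL, parse each response separately, and merge the results."""
    comments: List[str] = []
    for url in urls:
        try:
            html = fetch(url)  # a fresh download for every page
        except Exception as exc:
            print(f"skipping {url}: {exc}")
            continue
        comments.extend(parse(html))
        time.sleep(delay)  # pause between requests
    return comments
```

To run this against the real site you could pass something like `lambda url: requests.get(url, timeout=30).text` as `fetch`. For a quick offline check, pass a fake `fetch` that returns canned HTML strings.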

Let me know if this does/doesn't help!
