I've been working on a practice web scraper that collects written reviews and writes them to a CSV file, with each review given its own row. I'm running into two problems:

  1. I can't seem to strip out the HTML and get only the text (i.e. the written review and nothing else).
  2. There is a lot of stray whitespace between and within my review text (e.g. blank rows between lines).

Thanks for your help!

Code below:

#! python3

import bs4, os, requests, csv

# Get URL of the page

URL = ('https://www.tripadvisor.com/Attraction_Review-g294265-d2149128-Reviews-Gardens_by_the_Bay-Singapore.html')

# Looping until the 5th page of reviews

pagecounter = 0
while pagecounter != 5:

    # Request get the first page
    res = requests.get(URL)
    res.raise_for_status

    # Download the html of the first page
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    reviewElems = soup.select('.partial_entry')


    if reviewElems == []:
        print('Could not find clue.')

    else:
        #for i in range(len(reviewElems)):
            #print(reviewElems[i].getText())

        with open('GardensbytheBay.csv', 'a', newline='') as csvfile:

            for row in reviewElems:
                writer = csv.writer(csvfile, delimiter=' ', quoting=csv.QUOTE_ALL)
                writer.writerow(row)
            print('Writing page')

    # Find URL of next page and update URL
    if pagecounter == 0:
        nextLink = soup.select('a[data-offset]')[0]

    elif pagecounter != 0:
        nextLink = soup.select('a[data-offset]')[1]

    URL = 'http://www.tripadvisor.com' + nextLink.get('href')
    pagecounter += 1

print('Download complete')
csvfile.close()
  • 2) Browsers ignore extra whitespace when they display HTML, so people who create web pages don't worry about it either. They (or their tools) add whitespace to make the markup more readable during development; browsers skip that whitespace when they render the page. Commented Oct 14, 2016 at 8:44
  • You might want to take a look at: stackoverflow.com/questions/328356/…. Commented Oct 14, 2016 at 9:53
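
The whitespace collapsing described in the first comment can be approximated in Python. This is a minimal sketch using only the standard library; `collapse_whitespace` and the sample review text are hypothetical names and data made up for illustration, not part of the original script.

```python
def collapse_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on any run of whitespace and
    # drops leading/trailing whitespace; joining puts single spaces back.
    return " ".join(text.split())


# Hypothetical review text, shaped like what an indented HTML page might yield.
raw_review = "\n      Lovely gardens,\n\n         worth a   visit at night.\n   "
clean_review = collapse_whitespace(raw_review)
print(clean_review)
```

Applying something like this to each review's text before writing it should remove the blank lines and long gaps inside a review, assuming they come from the page's own indentation.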

1 Answer

You can use row.get_text(strip=True) to get just the text from each selected p.partial_entry element. Pass that text to writerow() inside a list so it is written as a single field. Try the following:

import bs4, os, requests, csv

# Get URL of the page
URL = ('https://www.tripadvisor.com/Attraction_Review-g294265-d2149128-Reviews-Gardens_by_the_Bay-Singapore.html')

with open('GardensbytheBay.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ')

    # Loop over the first 5 pages of reviews
    for pagecounter in range(5):

        # Request the current page and stop on an HTTP error
        res = requests.get(URL)
        res.raise_for_status()

        # Parse the HTML of the current page
        soup = bs4.BeautifulSoup(res.text, "html.parser")
        reviewElems = soup.select('p.partial_entry')

        if reviewElems:
            for row in reviewElems:
                # Keep only the text of the element, without tags or surrounding whitespace
                review_text = row.get_text(strip=True)
                writer.writerow([review_text])
            print('Writing page', pagecounter + 1)
        else:
            print('Could not find clue.')

        # Find URL of next page and update URL
        if pagecounter == 0:
            nextLink = soup.select('a[data-offset]')[0]
        elif pagecounter != 0:
            nextLink = soup.select('a[data-offset]')[1]

        URL = 'http://www.tripadvisor.com' + nextLink.get('href')

print('Download complete')
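
The reason for wrapping the text in a list can be seen without any scraping. `csv.writer.writerow()` expects an iterable of fields and writes one column per item. In the question, it was handed the selected element itself, so (as far as I can tell) each child of the element, markup included, became a column. The sketch below shows the same effect with a plain string; `rows_as_csv` and the sample review are hypothetical, made up for illustration.

```python
import csv
import io


def rows_as_csv(rows):
    """Write each row with csv.writer and return the resulting CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


review = "Lovely"  # hypothetical review text

# A bare string is an iterable of characters, so every character
# ends up in its own column.
split_up = rows_as_csv([review])

# Wrapping the string in a list gives one row with a single column.
one_cell = rows_as_csv([[review]])

print(repr(split_up))
print(repr(one_cell))
```

So `writer.writerow([review_text])` writes one review per row, in one field.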
1 Comment

Thank you! That was awesome. :)
