How do I write web-scraped text into a CSV file using Python?

Question

I've been working on a practice web scraper that collects written reviews and writes them to a CSV file, with each review on its own row. I'm having two problems with it:

1) I can't strip out the HTML and get only the text (i.e. the written review and nothing else).
2) There is a lot of stray whitespace between and within my review text (e.g. a blank row between lines).

Thanks for your help!

Code below:

#! python3

import bs4, os, requests, csv

# Get URL of the page

URL = ('https://www.tripadvisor.com/Attraction_Review-g294265-d2149128-Reviews-Gardens_by_the_Bay-Singapore.html')

# Looping until the 5th page of reviews

pagecounter = 0
while pagecounter != 5:

    # Request get the first page
    res = requests.get(URL)
    res.raise_for_status

    # Download the html of the first page
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    reviewElems = soup.select('.partial_entry')


    if reviewElems == []:
        print('Could not find clue.')

    else:
        #for i in range(len(reviewElems)):
            #print(reviewElems[i].getText())

        with open('GardensbytheBay.csv', 'a', newline='') as csvfile:

            for row in reviewElems:
                writer = csv.writer(csvfile, delimiter=' ', quoting=csv.QUOTE_ALL)
                writer.writerow(row)
            print('Writing page')

    # Find URL of next page and update URL
    if pagecounter == 0:
        nextLink = soup.select('a[data-offset]')[0]

    elif pagecounter != 0:
        nextLink = soup.select('a[data-offset]')[1]

    URL = 'http://www.tripadvisor.com' + nextLink.get('href')
    pagecounter += 1

print('Download complete')
csvfile.close()

2) Browsers ignore extra whitespace when they display HTML, so the people who create web pages don't worry about it either. They (or the functions that generate the page) add spaces to make the HTML more readable during development, and browsers skip those spaces when they display the page.
– furas, Commented Oct 14, 2016 at 8:44
You might want to take a look at: stackoverflow.com/questions/328356/…
– Jon Betts, Commented Oct 14, 2016 at 9:53
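
The whitespace point in the first comment can be handled after the text has been extracted. The following is a minimal sketch, not part of the original thread: the helper name `collapse_whitespace` and the sample string are made up for illustration. It relies on `str.split()` with no arguments, which splits on any run of whitespace.

```python
def collapse_whitespace(text):
    """Replace every run of spaces, tabs and newlines with a single space."""
    # split() with no arguments drops leading/trailing whitespace and
    # treats consecutive whitespace characters as one separator.
    return " ".join(text.split())


# Hypothetical review text with the kind of spacing found in page source.
raw = "\n   The gardens   were\n\n  stunning at night.\t "
print(collapse_whitespace(raw))
```

Applied to each review string before it is written out, this would remove blank lines and padding inside a review while keeping the words in order.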

Martin Evans · Accepted Answer · 2016-10-18 13:02:57Z

1

You can use row.get_text(strip=True) to get the text from your selected p.partial_entry. Try the following:

import bs4, requests, csv

# Get URL of the page
URL = ('https://www.tripadvisor.com/Attraction_Review-g294265-d2149128-Reviews-Gardens_by_the_Bay-Singapore.html')

with open('GardensbytheBay.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ')

    # Loop over the first 5 pages of reviews
    for pagecounter in range(5):

        # Request the current page and raise an exception on an HTTP error
        res = requests.get(URL)
        res.raise_for_status()

        # Parse the HTML of the current page
        soup = bs4.BeautifulSoup(res.text, "html.parser")
        reviewElems = soup.select('p.partial_entry')

        if reviewElems:
            for row in reviewElems:
                review_text = row.get_text(strip=True).encode('utf8', 'ignore').decode('latin-1')
                writer.writerow([review_text])
            print('Writing page', pagecounter + 1)
        else:
            print('Could not find clue.')

        # Find URL of next page and update URL
        if pagecounter == 0:
            nextLink = soup.select('a[data-offset]')[0]
        else:
            nextLink = soup.select('a[data-offset]')[1]

        URL = 'http://www.tripadvisor.com' + nextLink.get('href')

print('Download complete')
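
The code above passes `[review_text]` to `writer.writerow` rather than the element or the bare string. A small sketch of why that matters, using only the standard library: `writerow` iterates over whatever it is given, so a bare string becomes one field per character. The sample reviews and the in-memory `io.StringIO` buffers are stand-ins chosen for illustration, and the default comma delimiter is used here instead of the space delimiter in the answer.

```python
import csv
import io

# Hypothetical review texts standing in for the scraped reviews.
reviews = ["Great gardens, lovely at night.", "Lovely"]

# Passing a bare string: csv treats each character as a separate field.
wrong = io.StringIO()
csv.writer(wrong).writerow(reviews[1])

# Wrapping the string in a list: one field, one row per review.
right = io.StringIO()
writer = csv.writer(right)
for text in reviews:
    writer.writerow([text])

print(repr(wrong.getvalue()))
print(repr(right.getvalue()))
```

When writing to a real file instead of a buffer, passing `encoding='utf-8'` to `open()` alongside `newline=''` is one way to keep non-ASCII characters in the reviews intact.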


1 Comment

Domoman Over a year ago

Thank you! That was awesome. :)
