
I'm trying to scrape some data for airlines from the following website: http://www.airlinequality.com/airline-reviews/airasia-x.

I managed to get the data I need, but I am struggling with the pagination on the web page. I'm trying to get all the review titles (not only the ones on the first page).

The page links follow the format http://www.airlinequality.com/airline-reviews/airasia-x/page/3/, where 3 is the page number.
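Since the URLs follow a predictable pattern, one workaround (a minimal sketch, assuming the total page count is already known; the count of 3 below is just a placeholder) is to generate the page URLs directly instead of following links:

```python
# Base review URL for the airline (from the question above)
BASE = 'http://www.airlinequality.com/airline-reviews/airasia-x'

def page_urls(num_pages):
    """Yield the review-page URLs for the given number of pages."""
    yield BASE  # page 1 has no /page/N/ suffix
    for n in range(2, num_pages + 1):
        yield f'{BASE}/page/{n}/'

urls = list(page_urls(3))
# ['...airasia-x', '...airasia-x/page/2/', '...airasia-x/page/3/']
```

The real page count would still have to be read from the site (e.g. from the pagination footer), which is what the accepted answer below handles by following the "next" link instead.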

I tried looping through these URLs, and also the following piece of code, but scraping through the pagination is not working.

# follow pagination links
for href in response.css('#main > section.layout-section.layout-2.closer-top > div.col-content > div > article > ul li a'):
    yield response.follow(href, self.parse)

How to solve this?

import scrapy
import re  # for text parsing
import logging
from scrapy.crawler import CrawlerProcess


class AirlineSpider(scrapy.Spider):
    name = 'airlineSpider'
    # page to scrape
    start_urls = ['http://www.airlinequality.com/review-pages/a-z-airline-reviews/']  

    def parse(self, response):
        # take each element in the list of the airlines

        for airline in response.css("div.content ul.items li"):
            # go inside the URL for each airline
            airline_url = airline.css('a::attr(href)').extract_first()

            # Call parse_airline
            next_page = airline_url
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse_article)

            # follow pagination links
            for href in response.css('#main > section.layout-section.layout-2.closer-top > div.col-content > div > article > ul li a'):
                yield response.follow(href, self.parse)

    # to go to the pages inside the links (for each airline) - the page where the reviews are
    def parse_article(self, response):
        yield {
            'appears_url': response.url,
            # collapse runs of whitespace (\n, \t, \r) into single spaces
            'title': re.sub(r'\s+', ' ', response.css('div.info [itemprop="name"]::text').extract_first()).strip(),
            'reviewTitle': response.css('div.body .text_header::text').extract(),
            #'total': response.css('#main > section.layout-section.layout-2.closer-top > div.col-content > div > article > div.pagination-total::text').extract_first().split(" ")[4],
        }


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEED_FORMAT': 'json',
    'FEED_URI': 'air_test.json'
})

# minimizing the information presented on the scrapy log
logging.getLogger('scrapy').setLevel(logging.WARNING)
process.crawl(AirlineSpider)
process.start()

To iterate through the airlines I solved it using the following code:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re

req = Request("http://www.airlinequality.com/review-pages/a-z-airline-reviews/", headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req)
soupAirlines = BeautifulSoup(html_page, "lxml")

URL_LIST = []
for link in soupAirlines.findAll('a', attrs={'href': re.compile("^/airline-reviews/")}):
    URL_LIST.append("http://www.airlinequality.com" + link.get('href'))

1 Answer
Assuming scrapy is not a hard requirement, the following BeautifulSoup code will get you all the reviews, with the metadata parsed out, collected into a final pandas DataFrame. The specific attributes pulled from each review include:

  • Review Title
  • Rating (when available)
  • Rating out of scale (i.e. out of 10)
  • Review full text
  • Date stamp of review
  • Whether or not the review is verified

A dedicated function handles the pagination. It is recursive: if there is a next page, the function calls itself to parse the new URL; otherwise the recursion ends.

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

# define global parameters
URL = 'http://www.airlinequality.com/airline-reviews/airasia-x'
BASE_URL = 'http://www.airlinequality.com'
MASTER_LIST = []

def parse_review(review):
    """
    Parse important review meta data such as ratings, time of review, title, 
    etc.

    Parameters
    -------
    review - beautifulsoup tag 

    Return 
    -------
    outdf - pd.DataFrame
        DataFrame representation of parsed review
    """

    # get review header
    header = review.find('h2').text

    # get the numerical rating
    base_review = review.find('div', {'itemprop': 'reviewRating'})
    if base_review is None:
        rating = None
        rating_out_of = None
    else:
        rating = base_review.find('span', {'itemprop': 'ratingValue'}).text
        rating_out_of = base_review.find('span', {'itemprop': 'bestRating'}).text

    # get time of review
    time_of_review = review.find('h3').find('time')['datetime']

    # get whether review is verified
    if review.find('em'):
        verified = review.find('em').text
    else:
        verified = None

    # get actual text of review
    review_text = review.find('div', {'class': 'text_content'}).text

    outdf = pd.DataFrame({'header': header,
                         'rating': rating,
                         'rating_out_of': rating_out_of,
                         'time_of_review': time_of_review,
                         'verified': verified,
                         'review_text': review_text}, index=[0])

    return outdf

def return_next_page(soup):
    """
    return next_url if pagination continues else return None

    Parameters
    -------
    soup - BeautifulSoup object - required

    Return 
    -------
    next_url - str or None if no next page
    """
    next_url = None
    cur_page = soup.find('a', {'class': 'active'}, href=re.compile('airline-reviews/airasia'))
    cur_href = cur_page['href']
    # check if next page exists
    search_next = cur_page.findNext('li').get('class')
    if not search_next:
        next_page_href = cur_page.findNext('li').find('a')['href']
        next_url = BASE_URL + next_page_href
    return next_url

def create_soup_reviews(url):
    """
    iterate over each review, extract out content, and handle next page logic 
    through recursion

    Parameters
    -------
    url - str - required
        input url
    """
    # use global MASTER_LIST to extend list of all reviews 
    global MASTER_LIST
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    reviews = soup.findAll('article', {'itemprop': 'review'})
    review_list = [parse_review(review) for review in reviews]
    MASTER_LIST.extend(review_list)
    next_url = return_next_page(soup)
    if next_url is not None:
        create_soup_reviews(next_url)


create_soup_reviews(URL)


finaldf = pd.concat(MASTER_LIST)
finaldf.shape # (339, 6)

finaldf.head(2)
# header    rating  rating_out_of   review_text time_of_review  verified
#"if approved I will get my money back" 1   10  ✅ Trip Verified | Kuala Lumpur to Melbourne. ...    2018-08-07  Trip Verified
#   "a few minutes error"   3   10  ✅ Trip Verified | I've flied with AirAsia man...    2018-08-06  Trip Verified

If I were to do the whole site, I would use the above and iterate over each airline here. I would modify the code to include a column named 'airline' so you know which airline each review corresponds to.
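A minimal sketch of that modification (the helper name `tag_with_airline` is mine, not part of the answer; it assumes the airline slug can be taken from the end of the review URL):

```python
import pandas as pd

def tag_with_airline(df, url):
    """Add an 'airline' column derived from the review-page URL,
    e.g. '.../airline-reviews/airasia-x' -> 'airasia-x'."""
    out = df.copy()
    out['airline'] = url.rstrip('/').rsplit('/', 1)[-1]
    return out

# Example: tag a parsed-review frame before extending MASTER_LIST
reviews = pd.DataFrame({'header': ['"great flight"', '"delayed"']})
tagged = tag_with_airline(
    reviews, 'http://www.airlinequality.com/airline-reviews/airasia-x')
# tagged['airline'] is 'airasia-x' for every row
```

Calling this inside `create_soup_reviews` on each `parse_review` result, before appending to `MASTER_LIST`, would label every review with its airline.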


3 Comments

Hello @datawrestler. Thanks for your help; BeautifulSoup is a great suggestion. Could you please help me iterate through all the airlines? I'm trying to get every airline URL and call create_soup_reviews for each of them, but I'm not able to construct the URL list of all the airlines automatically.
Thanks @datawrestler. I managed to solve it using the code above.
@onra that's great news, happy to help!
