I'm trying to make a web scraper to get data from the following website (I would later like to do it for several airlines on the same website): https://www.flightradar24.com/data/airlines/kl-klm/routes

I currently have the following code:

from bs4 import BeautifulSoup
import requests

airlines = ['kl-klm']

for a in airlines:
    url = 'https://www.flightradar24.com/data/airlines/' + a + '/routes'
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    print(soup)

This gives me the source code for the whole page, but I would like to extract a specific chunk of text within script tags, which is

var arrRoutes=[{"airport1":{"country":"Denmark","iata":"AAL","icao":"EKYT","lat":57.092781,"lon":9.849164,"name":"Aalborg Airport"},"airport2":{"country":"Netherlands","iata":"AMS","icao":"EHAM","lat":52.308609,"lon":4.763889,"name":"Amsterdam Schiphol Airport"}},{"airport1":{"country":"United Kingdom","iata":"ABZ","icao":"EGPD","lat":57.201939,"lon":-2.19777,"name":"Aberdeen International Airport"},"airport2":{"country":"Netherlands","iata":"AMS","icao":"EHAM","lat":52.308609,"lon":4.763889,"name":"Amsterdam Schiphol Airport"}}...

...etc. all the way until the end of the list.

How can I extract this in such a way that I can find the total number of inbound and outbound flights for each airport? For example, the total number of times Amsterdam Schiphol Airport appears as airport 1 or 2?

Is there a way to first extract the string from the HTML and then convert it into a Python list with dictionaries? Or would it make more sense to just directly count each element in the string?

2 Answers

3

You can convert the data to a Python list using ast.literal_eval. I wrote a simple function, find_airport(), that takes the data and an airport name and returns how many times that airport appears as airport1 and as airport2:

from bs4 import BeautifulSoup
import requests
import re
from ast import literal_eval
from pprint import pprint

airlines = ['kl-klm']

headers = {"Host":"www.flightradar24.com",
"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding":"gzip,deflate,br",
"User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}

def find_airport(data, name):
    # Count how often `name` appears as airport1 and as airport2
    airport_1, airport_2 = 0, 0
    for d in data:
        if d['airport1']['name'] == name:
            airport_1 += 1
        if d['airport2']['name'] == name:
            airport_2 += 1
    return airport_1, airport_2

for a in airlines:
    url = 'https://www.flightradar24.com/data/airlines/' + a + '/routes'
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.text, 'lxml')

    # Grab the array literal assigned to arrRoutes in the page's script text
    m = re.search(r'(?<=arrRoutes=)\[\{(.*?)\}\]', soup.text)
    l = literal_eval(m[0])
    pprint(l)

    print(find_airport(l, 'Amsterdam Schiphol Airport'))

Prints:

[{'airport1': {'country': 'Denmark',
               'iata': 'AAL',
               'icao': 'EKYT',
               'lat': 57.092781,
               'lon': 9.849164,
               'name': 'Aalborg Airport'},
  'airport2': {'country': 'Netherlands',
               'iata': 'AMS',
               'icao': 'EHAM',
               'lat': 52.308609,
               'lon': 4.763889,
               'name': 'Amsterdam Schiphol Airport'}},
 {'airport1': {'country': 'United Kingdom',
               'iata': 'ABZ',
               'icao': 'EGPD',
               'lat': 57.201939,
               'lon': -2.19777,
               'name': 'Aberdeen International Airport'},
  'airport2': {'country': 'Netherlands',
               'iata': 'AMS',
               'icao': 'EHAM',
               'lat': 52.308609,
               'lon': 4.763889,
               'name': 'Amsterdam Schiphol Airport'}},

...and so on

And at the end:

(147, 146)

for "Amsterdam Schiphol Airport": 147 occurrences as airport1 and 146 as airport2.
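
If you want the counts for every airport at once rather than one name at a time, collections.Counter is one option. This is only a sketch: count_airports is a hypothetical helper, and the sample list is two routes taken from the output above and trimmed to the name field, so the totals here are not the real site totals.

```python
from collections import Counter

def count_airports(routes):
    """Return two Counters keyed by airport name: one for airport1, one for airport2."""
    as_airport1 = Counter(r['airport1']['name'] for r in routes)
    as_airport2 = Counter(r['airport2']['name'] for r in routes)
    return as_airport1, as_airport2

# Two routes from the printed output, trimmed to the fields used here.
sample = [
    {'airport1': {'name': 'Aalborg Airport'},
     'airport2': {'name': 'Amsterdam Schiphol Airport'}},
    {'airport1': {'name': 'Aberdeen International Airport'},
     'airport2': {'name': 'Amsterdam Schiphol Airport'}},
]

first, second = count_airports(sample)
total = first + second  # adding Counters sums the counts per key
print(total['Amsterdam Schiphol Airport'])
```

With the real data you would pass the parsed list instead of sample; total.most_common() then lists the airports ordered by how often they appear.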

1 Comment

Fantastic, exactly what I was looking for. Thanks!
1

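Use re.compile to find the script tag that contains the arrRoutes variable, then strip the variable assignment from its text.

Example: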

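import re
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.flightradar24.com/data/airlines/kl-klm/routes')
soup = BeautifulSoup(page.text, 'html.parser')
# Find the <script> tag whose text contains the arrRoutes assignment
jData = soup.find("script", string=re.compile(r"var arrRoutes=")).string
print(jData.replace("var arrRoutes=", ""))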

Output:

[{"airport1":{"country":"Denmark","iata":"AAL","icao":"EKYT","lat":57.092781,"lon":9.849164,"name":"Aalborg Airport"},"airport2":{"country":"Netherlands","iata":"AMS","icao":"EHAM","lat":52.308609,"lon":4.763889,"name":"Amsterdam Schiphol Airport"}},{"airport1":{"country":"United Kingdom","iata":"ABZ","icao":"EGPD","lat":57.201939,"lon":-2.19777,"name":"Aberdeen International Airport"},"airport2":{"country":"Netherlands","iata":"AMS","icao":"EHAM","lat":52.308609,"lon":4.763889,"name":"Amsterdam Schiphol Airport"}},......
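
The printed string is still text, not a Python list. Since the array looks like JSON, one way to turn it into a list of dictionaries is the standard json module. The sketch below is an assumption-laden example: parse_routes is a hypothetical helper, and sample_script is a made-up stand-in for the script text (the trailing "var other=1;" is invented to show that anything after the array is ignored). JSONDecoder.raw_decode parses one JSON value starting at a given index and does not care what follows it.

```python
import json

def parse_routes(script_text, marker='var arrRoutes='):
    """Parse the JSON array that follows `marker` in the script text."""
    start = script_text.index(marker) + len(marker)
    # raw_decode does not skip leading whitespace, so jump to the opening bracket.
    start = script_text.index('[', start)
    routes, _end = json.JSONDecoder().raw_decode(script_text, start)
    return routes

# Hypothetical stand-in for the script contents, shortened to one route.
sample_script = (
    'var arrRoutes=[{"airport1":{"iata":"AAL","name":"Aalborg Airport"},'
    '"airport2":{"iata":"AMS","name":"Amsterdam Schiphol Airport"}}]; var other=1;'
)

routes = parse_routes(sample_script)
print(len(routes), routes[0]['airport2']['iata'])
```

On the real page you would pass the script tag's string instead of sample_script.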