
I am trying to get a time series from this website into Python: http://www.boerse-frankfurt.de/en/etfs/db+x+trackers+msci+world+information+technology+trn+index+ucits+etf+LU0540980496/price+turnover+history/historical+data#page=1

I've gotten pretty far, but I don't know how to get all the data rather than just the first 50 rows shown on the page. To view the rest online, you have to click through the result pages at the bottom of the table. I would like to be able to specify a start and end date in Python and get all the corresponding dates and prices in a list. Here is what I have so far:

 from bs4 import BeautifulSoup
 import requests
 import re

 url = 'http://www.boerse-frankfurt.de/en/etfs/db+x+trackers+msci+world+information+technology+trn+index+ucits+etf+LU0540980496/price+turnover+history/historical+data'
 soup = BeautifulSoup(requests.get(url).text, 'lxml')  # use lxml as the parser

 dates  = soup.find_all('td', class_='column-date')
 dates  = [re.sub(r'\s', '', d.string) for d in dates]   # strip all whitespace
 prices = soup.find_all('td', class_='column-price')
 prices = [re.sub(r'\s', '', p.string) for p in prices]

1 Answer
You need to loop through the rest of the pages, which you can do with POST requests. The server expects each POST request to carry the form structure defined below in values; the page number is the 'page' parameter of that structure. The structure also has several parameters I have not tested but that could be interesting to try, such as 'items_per_page', 'max_time' and 'min_time'. Here is an example:

from bs4 import BeautifulSoup
import urllib  # Python 2; see the comments below for Python 3
import re

url = 'http://www.boerse-frankfurt.de/en/parts/boxes/history/_histdata_full.m'
values = {'COMPONENT_ID':'PREeb7da7a4f4654f818494b6189b755e76', 
    'ag':'103708549', 
    'boerse_id': '12',
    'include_url': '/parts/boxes/history/_histdata_full.m',
    'item_count': '96',
    'items_per_page': '50',
    'lang': 'en',
    'link_id': '',
    'max_time': '2014-09-20',
    'min_time': '2014-05-09',
    'page': 1,
    'page_size': '50',
    'pages_total': '2',
    'secu': '103708549',
    'template': '0',
    'titel': '',
    'title': '',
    'title_link': '',
    'use_external_secu': '1'}

dates = []
prices = []
while True:
    data = urllib.urlencode(values)
    request = urllib.urlopen(url, data)  # passing data makes this a POST
    soup = BeautifulSoup(request.read())
    temp_dates  = soup.find_all('td', class_='column-date')
    temp_dates  = [re.sub(r'\s', '', d.string) for d in temp_dates]
    temp_prices = soup.find_all('td', class_='column-price')
    temp_prices = [re.sub(r'\s', '', p.string) for p in temp_prices]
    if not temp_prices:  # an empty page means we have gone past the last one
        break
    dates  += temp_dates
    prices += temp_prices
    values['page'] += 1
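Since the original goal was a user-chosen date range, the untested min_time and max_time parameters mentioned above look like the natural place to plug it in. This is only a guess from the parameter names, not verified server behaviour:

```python
# Hypothetical use of the untested min_time/max_time fields; the names
# suggest a server-side date filter, but this has not been verified.
values = {'page': 1, 'min_time': '', 'max_time': ''}  # stand-in for the full dict above

start_date, end_date = '2014-05-09', '2014-09-20'  # YYYY-MM-DD, matching the dict
values['min_time'] = start_date
values['max_time'] = end_date
values['page'] = 1  # restart the paging loop from the first page
```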


Thanks a lot, this looks like exactly what I'm looking for. Two questions, though: do you know how to get this to work in Python 3? I've used data = urllib.parse.urlencode(values); request = urllib.request.urlopen(url, data.encode('ascii')); soup = BeautifulSoup(request.read()), but that doesn't work (I am getting the same dates and prices over and over and the loop never terminates). Also, how did you come up with the values dict in the first place?
You can find examples of POST requests using Python 3 and urllib here. I think you need to create a Request object first: data = urllib.parse.urlencode(values); request = urllib.request.Request(url, data); response = urllib.request.urlopen(request); soup = BeautifulSoup(response.read()). I extracted the dict values using FireBug, a Firefox extension that lets you see the contents of the HTTP requests in your browser.
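The Python 3 translation discussed in these comments can be sketched as follows; the url is the endpoint from the answer, the values dict is trimmed for brevity, and whether the server still accepts this form is untested. The actual urlopen call is omitted so the sketch stays offline:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint and a trimmed version of the values dict from the answer above.
url = 'http://www.boerse-frankfurt.de/en/parts/boxes/history/_histdata_full.m'
values = {'page': 1, 'page_size': '50', 'lang': 'en'}

data = urlencode(values).encode('ascii')  # urlopen requires bytes in Python 3
request = Request(url, data)              # non-None data makes this a POST

# urllib.request.urlopen(request) would send it, and incrementing
# values['page'] before re-encoding would advance the paging loop.
print(request.get_method())  # prints POST
```

Note that data must be re-encoded on every iteration after changing values['page']; reusing the old bytes is one way to end up fetching the same page forever, as described in the first comment.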
