
I am interested in extracting historical prices from this link: https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol=KEL

To do so, I am using the following code:

import time

import pandas as pd
import requests

URL = ('https://pakstockexchange.com/stock2/index_new.php'
       '?section=research&page=show_price_table_new&symbol={}')

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

symbols = ['HMIM', 'CWSM', 'DSIL', 'RAVT', 'PIBTL', 'PICT', 'PNSC', 'ASL',
           'DSL', 'ISL', 'CSAP', 'MUGHAL', 'DKL', 'ASTL', 'INIL']

t0 = time.time()

for symbol in symbols:
    # Fetch the price table page; the headers get the request past mod_security.
    r = requests.get(URL.format(symbol), headers=HEADERS)

    # The price history is the seventh table on the page.
    dfs = pd.read_html(r.text)
    df = dfs[6]

    # Drop the two header rows, label the columns, and index by date.
    df = df.iloc[2:]  # .ix is deprecated; use .iloc for positional slicing
    df.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume']
    df.set_index('Date', inplace=True)

    df.to_csv('/home/furqan/Desktop/python_data/{}.csv'.format(symbol),
              columns=['Open', 'High', 'Low', 'Close', 'Volume'],
              index_label='Date')
    print(symbol)

t1 = time.time()
print('exec time is', t1 - t0, 'seconds')

The above code extracts the data from the link, converts it into a pandas data frame, and saves it to a CSV file.

The problem is that it takes a lot of time and does not scale well as the number of symbols grows. Can anyone suggest a more efficient way to achieve the same result?

Moreover, is there any other programming language that would do the same job in less time?

  • I would guess that a decent part of the time is spent in blocking GET requests. What happens if you try to run the requests asynchronously, e.g. with requests-futures? Commented Apr 18, 2017 at 16:05
  • Not on my usual PC, downloading some prerequisites to test :) Commented Apr 18, 2017 at 16:11
  • I am new to programming, so it will take me some time to try running the requests asynchronously. I am going through the documentation. Commented Apr 18, 2017 at 16:17
  • I'm running the test now, you're right, it's slow. Will hopefully have results in 10 mins or so and will write an answer if it works. Commented Apr 18, 2017 at 16:18
  • I tried using the pandas read_html function directly, but due to mod_security it gives me HTTP error 403. Is there any way to pass a user agent to that function? I couldn't find one in the documentation (see the sketch below). Commented Apr 18, 2017 at 16:28
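
Regarding that last comment: read_html (at least in the pandas versions current at the time) does not expose a way to set request headers, so the usual workaround, and what the question code already does, is to fetch the page with requests, where custom headers are supported, and hand the HTML text to pandas. A minimal sketch using the same URL and header dict as the question:

import pandas as pd
import requests

header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}
url = ('https://pakstockexchange.com/stock2/index_new.php'
       '?section=research&page=show_price_table_new&symbol=KEL')

r = requests.get(url, headers=header)  # headers are set on the request itself
r.raise_for_status()                   # a mod_security 403 would raise here
tables = pd.read_html(r.text)          # parse tables from the fetched HTML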

1 Answer


Normal GET requests with requests are "blocking": one request is sent, one response is received, and then it is processed. At least some portion of your processing time is spent waiting for responses. With requests-futures we can instead send all the requests asynchronously and collect the responses as they become ready.

That said, I think DSIL is timing out or something similar (I need to look into it further). While I was able to get a decent speedup with a random selection from symbols, both methods took approximately the same time whenever DSIL was in the list.

EDIT: Seems I lied, it was just an unfortunate coincidence with "DSIL" on multiple occasions. The more symbols you have in symbols, the greater the speedup of the async method over standard requests.

import time

import requests
from requests_futures.sessions import FuturesSession

URL = ('https://pakstockexchange.com/stock2/index_new.php'
       '?section=research&page=show_price_table_new&symbol={}')

symbols = ['HMIM', 'CWSM', 'RAVT', 'ASTL', 'INIL']

header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

# Synchronous baseline: each request blocks until its response arrives
start_sync = time.time()
for symbol in symbols:
    r = requests.get(URL.format(symbol), headers=header)
end_sync = time.time()

# Asynchronous version: fire all requests at once, then collect the results
start_async = time.time()

# Setup
session = FuturesSession(max_workers=10)

# Gather request URLs
pooled_requests = [URL.format(symbol) for symbol in symbols]

# Fire the requests; each get() returns a future immediately
fire_requests = [session.get(url, headers=header) for url in pooled_requests]
responses = [future.result() for future in fire_requests]

end_async = time.time()

print("Synchronous requests took: {}".format(end_sync - start_sync))
print("Async requests took:       {}".format(end_async - start_async))

In the above code, I get a 3x speedup in fetching the responses. You can iterate through the responses list and process each one as normal.
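
For reference, requests-futures is a thin wrapper around the standard library's concurrent.futures, so a similar pattern is possible without the extra dependency. A minimal sketch, assuming the same symbols and header as above:

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch(symbol):
    # One blocking GET per symbol; the pool runs several of them concurrently.
    url = ('https://pakstockexchange.com/stock2/index_new.php'
           '?section=research&page=show_price_table_new&symbol={}').format(symbol)
    return requests.get(url, headers=header)

with ThreadPoolExecutor(max_workers=10) as executor:
    responses = list(executor.map(fetch, symbols))  # results come back in input order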

EDIT 2: Going through the responses of the async requests and saving them as you did earlier:

import pandas as pd

for symbol, r in zip(symbols, responses):
    # Same parsing as the original loop: seventh table, drop the header rows.
    dfs = pd.read_html(r.text)
    df = dfs[6]
    df = df.iloc[2:]  # .ix is deprecated; use positional slicing instead
    df.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume']
    df.set_index('Date', inplace=True)
    df.to_csv('/home/furqan/Desktop/python_data/{}.csv'.format(symbol),
              columns=['Open', 'High', 'Low', 'Close', 'Volume'],
              index_label='Date')
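
Since DSIL looked like it might be timing out, it may also be worth collecting the futures defensively; result() on a concurrent.futures future accepts a timeout. A sketch under that assumption (the 30-second limit is arbitrary):

results = []
for symbol, future in zip(symbols, fire_requests):
    try:
        r = future.result(timeout=30)  # wait at most 30 s per response (arbitrary limit)
        r.raise_for_status()           # surface HTTP errors such as a 403
        results.append((symbol, r))    # keep symbol/response pairs aligned despite failures
    except Exception as exc:
        print('{} failed: {}'.format(symbol, exc))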

4 Comments

Nice job. It's much faster now, but I can't save the responses into a data frame using the asynchronous method.
@FurqanHashim I have edited regarding the DSIL tag. There should be nothing stopping you from writing it as normal. Let me check and edit.
@FurqanHashim Please see Edit 2
Man that's awesome!
