Python Web Scraping Problems

Question

I am using Python to scrape AAPL's stock price from Yahoo Finance, but the program always returns []. I would appreciate it if someone could point out why the program is not working. Here is my code:

import urllib
import re
htmlfile=urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
htmltext=htmlfile.read()
regex='<span id=\"yfs_l84_aapl\" class="">(.+?)</span>'
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price

The original source is like this:

<span id="yfs_l84_aapl" class>112.31</span>

Here I just want the price, 112.31. When I copy and paste the source, 'class' changes to 'class=""'. I also tried this pattern:

regex='<span id=\"yfs_l84_aapl\" class="">(.+?)</span>'

But it does not work either.

Why not use a proper DOM parser and the .getElementByID('yfs_l84_aapl')? That would be more appropriate than trying to use regex to parse HTML/XML...
– David Zemens, Commented Sep 9, 2015 at 0:59
Thank you for your comment. I am just a beginner and I will definitely try your code.
– Allen, Commented Sep 9, 2015 at 2:45
Cheers. Although not specific to Python, this discusses in detail why RegEx is ill-suited for the task. For a very simple case like yours, with no traversing and a relatively known structure, regex is probably OK. But then again, the id attribute is a unique identifier, so there's no need for RegEx or even DOM "parsing" if the elements can be uniquely identified :)
– David Zemens, Commented Sep 9, 2015 at 2:58
There are also some APIs available here which I have not used, but which most likely return the data in XML or JSON format, both of which are widely supported by Python. Again, it's better than trying to read a web page source and parse the HTML :) Good luck!
– David Zemens, Commented Sep 9, 2015 at 2:59

Shawn Mehan · Accepted Answer · 2015-09-09 00:51:38Z

5

Well, the good news is that you are getting the data. You were nearly there. I would recommend that you work out your regex problems in a tool that helps, e.g. regex101.

Anyway, here is your working regex:

regex='<span id="yfs_l84_aapl">(\d*\.\d\d)'

You are collecting only digits, so don't use a general catch-all; be specific where you can. This pattern matches a run of digits, then a literal decimal point, then exactly two more digits.
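If you want to check the pattern without hitting the network, here is a minimal offline sketch in Python 3. It assumes the page contains the span exactly as quoted in this thread; the `html` string is sample markup for illustration, not a live response.

```python
import re

# Sample markup copied from this thread; the live page may differ.
html = '<span id="yfs_l84_aapl">112.31</span>'

# Zero or more digits, a literal dot, then exactly two digits.
# A raw string keeps the backslashes intact for the regex engine.
pattern = re.compile(r'<span id="yfs_l84_aapl">(\d*\.\d\d)')

# findall returns a list of the captured groups.
prices = pattern.findall(html)
print(prices)  # ['112.31']
```

Note that findall returns a list of strings, so you would still need float(prices[0]) to do arithmetic on the price.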

1 Comment

Allen Over a year ago

Thank you very much for your suggestion. I tried your code and it works well! I am a beginner to Python and there is a lot for me to learn.

Pyrogrammer · Answer · 2015-09-09 00:58:47Z

2

When I went to the Yahoo site you provided, I saw a span tag without a class attribute:

<span id="yfs_l84_aapl">112.31</span>

Not sure what you are trying to do with "class". Without that, I get 112.31:

import urllib
import re
htmlfile=urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
htmltext=htmlfile.read()
regex='<span id=\"yfs_l84_aapl\">(.+?)</span>'
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price
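To see why the original pattern returned [], here is a small offline comparison in Python 3. It assumes the served markup is the span quoted above, with no class attribute. The `tolerant` pattern is my own illustration and not something from the question: it uses `[^>]*` to skip any extra attributes, so it matches with or without `class=""`.

```python
import re

# Markup as quoted in this answer (no class attribute); sample data only.
html = '<span id="yfs_l84_aapl">112.31</span>'

# Pattern from the question: it demands a literal class="" that is not there.
strict = re.compile(r'<span id="yfs_l84_aapl" class="">(.+?)</span>')

# Hypothetical tolerant pattern: [^>]* skips any attributes after the id.
tolerant = re.compile(r'<span id="yfs_l84_aapl"[^>]*>(.+?)</span>')

strict_result = strict.findall(html)
tolerant_result = tolerant.findall(html)
with_class = tolerant.findall('<span id="yfs_l84_aapl" class="">112.31</span>')

print(strict_result)    # []
print(tolerant_result)  # ['112.31']
print(with_class)       # ['112.31']
```

The browser's "copy element" view can normalize a bare class into class="", which is probably why the pasted markup did not match what urllib actually downloaded.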

2 Comments

Allen Over a year ago

Yes, when I went to Yahoo this time, I got the same span tag as you did. I am not sure why I got the other span tag this afternoon. Thanks for your help!

Pyrogrammer Over a year ago

No problem. Have fun with the project XD

galaxyan · Answer · 2015-09-09 01:33:35Z

1

I am using BeautifulSoup to get the text from the span tag:

import urllib
from BeautifulSoup import BeautifulSoup

response =urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
html = response.read()
soup = BeautifulSoup(html)
# find all spans whose id is 'yfs_l84_aapl'
target = soup.findAll('span',{'id':"yfs_l84_aapl"})
# target is a list, so take the first match
print(target[0].string)
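If you would rather not install anything, the same "look it up by id" idea can be sketched with Python 3's standard-library html.parser. This is a swap-in for illustration, not BeautifulSoup itself; the class name `SpanById` and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class SpanById(HTMLParser):
    """Collect the text inside the first <span> with a given id (hypothetical helper)."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False   # True while we are inside the target span
        self.found = False    # True once the target span has been seen
        self.parts = []       # text chunks collected from the target span

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a bare attribute has value None.
        if tag == "span" and not self.found and dict(attrs).get("id") == self.target_id:
            self.inside = True
            self.found = True

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False


# Sample markup based on the question, including the bare class attribute.
sample = '<div><span id="other">1.00</span><span id="yfs_l84_aapl" class>112.31</span></div>'

parser = SpanById("yfs_l84_aapl")
parser.feed(sample)
price = "".join(parser.parts)
print(price)  # 112.31
```

Because the lookup goes by id instead of matching the whole tag, it does not matter whether the span carries class, class="", or no class at all.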
