I am using Python to scrape AAPL's stock price from Yahoo Finance, but the program always returns []. I would appreciate it if someone could point out why the program is not working. Here is my code:

import urllib
import re
htmlfile=urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
htmltext=htmlfile.read()
regex='<span id=\"yfs_l84_aapl\" class="">(.+?)</span>'
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price

The original source is like this:

<span id="yfs_l84_aapl" class>112.31</span>

Here I just want the price, 112.31. When I copied and pasted the source, I found that 'class' had changed to 'class=""'. I also tried this code:

regex='<span id=\"yfs_l84_aapl\" class="">(.+?)</span>'

But it does not work either.

  • Why not use a proper DOM parser and the .getElementByID('yfs_l84_aapl')? That would be more appropriate than trying to use regex to parse HTML/XML... Commented Sep 9, 2015 at 0:59
  • Thank you for your comment. I am just a beginner and I will definitely try your code. Commented Sep 9, 2015 at 2:45
  • Cheers. Although not specific to python, this discusses in detail why RegEx is ill-suited for the task. For a very simple case like yours, with no traversing, and a relatively known structure, regex is probably OK. But then again, the id attribute is a unique identifier so there's no need for RegEx or even DOM "parsing" if the elements can be uniquely identified :) Commented Sep 9, 2015 at 2:58
  • There are also some API available here which I have not used, but which most likely return the data in XML or JSON format which are widely supported by python. Again, it's better than trying to read a web page source and parse the HTML :) Good luck !! Commented Sep 9, 2015 at 2:59

3 Answers

Well, the good news is that you are getting the data. You were nearly there. I would recommend that you work out your regex problems in a tool that helps, e.g. regex101.

Anyway, here is your working regex:

regex='<span id="yfs_l84_aapl">(\d*\.\d\d)'

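You are collecting only digits, so avoid the catch-all (.+?) and be specific where you can. This pattern matches any number of digits, then a literal decimal point, then exactly two more digits. It also drops the class attribute, which is not present in the page source.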
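To see the difference without hitting the network, here is a small sketch that runs both patterns against the span markup quoted in this thread. The sample string is an assumption standing in for the live page, which may look different today, and it is written for Python 3 (so print is a function).

```python
import re

# Sample markup copied from this thread; the live page may differ.
html = '<span id="yfs_l84_aapl">112.31</span>'

# The question's pattern requires class="", which this markup lacks.
original = re.compile(r'<span id="yfs_l84_aapl" class="">(.+?)</span>')

# The pattern from this answer: digits, a literal dot, exactly two digits.
specific = re.compile(r'<span id="yfs_l84_aapl">(\d*\.\d\d)')

no_match = original.findall(html)
prices = specific.findall(html)

print(no_match)  # []
print(prices)    # ['112.31']
```

The first pattern returns an empty list because the literal text class="" never appears in the markup, which is the behaviour described in the question.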

1 Comment

Thank you very much for your suggestion. I tried your code and it works well! I am a beginner to Python and there is a lot for me to learn.

When I went to the Yahoo page you provided, I saw a span tag without a class attribute.

<span id="yfs_l84_aapl">112.31</span>

Not sure what you are trying to do with "class." Without that I get 112.31

import urllib
import re
htmlfile=urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
htmltext=htmlfile.read()
regex='<span id=\"yfs_l84_aapl\">(.+?)</span>'
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price
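Since the span was seen both with and without a class attribute, one option is a pattern that tolerates either form. The following is only a sketch in Python 3 against hand-written sample strings (the three variants are assumptions based on the markup quoted in this thread, not fetched from Yahoo):

```python
import re

# [^>]* skips any attributes (or none) after the id, up to the closing '>'.
pattern = re.compile(r'<span id="yfs_l84_aapl"[^>]*>(.+?)</span>')

# Variants of the tag mentioned in this thread.
variants = [
    '<span id="yfs_l84_aapl">112.31</span>',
    '<span id="yfs_l84_aapl" class>112.31</span>',
    '<span id="yfs_l84_aapl" class="">112.31</span>',
]

results = [pattern.findall(v) for v in variants]
for found in results:
    print(found)  # ['112.31'] for each variant
```

This still assumes that id is the first attribute of the tag; a parser is more robust if the attribute order can change.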

2 Comments

Yes, when I went to Yahoo this time, I got the same span tag as you did. I am not sure why I got the other span tag this afternoon. Thanks for your help!
No problem. Have fun with the project XD

I am using BeautifulSoup to get the text from the span tag:

import urllib
from BeautifulSoup import BeautifulSoup

response =urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
html = response.read()
soup = BeautifulSoup(html)
# find all spans whose id is 'yfs_l84_aapl'
target = soup.findAll('span',{'id':"yfs_l84_aapl"})
# target is a list of matching tags
print(target[0].string)
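The same lookup by id can be sketched with only the Python 3 standard library, which avoids installing anything. This is an illustration, not a replacement for BeautifulSoup: the SpanTextById class is a hypothetical helper written for this example, and the sample string is an assumption based on the markup quoted in the question.

```python
from html.parser import HTMLParser


class SpanTextById(HTMLParser):
    """Hypothetical helper: keep the text of the first element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False   # True right after the matching start tag
        self.text = None      # filled in once the text is seen

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a bare attribute has value None.
        if self.text is None and dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_data(self, data):
        if self.inside:
            self.text = data
            self.inside = False


# Sample markup with the bare "class" attribute seen in the question.
sample = '<div><span id="yfs_l84_aapl" class>112.31</span></div>'

parser = SpanTextById("yfs_l84_aapl")
parser.feed(sample)
parser.close()
print(parser.text)  # 112.31
```

Because the parser reads attributes rather than matching raw text, it gives the same result whether or not the class attribute is present. Note that in Python 3 the page itself would be fetched with urllib.request.urlopen rather than urllib.urlopen, and the current BeautifulSoup package is imported as "from bs4 import BeautifulSoup".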
