2

I'm new to Python.

I'm trying to scrape some information from a webpage. The first thing I would like to get is all the HTML between today's and yesterday's dates. Here is what I have so far:

import datetime
import urllib
import re

t = datetime.date.today()
t1 = t.strftime("%B %d, %Y")
y = datetime.date.today() - datetime.timedelta(1)
y1 = y.strftime("%B %d, %Y")

htmlfile = urllib.urlopen("http://www.blu-ray.com/itunes/movies.php?show=newreleases")
htmltext = htmlfile.read()

block1 = re.search(t1 + r'(.*)' + re.escape(y1), htmltext)
print block1

From what I can tell (and I'm probably wrong), my regex should grab what I want it to, so that I can then start pulling out info from today's date only. But it returns 'None'.

I'm sure that it's just my limited understanding as I am new to this but any help would be greatly appreciated. Thanks a lot!

The problem is that .* doesn't match line breaks unless you pass the re.DOTALL flag. But you really should use an HTML parser, like alecxe said. Commented Dec 17, 2014 at 20:38
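To illustrate the point about line breaks, here is a minimal sketch written for Python 3 (the thread's own code is Python 2). The `html` string is a made-up stand-in for the downloaded page, not the site's real markup, and both date strings are escaped for consistency.

```python
import re

# Hypothetical stand-in for the page: the two dates sit on different lines.
html = (
    "<h3>December 17, 2014</h3>\n"
    "<table>App (2013)</table>\n"
    "<h3>December 16, 2014</h3>"
)
today = "December 17, 2014"
yesterday = "December 16, 2014"

pattern = re.escape(today) + r"(.*?)" + re.escape(yesterday)

# Without a flag, "." stops at "\n", so nothing spans the two dates.
no_flag = re.search(pattern, html)

# With re.DOTALL, "." also matches newlines and the search succeeds.
with_flag = re.search(pattern, html, re.DOTALL)

print(no_flag)
print(with_flag.group(1))
```

This only explains why the original search returned None; it does not make a regex a robust way to pick apart HTML.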

2 Answers

2

Don't use regular expressions to parse HTML; use an HTML parser such as BeautifulSoup.

This takes a bit more code, but the idea is to iterate over all h3 elements whose text is a date in the specified format (%B %d, %Y), then collect the table tags that follow each one until we hit another h3 tag or run out of siblings:

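from datetime import datetime
import urllib
from bs4 import BeautifulSoup

# Download the page and parse it into a navigable tree
data = urllib.urlopen("http://www.blu-ray.com/itunes/movies.php?show=newreleases")
soup = BeautifulSoup(data)

def is_date(d):
    # True only if the text parses as a date such as "December 17, 2014"
    try:
        datetime.strptime(d, '%B %d, %Y')
        return True
    except (ValueError, TypeError):
        return False

# Each date heading is an h3 whose text passes is_date
for date in soup.find_all('h3', text=is_date):
    print date.text

    # Walk the following siblings; the next h3 starts another date's section
    for element in date.find_next_siblings(['h3', 'table']):
        if element.name == 'h3':
            break

        # Print the title attribute of the table's first link
        print element.a.get('title')
    print "----"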

Prints:

December 17, 2014
App (2013)
----
December 16, 2014
The Equalizer (2014)
Annabelle (2014)
A Walk Among the Tombstones (2014)
The Guest (2014)
Men, Women & Children (2014)
At the Devil's Door (2014)
The Canal (2014)
The Bitter Tears of Petra von Kant (1972)
Avatar (2009)
Atlas Shrugged Part III: Who Is John Galt? (2014)
Expelled (2014)
Level Five (1997)
The Device (2014)
Two-Bit Waltz (2014)
The Devil's Hand (2014)
----
December 15, 2014
Star Trek: The Next Generation, Season 6 (1992-1993)
Ristorante Paradiso, Season 1 (2009)
A Certain Magical Index II, Season 2, Pt. 2 (2011)
Cowboy Bebop, The Complete Series (1998-1999)
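The question asked only for today's and yesterday's entries, while `is_date` accepts any date heading. Below is a rough sketch of a narrower predicate, written for Python 3. The names `parse_heading` and `is_recent` and the `today` parameter are hypothetical and not part of the answer above.

```python
from datetime import date, datetime, timedelta

DATE_FORMAT = "%B %d, %Y"  # same format string used in the thread


def parse_heading(text):
    """Return a date if text looks like 'December 17, 2014', else None."""
    try:
        return datetime.strptime(text.strip(), DATE_FORMAT).date()
    except (ValueError, TypeError, AttributeError):
        # Not a date string, or not a string at all (e.g. None)
        return None


def is_recent(text, today=None):
    """True if the heading's date is today or yesterday."""
    today = today or date.today()
    parsed = parse_heading(text)
    return parsed in (today, today - timedelta(days=1))


# Fixed reference date so the example is repeatable
headings = ["December 17, 2014", "December 16, 2014",
            "December 15, 2014", "Not a date"]
recent = [h for h in headings if is_recent(h, today=date(2014, 12, 17))]
print(recent)
```

Assuming the headings on the page use this exact format, `is_recent` should be usable in place of `is_date` in the `find_all` call, since it also takes a single string and returns a boolean.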

Feel free to ask additional questions about the posted code - would be glad to explain.

0

Your code was throwing an error on t.strftime("%B %d, %Y").

The correct format for the line is t1 = strftime("%B %d, %Y", t)

I was also getting: TypeError: argument must be 9-item sequence, not datetime.datetime

From this error, you can search for many solutions. I don't know which version of Python you're using, but the solutions use the entire time, not just the date. So you probably need to get the time and subtract a day.
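For comparison, here is a small Python 3 sketch of the two `strftime` forms. As far as I know, the method on a `date` object takes the format string as its only argument, so the question's line is expected to run as written, whereas the module-level `time.strftime` wants a `struct_time` (or 9-item tuple) and raises a TypeError when handed a date or datetime. The exact error message differs between Python versions.

```python
import datetime
import time

today = datetime.date.today()

# Method form, as in the question: the date formats itself.
label = today.strftime("%B %d, %Y")

# Module-level form with a date object: expected to raise TypeError.
try:
    time.strftime("%B %d, %Y", today)
    raised = False
except TypeError:
    raised = True

# Converting with timetuple() gives time.strftime what it expects.
same_label = time.strftime("%B %d, %Y", today.timetuple())

print(label, raised, same_label)
```

If that holds on your interpreter, the date handling in the question is not what makes the search return None.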

See here: Extract time from datetime and determine if time (not date) falls within range?

And here: How can I generate POSIX values for yesterday and today at midnight in Python?
