Search HTML line by line with regex in Python

Question

I'm attempting to create a dictionary of hours based off of this calendar: http://disneyworld.disney.go.com/parks/magic-kingdom/calendar/

<td class="first"> <div class="dayContainer">
      <a href="/parks/magic-kingdom/calendardayview/?asmbly_day=20120401"> 
         <p class="day"> 1
         </p> <p class="moreLink">Park Hours<br />8:00 AM - 12:00 AM<br /><br/>Extra Magic Hours<br />7:00 AM - 8:00 AM<br /><br/>Extra Magic Hours<br />12:00 AM - 3:00 AM<br /><br/>
         </p> 
      </a> 
   </div>
</td>

Each of the calendar entries is on a single line, so I figured it would be best to just go through the HTML line by line, and if that line contains hours, add those hours to a dictionary for the corresponding date (some days have multiple hour entries).

import urllib
import re
source = urllib.urlopen('http://disneyworld.disney.go.com/parks/magic-kingdom/calendar/')
page = source.read()
prkhrs = {}

def main():
    parsehours()

def parsehours():
    #look for #:## AM - #:## PM                                                 
    date = r'201204\d{02}'
    hours = r'\d:0{2}\s\w{2}\s-\s\d:0{2}\s\w{2}'
    #go through page line by line                                               
    for line in page:
        times = re.findall(hours, line)
        dates = re.search(date, line)
        if dates:
            start = dates.start()
            end = dates.end()
            curdate = line[start:end]
        #if #:## - #:## is found, a date has been found                         
        if times:
            #create dictionary from date, stores hours in variable              
            #extra magic hours(emh) are stored in same format.                  
            #if entry has 2/3 hour listings, those listings are emh             
            prkhrs[curdate]['hours'] = times
    #just print hours for now. will change later                                
    print prkhrs

The problem I encounter is that when I put 'print line' inside the for loop that goes through the page, it prints it out a character at a time, which I'm assuming is what's messing things up.

Right now, the 'print prkhrs' just prints nothing, but using re.findall for both the dates and the hours prints out the correct times, so I know the regex works. Any suggestions on how I can get it to work?

@Mimisbrunnr, This doesn't appear to be using regex to parse HTML despite the title and tags.
– aaronasterling, Commented Apr 4, 2012 at 21:52
@aaronasterling - westbyb is parsing data out of the document via use of regex. He may not be directly parsing the HTML itself, but it would be much easier to turn the HTML into a representative object structure and pull the data out from there.
– zellio, Commented Apr 4, 2012 at 21:56
@Mimisbrunnr - how would you recommend I go about doing it that way? Regex is really the only way I know of going about this, at least for now.
– westbyb, Commented Apr 4, 2012 at 22:07

Whatang · Accepted Answer · 2012-04-04 21:49:19Z

6

Change page = source.read() to page = source.readlines()

source.read() returns the whole page as one big string. Iterating over a string (as when you do for line in page) returns one character at a time. Just because your variables are called line and page doesn't mean Python knows what you want.

source.readlines() returns a list of strings, each of which is a line from the page.
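
The difference can be seen without touching the network. This is a minimal sketch in Python 3 (the question's code is Python 2), using io.StringIO as a stand-in for the file-like object that urlopen returns. The sample text and the variable names are made up for illustration.

```python
import io

# Hypothetical two-line stand-in for the downloaded page.
SAMPLE = "<td>first row</td>\n<td>second row</td>\n"

# read() returns one big string, so iterating over it yields single characters.
as_string = io.StringIO(SAMPLE).read()
chars = [item for item in as_string]

# readlines() returns a list of strings, so iterating over it yields whole lines.
as_lines = io.StringIO(SAMPLE).readlines()

print(chars[:4])   # the first four items are individual characters
print(as_lines)    # each item is a complete line, newline included
```

Each line returned by readlines() keeps its trailing newline, which does not matter for re.search or re.findall but may matter if the lines are printed or compared.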


2 Comments

Whatang Over a year ago

FWIW, I agree with the commenters on the question that rather than using a regex in this way, you'd be better off parsing the HTML properly. But, for the question you actually asked this is a valid answer.

westbyb Over a year ago

That did it! I ran into some other issues, but that fixed the problem at hand. Thanks!
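
For the suggestion in the comments to parse the HTML rather than match it line by line, one possible approach is sketched below. It is a Python 3 sketch using the standard library's html.parser, run against the snippet from the question rather than the live page. The class name HoursParser, the two regexes, and the assumption that every day cell follows the same a/p.moreLink layout are illustrative and not confirmed by the original page.

```python
import re
from html.parser import HTMLParser

# Snippet copied from the question; the live page may differ.
SAMPLE = '''<td class="first"> <div class="dayContainer">
      <a href="/parks/magic-kingdom/calendardayview/?asmbly_day=20120401">
         <p class="day"> 1
         </p> <p class="moreLink">Park Hours<br />8:00 AM - 12:00 AM<br /><br/>Extra Magic Hours<br />7:00 AM - 8:00 AM<br /><br/>Extra Magic Hours<br />12:00 AM - 3:00 AM<br /><br/>
         </p>
      </a>
   </div>
</td>'''

# Assumed patterns: the date lives in the link's query string,
# and hours look like "8:00 AM - 12:00 AM".
DATE_RE = re.compile(r'asmbly_day=(\d{8})')
HOURS_RE = re.compile(r'\d{1,2}:\d{2} [AP]M - \d{1,2}:\d{2} [AP]M')


class HoursParser(HTMLParser):
    """Collects hour ranges per date from calendar cells (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hours = {}            # date string -> list of hour ranges
        self.current_date = None   # date of the <a> element being read
        self.in_more_link = False  # True while inside <p class="moreLink">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a':
            match = DATE_RE.search(attrs.get('href') or '')
            self.current_date = match.group(1) if match else None
        elif tag == 'p':
            self.in_more_link = attrs.get('class') == 'moreLink'

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_more_link = False
        elif tag == 'a':
            self.current_date = None

    def handle_data(self, data):
        # Only text inside the hours paragraph of a dated link is of interest.
        if self.in_more_link and self.current_date:
            found = HOURS_RE.findall(data)
            if found:
                self.hours.setdefault(self.current_date, []).extend(found)


parser = HoursParser()
parser.feed(SAMPLE)
parser.close()
print(parser.hours)
```

Because the parser tracks which element it is inside, the result no longer depends on each calendar entry sitting on a single line. Using setdefault also avoids the KeyError that assigning to prkhrs[curdate]['hours'] would raise for a date that has no entry yet.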
