
First off, the HTML row looks like this:

<tr class="evenColor"> blahblah TheTextIneed blahblah and ends with </tr>

I would show the real HTML, but I'm sorry to say I don't know how to format it as a code block. Feels shameful.

Using BeautifulSoup (Python) or any other recommended screen-scraping/parsing method, I would like to convert about 1200 .htm files in the same directory into CSV format. This will eventually go into an SQL database. Each directory represents a year, and I plan to do at least 5 years.

Based on some advice, I have been goofing around with glob as the best way to loop over the files. This is what I have so far, and I am stuck:

import glob
from BeautifulSoup import BeautifulSoup

for filename in glob.glob('/home/phi/data/NHL/pl0708/pl02*.htm'):
#these files go from pl020001.htm to pl021230.htm sequentially
    soup = BeautifulSoup(open(filename["r"]))
    for row in soup.findAll("tr", attrs={ "class" : "evenColor" })

I realize this is ugly, but it's my first time attempting anything like this. This one problem has taken me months to get to, after realizing that I don't have to manually go through thousands of files copying and pasting into Excel. I have also realized that I can kick my computer repeatedly out of frustration and it still works (not recommended). I am getting close, and I need to know what to do next to make those CSV files. Please help, or my monitor finally gets hammer punched.

  • btw thanks to MYYN for the help before. I gave up, and now I am back with a (hopefully) clearer, more specific question. Commented Jul 6, 2009 at 9:53
  • To show code, indent it 4 spaces and it will be automatically escaped for you. Commented Jul 6, 2009 at 9:53

3 Answers


You need to import the csv module by adding import csv to the top of your file.

Then you'll need to create a CSV writer outside your loop over the rows, like so:

writer = csv.writer(open("%s.csv" % filename, "wb"))

Then, inside your loop, you need to actually pull the data out of each HTML row, similar to:

values = [''.join(td.findAll(text=True)) for td in row.findAll("td")]
writer.writerow(values)
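
Putting those pieces together, here is a minimal sketch of the whole script (assuming Python 2 with BeautifulSoup 3, and assuming each row's values live in <td> cells; note that "%s.csv" % filename just appends .csv to the full .htm name):

import csv
import glob
from BeautifulSoup import BeautifulSoup

for filename in glob.glob('/home/phi/data/NHL/pl0708/pl02*.htm'):
    soup = BeautifulSoup(open(filename, "r"))
    # one output file per input file; "wb" is the right mode for csv on Python 2
    writer = csv.writer(open("%s.csv" % filename, "wb"))
    for row in soup.findAll("tr", attrs={"class": "evenColor"}):
        # findAll(text=True) collects the text nodes inside each cell
        values = [''.join(td.findAll(text=True)).strip().encode("utf-8")
                  for td in row.findAll("td")]
        writer.writerow(values)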

1 Comment

Yes yes. This is what I am talking about. Thanks. I also realized that maybe I need to import re for regex? We are using '*' and '%', so we need to import re, maybe? For the second part, you say I need to pull the data out of the HTML row... but if the rows have been written from HTML to CSV, what's the point? I am probably not wrapping my head around something, but even if there are 1230 .csv files, that's good enough for me right now. Here's a link to the files I am working with: nhl.com/scores/htmlreports/20082009/PL020808.HTM

You don't really explain why you are stuck - what's not working exactly?

The following line may well be your problem:

soup = BeautifulSoup(open(filename["r"]))

It looks to me like this should be:

soup = BeautifulSoup(open(filename, "r"))

The following line:

for row in soup.findAll("tr", attrs={ "class" : "evenColor" })

looks like it will only pick out even rows (assuming your even rows have the class 'evenColor' and odd rows have 'oddColor'). Assuming you want all rows with a class of either evenColor or oddColor, you can use a regular expression to match the class value:

for row in soup.findAll("tr", attrs={ "class" : re.compile(r"evenColor|oddColor") })
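
Note that re has to be imported for re.compile to work. A minimal sketch of the adjusted loop (the <td> extraction is an assumption about the row structure):

import re

for row in soup.findAll("tr", attrs={"class": re.compile(r"evenColor|oddColor")}):
    # print the text of each cell in the row
    print [''.join(td.findAll(text=True)) for td in row.findAll("td")]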

1 Comment

"It looks to me like this should be: soup = BeautifulSoup(open(filename, "r"))" -- thanks, I changed it.

That looks fine, and BeautifulSoup is useful for this (although I personally tend to use lxml). You should be able to take the data you get and make a CSV file out of it using the csv module without any obvious problems...

I think you need to actually tell us what the problem is. "It still doesn't work" is not a problem description.
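
For what it's worth, the equivalent row selection with lxml would look something like this (a sketch, assuming the same evenColor/oddColor markup; the file name is just an example):

from lxml import html

tree = html.parse("pl020001.htm")
# select rows of either class with one XPath expression
for row in tree.xpath('//tr[@class="evenColor" or @class="oddColor"]'):
    # text_content() flattens all the text inside each cell
    values = [td.text_content().strip() for td in row.xpath('td')]
    print values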

