
I need to extract a csv file from the html page below, and once I have it I can do stuff with it. The code below, from a previous assignment, extracts the particular line of html I need. The url is 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'. It is test code, so it breaks temporarily when it finds that line. The part of the line with my csv is the href, csv/datasets/co2.csv (unicode, I think, as type).

How do I open the co2.csv? Sorry about any formatting issues with the question; the code has been sliced and diced by the editor.

import urllib
url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
from BeautifulSoup import *

def scrapper(url,k):
    c=0
    html = urllib.urlopen(url).read() 
    soup = BeautifulSoup(html)
#.    Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        y= (tag.get('href', None))
        #print ((y))
        if y == 'csv/datasets/co2.csv':
            print y
            break
        c= c+ 1

        if c is k:
            return y
            print(type(y))

for w in range(29):
    print(scrapper(url,w))
  • Please improve your question: it is not clear if you only want to process the single co2.csv file, or if you want to process all csv files that are linked on the html page. Commented Nov 9, 2016 at 22:59
  • I only want the one file, co2.csv. Commented Nov 9, 2016 at 23:05
  • And I want to take that file and do data analysis on it (linear regression). Commented Nov 9, 2016 at 23:06
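For the follow-up goal of linear regression, a minimal sketch of an ordinary least-squares fit, stdlib only; the (x, y) points here are made-up placeholders, not the real co2.csv data (shown in Python 3 syntax):

```python
# Ordinary least-squares fit of y = a + b*x, stdlib only.
# The (x, y) pairs below are made-up placeholders, not real CO2 data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = covariance(x, y) / variance(x)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

print("slope=%.3f intercept=%.3f" % (b, a))
```

In real use you would parse the downloaded csv contents (e.g. with the csv module) into the xs and ys lists instead of hard-coding them.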

1 Answer


You're re-downloading and re-parsing the full html page on every one of the 30 iterations of your loop, just to get the next csv link and see if it is the one you want. That is very inefficient, and not very polite to the server. Read the html page once, and use the loop over the tags you already have to check whether each tag is the one you want. If so, do something with it, and stop looping to avoid needless further processing, since you said you only need one particular file.

The other issue, related to your question, is that the csv hrefs in the html file are relative urls, so you have to join them onto the base url of the document they appear in. urlparse.urljoin() does just that.
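A quick demonstration of what urljoin does with your page and href (note that in Python 3 the function moved from the urlparse module to urllib.parse; the Python 3 form is shown here):

```python
# urljoin resolves a relative href against the page's base URL,
# replacing the last path component (datasets.html) with the href.
# Python 2: from urlparse import urljoin; Python 3 (shown here):
from urllib.parse import urljoin

base = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
csv_url = urljoin(base, 'csv/datasets/co2.csv')
print(csv_url)
# -> https://vincentarelbundock.github.io/Rdatasets/csv/datasets/co2.csv
```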

Not directly related to the question, but you should also try to clean up your code:

  • fix your indentation (the comment on line 9)
  • choose better variable names; y/c/k/w are meaningless.

Resulting in something like:

import urllib
import urlparse
from BeautifulSoup import BeautifulSoup

url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'


def scraper(url):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        href = tag.get('href', None)
        # guard against anchor tags without an href attribute
        if href and href.endswith("/co2.csv"):
            csv_url = urlparse.urljoin(url, href)
            # ... do something with the csv file ...
            contents = urllib.urlopen(csv_url).read()
            print "csv file size=", len(contents)
            break  # we only needed this one file, so we end the loop


scraper(url)
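The above keeps your Python 2 setup (urllib, the old BeautifulSoup module). If you're on Python 3, where urllib.urlopen and that BeautifulSoup import no longer exist, the same idea can be sketched with the standard library alone; html.parser stands in for BeautifulSoup here, and the html string is a small inline stand-in for the real page, which you would fetch with urllib.request:

```python
# Python 3 sketch: collect hrefs with the stdlib HTMLParser instead of
# BeautifulSoup, then resolve the relative csv link with urljoin.
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the tag
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.hrefs.append(href)


url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
# In real use: html = urllib.request.urlopen(url).read().decode('utf-8')
html = '<a href="csv/datasets/airmiles.csv">a</a><a href="csv/datasets/co2.csv">b</a>'

parser = LinkCollector()
parser.feed(html)
csv_url = next(urljoin(url, h) for h in parser.hrefs
               if h.endswith('/co2.csv'))
print(csv_url)
```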

3 Comments

Yes, my code is a first crude attempt that was used for another project that had nothing to do with this question. I'll look up relative urls as well.
Then please, in the future, try not to post code that "has nothing to do with the question" - why should we even read it? Post the relevant code instead.
Yes, thanks. I am glad you understood my question.
