
I need to extract a csv file from the html page below, and once I have it I can do stuff with it. The code below, from a previous assignment, extracts the particular line of html I need. The url is 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'. It is test code, so it breaks temporarily when it finds that line. The part of the line with my csv is the href, csv/datasets/co2.csv (unicode, I think, as type).

How do I open the co2.csv? Sorry about any formatting issues with the question; the code has been sliced and diced by the editor.

import urllib
url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
from BeautifulSoup import *

def scrapper(url,k):
    c=0
    html = urllib.urlopen(url).read() 
    soup = BeautifulSoup(html)
#.    Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        y= (tag.get('href', None))
        #print ((y))
        if y == 'csv/datasets/co2.csv':
            print y
            break
        c= c+ 1

        if c is k:
            return y
            print(type(y))

for w in range(29):
    print(scrapper(url,w))
  • Please improve your question: it is not clear if you only want to process the single co2.csv file, or if you want to process all csv files that are linked on the html page. Commented Nov 9, 2016 at 22:59
  • I only want the one file, co2.csv. Commented Nov 9, 2016 at 23:05
  • And I want to take that file and do data analysis on it (linear regression). Commented Nov 9, 2016 at 23:06
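For the follow-up goal of linear regression, a minimal sketch of an ordinary least-squares fit, stdlib only; the (x, y) points here are made-up placeholders, not the real co2.csv data (shown in Python 3 syntax):

```python
# Ordinary least-squares fit of y = a + b*x, stdlib only.
# The (x, y) pairs below are made-up placeholders, not real CO2 data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = covariance(x, y) / variance(x)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

print("slope=%.3f intercept=%.3f" % (b, a))
```

In real use you would parse the downloaded csv contents (e.g. with the csv module) into the xs and ys lists instead of hard-coding them.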

1 Answer


You're re-downloading and re-parsing the full html page on every one of the 30 iterations of your loop, just to get the next csv link and see if it is the one you want. That is very inefficient, and not very polite to the server. Read the html page once, and use the loop over the tags you already have to check whether each tag is the one you want. If so, do something with it, and stop looping to avoid needless further processing, since you said you only need one particular file.

The other issue, related to your question, is that the csv hrefs in the html file are relative urls, so you have to join them onto the base url of the document they appear in. urlparse.urljoin() does just that.
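A quick demonstration of what urljoin does with your page and href (note that in Python 3 the function moved from the urlparse module to urllib.parse; the Python 3 form is shown here):

```python
# urljoin resolves a relative href against the page's base URL,
# replacing the last path component (datasets.html) with the href.
# Python 2: from urlparse import urljoin; Python 3 (shown here):
from urllib.parse import urljoin

base = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
csv_url = urljoin(base, 'csv/datasets/co2.csv')
print(csv_url)
# -> https://vincentarelbundock.github.io/Rdatasets/csv/datasets/co2.csv
```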

Not directly related to the question, but you should also try to clean up your code:

  • fix your indentation (the comment on line 9)
  • choose better variable names; y/c/k/w are meaningless.

Resulting in something like:

import urllib
import urlparse
from BeautifulSoup import BeautifulSoup

url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'


def scraper(url):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        href = tag.get('href', None)
        # guard against anchor tags without an href attribute
        if href and href.endswith("/co2.csv"):
            csv_url = urlparse.urljoin(url, href)
            # ... do something with the csv file ...
            contents = urllib.urlopen(csv_url).read()
            print "csv file size=", len(contents)
            break  # we only needed this one file, so we end the loop


scraper(url)
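The above keeps your Python 2 setup (urllib, the old BeautifulSoup module). If you're on Python 3, where urllib.urlopen and that BeautifulSoup import no longer exist, the same idea can be sketched with the standard library alone; html.parser stands in for BeautifulSoup here, and the html string is a small inline stand-in for the real page, which you would fetch with urllib.request:

```python
# Python 3 sketch: collect hrefs with the stdlib HTMLParser instead of
# BeautifulSoup, then resolve the relative csv link with urljoin.
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the tag
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.hrefs.append(href)


url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
# In real use: html = urllib.request.urlopen(url).read().decode('utf-8')
html = '<a href="csv/datasets/airmiles.csv">a</a><a href="csv/datasets/co2.csv">b</a>'

parser = LinkCollector()
parser.feed(html)
csv_url = next(urljoin(url, h) for h in parser.hrefs
               if h.endswith('/co2.csv'))
print(csv_url)
```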

3 Comments

Yes, my code is a first crude attempt that was used for another project that had nothing to do with this question. I'll look up relative urls as well.
Then please, in the future, try not to post code that "has nothing to do with the question" - why should we even read it? Post the relevant code instead.
Yes, thanks. I am glad you understood my question.
