
As of pandas 0.19.2, the function read_csv() can be passed a URL directly. See, for example, this answer:

import pandas as pd

url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
c = pd.read_csv(url)

The URL I'd like to use is: https://moz.com/top500/domains/csv

With the above code, this URL returns an error:

urllib2.HTTPError: HTTP Error 403: Forbidden

Based on this post, I can get a valid response by passing a request header:

import urllib2

site = "https://moz.com/top500/domains/csv"
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

req = urllib2.Request(site, headers=hdr)

try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    print(e.fp.read())
else:
    content = page.read()
    print(content)
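(For reference, urllib2 is Python 2 only; in Python 3 it was folded into urllib.request. A minimal sketch of building the same header-carrying request in Python 3 — with the network call itself left out, since only the header handling is of interest here — might look like:)

```python
import urllib.request

# Sketch only: shows how a custom header attaches to a urllib.request.Request
# in Python 3. The urlopen() call is omitted to keep this offline.
site = "https://moz.com/top500/domains/csv"
hdr = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}

req = urllib.request.Request(site, headers=hdr)
# Request stores header keys in capitalized form, hence "User-agent" here.
print(req.get_header("User-agent"))
```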

Is there any way to use the web URL functionality of Pandas read_csv(), but also pass a request header to make the request go through?

2 Answers

I would recommend using the requests library together with io for this task. The following code should do the job:

import pandas as pd
import requests
from io import StringIO

url = "https://moz.com:443/top500/domains/csv"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0"}
req = requests.get(url, headers=headers)
data = StringIO(req.text)

df = pd.read_csv(data)
print(df)

(If you want to add a custom header, just modify the headers variable.)

Hope this helps


3 Comments

Thanks - I was not aware of the io package previously. If possible, could you explain the advantage of putting req.text into StringIO vs. reading the URL directly with pandas, like df = pd.read_csv(url)? Actually, I see you edited the question to reflect the new pandas version - do you believe that is the more efficient way?
@thesimplevoodoo Hey, the reason I'm using StringIO here is that pd.read_csv() expects a file path, URL, or file-like object, so giving it raw CSV text such as req.text would yield an error (pandas would try to interpret the string as a path). By having data = StringIO(req.text), I can pass data to pandas as a file-like object. (Do note that StringIO doesn't create any actual file; it lets you read and write strings as if they were files.)
This is a nice solution, though it should probably not be the accepted answer. It does not answer the OP's question: "Is there any way to use the web URL functionality of Pandas read_csv(), but also pass a request header to make the request go through?" I'm personally much more interested in the question as it pertains to read_csv and potential header usage.
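(The StringIO point in the comments above can be sketched with a small self-contained example; the CSV string below is made up for illustration, standing in for req.text:)

```python
import pandas as pd
from io import StringIO

# Illustrative stand-in for the text of an HTTP response body.
csv_text = "Domain,Rank\nexample.com,1\nexample.org,2\n"

buf = StringIO(csv_text)  # wraps the string in an in-memory file-like object
df = pd.read_csv(buf)     # read_csv consumes it exactly like an open file
print(df.shape)           # (2, 2)
```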

As of pandas 1.3.0, you can pass custom HTTP(S) headers using the storage_options argument:

import pandas as pd

url = "https://moz.com:443/top500/domains/csv"

hdr = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
}

domains_df = pd.read_csv(url, storage_options=hdr)
