
I want to build a small script in Python which needs to fetch a URL. The server is a bit crappy, though, and replies with pure ASCII without any headers.

When I try:

import urllib.request
response = urllib.request.urlopen(url)
print(response.read())

I get an http.client.BadStatusLine: 100 error because this isn't a properly formatted HTTP response.

Is there another way to fetch a URL and get the raw content, without trying to parse the response?

Thanks

  • Have you tried urllib2 or requests? Commented Apr 11, 2012 at 14:24
  • I'm using python 3, and urllib2 isn't installed by default there. I think it's for python2, but correct me if I'm wrong. To my understanding, the behavior would also be the same, as urllib2 also parses the response (feel free to correct me if I am mistaken). Commented Apr 11, 2012 at 14:27
  • Looks like urllib in python3.x is the same as urllib2 in python2.x. Have you tried making a URLopener object, then using one of its open methods (open_data, open_file, open_ftp, open_http, open_https - use help(urllib) to find out more)? While I don't have python3.x or access to the data you are testing against, the docs say nothing about headers on this, whereas the request method does, explicitly. The requests module is widely lauded, though, if it is useful for this. Commented Apr 11, 2012 at 15:42
  • Disregard most of that last comment - I am mixing up content from urllib and urllib2 - just check the docs for what you have - it's generally fairly clear Commented Apr 11, 2012 at 15:47

3 Answers


It's difficult to answer your direct question without a bit more information, since we don't know exactly how the (web) server in question is broken.

That said, you might try using something a bit lower-level, a socket for example. Here's one way (python2.x style, and untested):

#!/usr/bin/env python
import socket
from urlparse import urlparse

def geturl(url, timeout=10, receive_buffer=4096):
    parsed = urlparse(url)
    try:
        host, port = parsed.netloc.split(':')
    except ValueError:
        host, port = parsed.netloc, 80

    sock = socket.create_connection((host, port), timeout)
    # HTTP lines end with CRLF; the blank line terminates the request
    sock.sendall('GET %s HTTP/1.0\r\n\r\n' % parsed.path)

    # Read until the server closes the connection
    response = [sock.recv(receive_buffer)]
    while response[-1]:
        response.append(sock.recv(receive_buffer))

    return ''.join(response)

print geturl('http://www.example.com/')  # <- the trailing / is needed if no other path element is present

And here's a stab at a python3.2 conversion (you may not need to decode from bytes, if writing the response to a file for example):

#!/usr/bin/env python
import socket
from urllib.parse import urlparse

ENCODING = 'ascii'

def geturl(url, timeout=10, receive_buffer=4096):
    parsed = urlparse(url)
    try:
        host, port = parsed.netloc.split(':')
    except ValueError:
        host, port = parsed.netloc, 80

    sock = socket.create_connection((host, port), timeout)

    # HTTP lines end with CRLF; the blank line terminates the request
    method = 'GET %s HTTP/1.0\r\n\r\n' % parsed.path
    sock.sendall(bytes(method, ENCODING))

    # Read until the server closes the connection
    response = [sock.recv(receive_buffer)]
    while response[-1]:
        response.append(sock.recv(receive_buffer))

    return ''.join(r.decode(ENCODING) for r in response)

print(geturl('http://www.example.com/'))

HTH!

Edit: You may need to adjust what you put in the request, depending on the web server in question. Guanidene's excellent answer provides several resources to guide you on that path.
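For instance, building on the geturl function above, the request could be fleshed out a little. This is only a sketch - the Host and Connection headers here are assumptions, and whatever the server in question actually expects may well differ:

# Sketch only: the Host and Connection headers are guesses --
# adjust them to whatever the server in question actually expects.
request = ('GET %s HTTP/1.0\r\n'
           'Host: %s\r\n'
           'Connection: close\r\n'
           '\r\n' % (parsed.path or '/', host))
sock.sendall(bytes(request, ENCODING))

HTTP/1.0 doesn't strictly require a Host header, but many servers behind virtual hosting won't answer sensibly without one.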


3 Comments

...this isn't working yet but definitely on the right track... thanks
Great! Feel free to share what's not working if you think we can help.
I just had to tweak the request header a little. The targeted server is quite a crazy beast. :)

What you need to do in this case is send a raw HTTP request using sockets.
You would need to do a bit of low-level network programming using the socket Python module in this case. (Network sockets return all the information sent by the server as it is, so you can interpret the response however you wish. The high-level urllib module, by contrast, interprets the response according to the HTTP protocol - status line, headers, and so on - hides that information from you, and just returns the data.)

You also need some basic information about HTTP headers. For your case, you just need to know about the GET HTTP request. See its definition here - http://djce.org.uk/dumprequest - and an example session here - http://en.wikipedia.org/wiki/HTTP#Example_session. (If you wish to capture live traces of the HTTP requests sent by your browser, you would need packet sniffing software like Wireshark.)

Once you know the basics of the socket module and HTTP headers, you can go through this example - http://coding.debuntu.org/python-socket-simple-tcp-client - which shows how to send an HTTP request over a socket to a server and read its reply back. You can also refer to this unclear question on SO.

(You can google python socket http to get more examples.)

(Tip: I am not a Java fan, but still, if you don't find enough convincing examples on this topic under Python, try finding them under Java and then translate them to Python accordingly.)
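For completeness, here is a minimal, untested sketch of that idea in Python 3. The URL and the decision to send a Host header are just placeholders - adjust the request for the server you are actually talking to:

#!/usr/bin/env python
# Untested sketch: send a raw HTTP GET over a socket and return whatever
# bytes the server sends back, without parsing anything.
import socket
from urllib.parse import urlparse

def fetch_raw(url, timeout=10, receive_buffer=4096):
    parsed = urlparse(url)
    host = parsed.hostname
    port = parsed.port or 80
    # Plain HTTP/1.0 request; the Host header is optional in 1.0 but
    # often needed in practice.
    request = ('GET %s HTTP/1.0\r\nHost: %s\r\n\r\n'
               % (parsed.path or '/', host)).encode('ascii')
    with socket.create_connection((host, port), timeout) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            chunk = sock.recv(receive_buffer)
            if not chunk:  # server closed the connection
                break
            chunks.append(chunk)
    return b''.join(chunks)

print(fetch_raw('http://www.example.com/'))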

Comments

urllib.urlretrieve('http://google.com/abc.jpg', 'abc.jpg')
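In Python 3 the same function lives under urllib.request, e.g.:

# Python 3 spelling of the same call. Note that urlretrieve still goes
# through the normal HTTP response parsing, so a malformed status line
# from the server will raise a similar error (see the comments below).
from urllib.request import urlretrieve
urlretrieve('http://google.com/abc.jpg', 'abc.jpg')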

4 Comments

also parses the response and results in a http.client.BadStatusLine: 100
it's corporate stuff, sorry (but I can see the output when pasting the URL into Firefox, for instance)
urlretrieve should just put what the server sends into a file. Change 'abc.jpg' to 'abc.txt'.
Well, what should I say, it doesn't! :p ...it checks the response and results in a urllib.error.URLError: <urlopen error http protocol error: bad status line>
