
I'm trying to create a small web crawler application. I wrote this code:

import socket

# text (the crawler's configuration: patch, keyword, output) and
# list_good are defined elsewhere in the application.

def isGood(URL):
    try:
        cURL = URL + text.patch
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(3)
        sock.connect((URL, 80))
        header  = "GET %s HTTP/1.1\r\n" % text.patch
        header += "User-Agent: Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36\r\n"
        header += "Accept: */*\r\n"
        header += "Host: %s\r\n\r\n" % URL
        sock.send(header)
        data = sock.recv(1024)
        html = ""
        for x in range(10):  # read at most 10 chunks of 1024 bytes
            html = html + data
            data = sock.recv(1024)
            if len(data) == 0:  # server closed the connection
                break
        sock.close()
        if str(text.keyword) in html:
            print '+ ' + cURL
            logfile = open(text.output, 'a')
            logfile.write('%s\n' % cURL)
            logfile.close()
            list_good.append(cURL)
    except:
        pass

The code works, but it is very, very slow: after the response the connection should close, but the socket does not close and instead waits for the timeout. How can I speed this up? I am running it with many threads!

4 Comments
  • Why are you recreating urllib, and why are you calling recv 10 times? Commented Apr 29, 2014 at 17:24
  • I need to use socket to keep PC resource usage minimal; I call recv 10 times to read the complete web page! Commented Apr 29, 2014 at 18:02
  • urllib also uses socket, uses negligible resources and, IMHO, would be better suited to what you're trying to do. You should think about what happens if your webpage is smaller than 10*1024 bytes. Commented Apr 29, 2014 at 18:10
  • With socket, 500 threads use 1% of the processor. With urllib, 500 threads use 10-15% of the processor... is that the same? Commented Apr 29, 2014 at 18:19

1 Answer


Please, not another broken attempt at writing your own HTTP stack because the existing one is supposedly too slow. Just a few of the mistakes in your code:

  • You use HTTP/1.1, which implies persistent connections (i.e. Connection: keep-alive) unless you say otherwise.
  • This means that you need to read the HTTP response header first (which you don't, probably to be faster) and then determine the length of the body, e.g. by checking for a Transfer-Encoding: chunked or a Content-Length header (in that order).
  • If you don't do this, you will just hang until the server closes the connection, because it does not want to wait any longer for your next request (keep-alive allows multiple requests on a single TCP connection). This is what slows you down here.
  • You could save yourself all this trouble by doing HTTP/1.0 requests with no keep-alive; a minimal sketch of this variant follows this list. But then you need one TCP connection per request, which adds lots of overhead and latency, so it will probably end up slower than a proven HTTP library with all its perceived overhead, which can at least handle persistent connections properly. And it will probably be slower even if you run it multithreaded.
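For illustration, here is a minimal sketch of that HTTP/1.0 route (the function name fetch_http10 and its host/path parameters are placeholders, not code from the question):

import socket

def fetch_http10(host, path):
    # With HTTP/1.0 and an explicit "Connection: close" the server
    # closes the TCP connection after sending the response, so
    # recv() returns '' at end of data instead of blocking until
    # the timeout expires.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(3)
    sock.connect((host, 80))
    request  = "GET %s HTTP/1.0\r\n" % path
    request += "Host: %s\r\n" % host
    request += "Connection: close\r\n\r\n"
    sock.sendall(request)
    chunks = []
    while True:
        data = sock.recv(1024)
        if not data:  # '' means the server closed the connection
            break
        chunks.append(data)
    sock.close()
    return "".join(chunks)

As said above, this removes the hang but still pays for one TCP connection per request.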

So do yourself a favor and don't reinvent the wheel. But if you are still determined to do it yourself and to beat the existing libraries (which is probably possible, although not by much), I recommend that you thoroughly study the HTTP specification, at minimum RFC 2616. Then let's see if you can do it better and faster, because usually: those who don't understand X are condemned to reinvent it, poorly.
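Just to show how little code the library route takes, here is a rough urllib2 sketch of the same isGood() check (Python 2; text and list_good are assumed to be the same objects as in the question):

import urllib2

def isGood(URL):
    cURL = URL + text.patch  # text is the question's config object
    try:
        # urllib2 parses the response header, honors Content-Length
        # and chunked transfer encoding, and returns when the body
        # is complete -- no guessing with repeated recv() calls.
        response = urllib2.urlopen("http://" + cURL, timeout=3)
        html = response.read()
        response.close()
    except Exception:
        return
    if str(text.keyword) in html:
        print '+ ' + cURL
        logfile = open(text.output, 'a')
        logfile.write('%s\n' % cURL)
        logfile.close()
        list_good.append(cURL)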


8 Comments

I only need to get the web page and close the connection, not create a new library! How can I do this? Get www.site.com/robots.txt and close the connection instantly!
You don't need to create a new library. There is an HTTP library for Python, and it is easier to use than writing your own. Apart from that, I've pointed out some errors in your code; if you fix these, you are on the right track.
I assume that you don't have IP addresses in the URLs but host names. Did you measure how long it takes on your system to resolve all the host names you use? Apart from that, the Content-Length header is wrong multiple times: first, you add it as a body; second, a GET cannot have a body; and third, because of that, a GET does not need a Content-Length header.
Another bug: you expect the first recv(1024) to contain the header and the rest the HTML. But this is only the case if the header is less than 1024 bytes, was sent in a separate send() by the server, and maybe also only if the server disabled the Nagle algorithm. If that's not the case, you might have HTML in the perceived header or header data in the perceived HTML part (a small sketch of the correct split follows these comments).
This is not important; what is important is to close the connection after getting the page!
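To make the header/body point concrete: with a raw socket you first have to accumulate everything received and only then split at the first blank line, roughly like this (a sketch, assuming the full response is already in raw):

def split_response(raw):
    # The header ends at the first CRLF CRLF; this boundary can fall
    # anywhere inside any recv() chunk, so split the accumulated
    # data, never just the first chunk.
    header, _, body = raw.partition("\r\n\r\n")
    return header, body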
