
I'm trying to create a small web crawler application. I wrote this code:

import socket

# text (the crawler's configuration: patch, keyword, output) and
# list_good are defined elsewhere in the application.

def isGood(URL):
    try:
        cURL = URL + text.patch
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(3)
        sock.connect((URL, 80))
        header  = "GET %s HTTP/1.1\r\n" % text.patch
        header += "User-Agent: Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36\r\n"
        header += "Accept: */*\r\n"
        header += "Host: %s\r\n\r\n" % URL
        sock.send(header)
        data = sock.recv(1024)
        html = ""
        for x in range(10):  # read at most 10 chunks of 1024 bytes
            html = html + data
            data = sock.recv(1024)
            if len(data) == 0:  # server closed the connection
                break
        sock.close()
        if str(text.keyword) in html:
            print '+ ' + cURL
            logfile = open(text.output, 'a')
            logfile.write('%s\n' % cURL)
            logfile.close()
            list_good.append(cURL)
    except:
        pass

The code works, but it is very, very slow: after the response the connection should close, but the socket does not close and instead waits for the timeout. How can I speed this up? I am running it with many threads!

4 Comments
  • Why are you recreating urllib, and why are you calling recv 10 times? Commented Apr 29, 2014 at 17:24
  • I need to use socket to keep PC resource usage minimal; I call recv 10 times to read the complete web page! Commented Apr 29, 2014 at 18:02
  • urllib also uses socket, uses negligible resources and, IMHO, would be better suited to what you're trying to do. You should think about what happens if your webpage is smaller than 10*1024 bytes. Commented Apr 29, 2014 at 18:10
  • With socket, 500 threads use 1% of the processor. With urllib, 500 threads use 10-15% of the processor... is that the same? Commented Apr 29, 2014 at 18:19

1 Answer


Please, not another broken attempt at writing your own HTTP stack because the existing one is supposedly too slow. Just a few of the mistakes in your code:

  • You use HTTP/1.1, which implies persistent connections (i.e. Connection: keep-alive) unless you say otherwise.
  • This means that you need to read the HTTP response header first (which you don't, probably to be faster) and then determine the length of the body, e.g. by checking for a Transfer-Encoding: chunked or a Content-Length header (in that order).
  • If you don't do this, you will just hang until the server closes the connection, because it does not want to wait any longer for your next request (keep-alive allows multiple requests on a single TCP connection). This is what slows you down here.
  • You could save yourself all this trouble by doing HTTP/1.0 requests with no keep-alive; a minimal sketch of this variant follows this list. But then you need one TCP connection per request, which adds lots of overhead and latency, so it will probably end up slower than a proven HTTP library with all its perceived overhead, which can at least handle persistent connections properly. And it will probably be slower even if you run it multithreaded.
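For illustration, here is a minimal sketch of that HTTP/1.0 route (the function name fetch_http10 and its host/path parameters are placeholders, not code from the question):

import socket

def fetch_http10(host, path):
    # With HTTP/1.0 and an explicit "Connection: close" the server
    # closes the TCP connection after sending the response, so
    # recv() returns '' at end of data instead of blocking until
    # the timeout expires.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(3)
    sock.connect((host, 80))
    request  = "GET %s HTTP/1.0\r\n" % path
    request += "Host: %s\r\n" % host
    request += "Connection: close\r\n\r\n"
    sock.sendall(request)
    chunks = []
    while True:
        data = sock.recv(1024)
        if not data:  # '' means the server closed the connection
            break
        chunks.append(data)
    sock.close()
    return "".join(chunks)

As said above, this removes the hang but still pays for one TCP connection per request.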

So do yourself a favor and don't reinvent the wheel. But if you are still determined to do it yourself and to beat the existing libraries (which is probably possible, although not by much), I recommend that you thoroughly study the HTTP specification, at minimum RFC 2616. Then let's see if you can do it better and faster, because usually: those who don't understand X are condemned to reinvent it, poorly.
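Just to show how little code the library route takes, here is a rough urllib2 sketch of the same isGood() check (Python 2; text and list_good are assumed to be the same objects as in the question):

import urllib2

def isGood(URL):
    cURL = URL + text.patch  # text is the question's config object
    try:
        # urllib2 parses the response header, honors Content-Length
        # and chunked transfer encoding, and returns when the body
        # is complete -- no guessing with repeated recv() calls.
        response = urllib2.urlopen("http://" + cURL, timeout=3)
        html = response.read()
        response.close()
    except Exception:
        return
    if str(text.keyword) in html:
        print '+ ' + cURL
        logfile = open(text.output, 'a')
        logfile.write('%s\n' % cURL)
        logfile.close()
        list_good.append(cURL)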


8 Comments

I only need to get the web page and close the connection, not create a new library! How can I do this? Get www.site.com/robots.txt and close the connection instantly!
You don't need to create a new library. There is an HTTP library for Python, and it is easier to use than writing your own. Apart from that, I've pointed out some errors in your code; if you fix these, you are on the right track.
I assume that you don't have IP addresses in the URLs but host names. Did you measure how long it takes on your system to resolve all the host names you use? Apart from that, the Content-Length header is wrong multiple times: first, you add it as a body; second, a GET cannot have a body; and third, because of that, a GET does not need a Content-Length header.
Another bug: you expect the first recv(1024) to contain the header and the rest the HTML. But this is only the case if the header is less than 1024 bytes, was sent in a separate send() by the server, and maybe also only if the server disabled the Nagle algorithm. If that's not the case, you might have HTML in the perceived header or header data in the perceived HTML part (a small sketch of the correct split follows these comments).
This is not important; what is important is to close the connection after getting the page!
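To make the header/body point concrete: with a raw socket you first have to accumulate everything received and only then split at the first blank line, roughly like this (a sketch, assuming the full response is already in raw):

def split_response(raw):
    # The header ends at the first CRLF CRLF; this boundary can fall
    # anywhere inside any recv() chunk, so split the accumulated
    # data, never just the first chunk.
    header, _, body = raw.partition("\r\n\r\n")
    return header, body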
