I have written a short function in Python 3 to parse HTTP headers. Could someone take a look and tell me whether there is anything I could have done differently to make the code better? It currently produces the required outcome, but I am not sure whether there are situations in which it would fail to produce the desired result.

This is what I have:

def _parse_headers(self, headers):
  lines = headers.split("\r\n")
  info = lines[0].split(" ")

  method = None
  path = None
  protocol = None
  headers = {}

  if len(info) > 0:
    method = info[0]
  if len(info) > 1:
    path = info[1]
  if len(info) > 2:
    protocol = info[2]

  for line in lines[1:]:
    if line:
      parts = line.split(":")
      key = None
      value = None
      if len(parts) > 0:
        key = parts[0]
      if len(parts) > 1:
        value = parts[1]
      if not key is None and not value is None:
        headers[key.strip().upper()] = value.strip()

  return {
    "method": method,
    "path": path,
    "protocol": protocol,
    "headers": headers
  }
  • This answer gives a nice way of parsing the headers using methods from the standard library. Use it instead of rolling your own code. Commented Sep 11, 2014 at 18:40
  • I can see some problems here. This does not properly handle headers that span multiple lines, and does not properly handle headers whose values contain a : character. There is also the issue of only recognizing \r\n line breaks. Bare \n line breaks are not strictly conformant, but you should either explicitly accept or explicitly reject them. Commented Sep 11, 2014 at 19:06
  • I agree with the other posters who recommend using an existing parsing library. But if you do want to "roll your own" you can eliminate that triple if construction with this hack: method, path, protocol = (info + 3*[None])[:3]. But it is a hack. :) Commented Sep 11, 2014 at 20:06
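The padding trick from the last comment can be tried in isolation. This is only a sketch: the function name split_request_line is made up for illustration, and it keeps the original single-space split without validating the request line.

```python
def split_request_line(line):
    """Hypothetical helper: split a request line into method, path, protocol."""
    info = line.split(" ")
    # Pad with three Nones, then keep the first three items, so a short
    # request line still unpacks cleanly instead of raising ValueError.
    method, path, protocol = (info + 3 * [None])[:3]
    return method, path, protocol
```

Missing parts come back as None, which matches what the triple if construction in the question produces.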

1 Answer

As noted by André in the comments, parsing HTTP is not to be taken lightly, unless as an exercise. In real programs you should generally stick to existing, mature implementations if possible.

Note that beyond the overall message structure, every header has its own peculiar internal structure, and you will often need to parse that too; Werkzeug can help there.

The obvious specific problems with your code are:

  • given a header Host: www.example.com:80, it will return www.example.com as its value;
  • given multiple headers with the same name, it will only return the value of the last one.
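If you do roll your own as an exercise, both problems can be addressed roughly as follows. This is a minimal sketch, not a complete parser: parse_header_lines is a hypothetical name, the input is assumed to be the already split header lines (everything after the request line), and headers folded across several lines are still not handled.

```python
def parse_header_lines(lines):
    """Hypothetical helper: map upper-cased header names to lists of values."""
    headers = {}
    for line in lines:
        if not line:
            continue
        # partition splits on the first colon only, so the value of
        # "Host: www.example.com:80" keeps its ":80" part.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # no colon at all: skip the malformed line
        # Collect values in a list so repeated headers are not overwritten.
        headers.setdefault(key.strip().upper(), []).append(value.strip())
    return headers
```

Storing a list per name means callers have to decide how to combine repeated values, but nothing is silently dropped.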
2 Comments

I've fixed the first bullet point, but for the second one how would I tackle that?
@TechnoCF Use data structures similar to those for email headers, as that’s the origin of this message format. See the standard http.server.
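For illustration, the standard library exposes that email-style structure through http.client.parse_headers, which is also what http.server uses internally. The sketch below assumes the request line has already been removed and that the header block arrives as bytes ending in a blank line; the sample header values are invented.

```python
import io
import http.client

# Sample header block (made-up values); the request line is not included.
raw = (
    b"Host: www.example.com:80\r\n"
    b"Set-Cookie: a=1\r\n"
    b"Set-Cookie: b=2\r\n"
    b"\r\n"
)

# parse_headers reads up to the blank line and returns an
# http.client.HTTPMessage, a subclass of email.message.Message.
message = http.client.parse_headers(io.BytesIO(raw))

host = message["host"]                   # lookup is case-insensitive
cookies = message.get_all("Set-Cookie")  # every value for a repeated header
```

Indexing with message[...] returns only one value for a repeated header, so get_all is the call to reach for when duplicates matter.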