
What is the best way to parse data out of a URL query string (for instance, data appended to the URL by a form) in Python? My goal is to accept form data and display it on the same page. I've researched several methods, but none are quite what I'm looking for.

I'm creating a simple web server with the goal of learning about sockets. This web server won't be used for anything but testing purposes.

GET /?1pm=sample&2pm=&3pm=&4pm=&5pm= HTTP/1.1
Host: localhost:50000
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:11.0) Gecko/20100101 Firefox/11.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Referer: http://localhost:50000/?1pm=sample&2pm=&3pm=&4pm=&5pm=
  • Are you looking to write the parsing from scratch, or what? Commented Apr 11, 2012 at 20:11
  • What's wrong with stackoverflow.com/questions/1349367/… or stackoverflow.com/questions/4685217/parse-raw-http-headers? You haven't given us enough info about what other approaches are lacking. Do you have an example header or two? Commented Apr 11, 2012 at 20:12
  • Nothing is 'wrong' with either of these posts. Based on the programming experience I've had in the past, I'm inclined to do something similar to the regex approach in the second link. However, I wanted to ask and see if there is a simpler way to do it, since this is my first Python program. Commented Apr 11, 2012 at 20:24
  • Looks to me like you're talking about URL query strings, not HTTP headers. You might want to update your question to reflect this. Commented Apr 11, 2012 at 20:57

6 Answers


Here is an example using Python 3's urllib.parse:

from urllib.parse import urlparse, parse_qs
URL = 'https://someurl.com/with/query_string?i=main&mode=front&sid=12ab&enc=+Hello'
parsed_url = urlparse(URL)
parse_qs(parsed_url.query)

Output:

{'i': ['main'], 'mode': ['front'], 'sid': ['12ab'], 'enc': [' Hello']}

(Note that the + in enc=+Hello is decoded to a space.)

Note for Python 2: from urlparse import urlparse, parse_qs
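Since parse_qs returns a list for every key (addressing the question in the comments below about the ['value'] shape), a single value can be pulled out by indexing. A small sketch using the same example URL:

```python
from urllib.parse import urlparse, parse_qs

URL = 'https://someurl.com/with/query_string?i=main&mode=front&sid=12ab&enc=+Hello'
params = parse_qs(urlparse(URL).query)

# Each value is a list (a key may repeat); index to get a single value
print(params['i'][0])    # 'main'
print(params['enc'][0])  # ' Hello' (the '+' decodes to a space)
```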

SEE: https://pythonhosted.org/six/#module-six.moves.urllib.parse


2 Comments

And why are the values like this: ['value']? dict['enc'] gets ['Hello']; how do I get 'Hello'? With split?
@Suisse see stackoverflow.com/questions/11447391/… The values are in a list because multiple values can be encoded for the same key; see stackoverflow.com/questions/2571145/… Hope it helps.

The urllib.parse module is your friend: https://docs.python.org/3/library/urllib.parse.html

Check out urllib.parse.parse_qs (for parsing a query string, i.e. form data sent to the server by GET, or form data posted by POST, at least for non-multipart data). There's also cgi.FieldStorage for interpreting multipart data.

For parsing the rest of an HTTP interaction, see RFC 2616, the HTTP/1.1 protocol specification.
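As a sketch, here is parse_qs applied to the query string from the GET line in the question. One assumption worth flagging: parse_qs drops blank values by default, and the question's form sends several, so keep_blank_values=True is needed to retain them:

```python
from urllib.parse import parse_qs

# Query string from "GET /?1pm=sample&2pm=&3pm=&4pm=&5pm=" in the question
query = '1pm=sample&2pm=&3pm=&4pm=&5pm='

# By default, keys with empty values are discarded
print(parse_qs(query))
# {'1pm': ['sample']}

# keep_blank_values=True retains them
print(parse_qs(query, keep_blank_values=True))
# {'1pm': ['sample'], '2pm': [''], '3pm': [''], '4pm': [''], '5pm': ['']}
```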

10 Comments

I'm not writing the script for him. He specifically asked how to parse query data, at least that's what I read between the lines, even though those are not actually HTTP headers. But I didn't bother commenting on that.
I'm not suggesting that you should write the script for him, but urlparse is only a tiny piece of this puzzle.
For the amount of information he gave, that's all there is to say. Specifically, if you're actually referring to HTTP headers: is he using a webserver which actually allows you to get HTTP headers uninterpreted (via some stream)? Is he using WSGI (where HTTP-headers are interpreted by the framework)? Plain-old CGI, where you have to interpret the environment and hope for the best? Whatever.
urlparse looks like a great resource. The header is pretty simple and I've added it to the original question. As I'm sure you can guess, my initial idea is to parse the get line into an array of strings.
Are you trying to write a webserver? Or some form of packet inspection/inspector?

If you need unique keys from the query string, use dict() with parse_qsl():

>>> import urllib.parse
>>> urllib.parse.urlparse('https://someurl.com/with/query_string?a=1&b=2&b=3').query
'a=1&b=2&b=3'
>>> urllib.parse.parse_qs('a=1&b=2&b=3')
{'a': ['1'], 'b': ['2', '3']}
>>> urllib.parse.parse_qsl('a=1&b=2&b=3')
[('a', '1'), ('b', '2'), ('b', '3')]
>>> dict(urllib.parse.parse_qsl('a=1&b=2&b=3'))
{'a': '1', 'b': '3'}

1 Comment

It's important to notice that casting the list of tuples to a dict doesn't preserve both values of b; all but the last get discarded. Wasn't aware of parse_qsl, good addition.

Built into Python 2.7:

>>> from urlparse import parse_qs
>>> parse_qs("search=quint&tags=python")
{'search': ['quint'], 'tags': ['python']}

Comments


Only for quick one-line prototyping of CGI vars without imports; not the best approach, obviously, but it could be useful.

args = dict(item.split('=') for item in env['QUERY_STRING'].split('&') if item)
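As the comments below point out, this breaks on URL-encoded values. A slightly more robust sketch in the same spirit (the query string here is a made-up example): decode each piece with urllib.parse.unquote_plus and use partition so an '=' inside a value doesn't raise. For real use, parse_qs is still the better choice.

```python
from urllib.parse import unquote_plus

query = '1pm=sample&greeting=hello%20world&enc=%2B1'

# Same shape as the one-liner above, but with percent- and plus-decoding
args = dict(
    (unquote_plus(k), unquote_plus(v))
    for k, _, v in (item.partition('=') for item in query.split('&') if item)
)
print(args)  # {'1pm': 'sample', 'greeting': 'hello world', 'enc': '+1'}
```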

3 Comments

This will break if any parameter in the query string is URL-encoded. "Manual parsing" of URLs is the source of many security issues.
Indeed, hence the warning "only for prototyping"; I posted it to showcase quick parsing without any imports.
I wonder if every URL parser is a "manual parser"? At some point someone had to sit down and write it...

Based on this article, you can use can_ada to parse URLs in Python.

From their project:

import can_ada
urlstring = "https://www.GOoglé.com/./path/../path2/"
url = can_ada.parse(urlstring)
# prints www.xn--googl-fsa.com, the correctly parsed domain name according
# to WHATWG
print(url.hostname)
# prints /path2/, which is the correctly parsed pathname according to WHATWG
print(url.pathname)

import urllib.parse
urlstring = "https://www.GOoglé.com/./path/../path2/"
url = urllib.parse.urlparse(urlstring)
# prints www.googlé.com
print(url.hostname)
# prints /./path/../path2/
print(url.path)

Comments
