How to decode the gzip-compressed data returned in an HTTP response in Python?

Question

I have created a client/server architecture in Python: my code takes an HTTP request from the client and serves it by requesting another HTTP server.

When I get the response from the other server, I am not able to decode the gzip-compressed data. I first split the response data using \r\n as the separator, which gave me the data as the last item in the list, and then I tried decompressing it with

zlib.decompress(data[-1])

but it fails with an "incorrect header" error. How should I approach this problem?

Code

client_reply = ''
while 1:
    chunk = server2.recv(512)
    if len(chunk):
        client.send(chunk)
        client_reply += chunk
    else:
        break
client_split = client_reply.split("\r\n")
print client_split[-1].decode('zlib')

I want to read the data that is being transferred between the client and the second server.

Show us the code! Are you sure the data hasn't been encoded/decoded improperly (i.e. it should be treated as binary data)?
– Cameron, Commented Mar 18, 2012 at 20:35
Could be that your data is split into multiple chunks and you need to parse the header to get the right length. The gzipped header has length information.
– Jens Munk, Commented Apr 3, 2016 at 17:21
What if the compressed data itself contains "\r\n", so that you split it and decode only part of the compressed data instead of all of it? I'd try to find "\r\n" in the data on the server before you send it, to check whether that is the problem.
– Ronen Ness, Commented Apr 4, 2016 at 9:46

Community · Accepted Answer · 2021-10-07 08:14:10Z

Specify wbits when calling zlib.decompress(string, wbits, bufsize); see the end of "Troubleshooting" for an example.

Troubleshooting

Let's start with a curl command that downloads a byte-range response with an unknown "content-encoding" (note: we know beforehand that it is compressed somehow, maybe deflate, maybe gzip):

export URL="https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-18/segments/1461860106452.21/warc/CC-MAIN-20160428161506-00007-ip-10-239-7-51.ec2.internal.warc.gz"
curl -r 266472196-266527075 $URL | gzip -dc | tee hello.txt

With the following response headers:

HTTP/1.1 206 Partial Content
x-amz-id-2: IzdPq3DAPfitkgdXhEwzBSwkxwJRx9ICtfxnnruPCLSMvueRA8j7a05hKr++Na6s
x-amz-request-id: 14B89CED698E0954
Date: Sat, 06 Aug 2016 01:26:03 GMT
Last-Modified: Sat, 07 May 2016 08:39:18 GMT
ETag: "144a93586a13abf27cb9b82b10a87787"
Accept-Ranges: bytes
Content-Range: bytes 266472196-266527075/711047506
Content-Type: application/octet-stream
Content-Length: 54880
Server: AmazonS3

So to the point.

Let's display the hex output of the first 13 bytes (the 10-byte gzip header plus 3 bytes of compressed data): curl -r 266472196-266472208 $URL | xxd

hex output:

0000000: 1f8b 0800 0000 0000 0000 ecbd eb

The hex values show some basics of what we are working with.

Roughly, this means it is probably gzip (magic bytes 1f8b) using deflate (compression method 08), with no flags set (00), no modification time (0000 0000), no extra flags (00), and an OS byte of 00 (FAT filesystem).

Please refer to section 2.3 / 2.3.1: https://www.rfc-editor.org/rfc/rfc1952#section-2.3.1
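As a rough illustration of the field layout just described, the fixed 10-byte header can be unpacked with the struct module. This is a minimal Python 3 sketch, not part of the original answer; the function name parse_gzip_header and the dictionary keys are made up for the example, and optional header fields (which only appear when flag bits are set) are not handled.

```python
import struct

def parse_gzip_header(data):
    """Decode the fixed 10-byte gzip member header (RFC 1952, section 2.3)."""
    if len(data) < 10:
        raise ValueError("need at least 10 bytes")
    # '<' = little-endian, no padding:
    # ID1, ID2, CM, FLG (1 byte each), MTIME (4 bytes), XFL, OS (1 byte each)
    id1, id2, cm, flg, mtime, xfl, os_code = struct.unpack("<BBBBIBB", data[:10])
    return {
        "is_gzip": (id1, id2) == (0x1F, 0x8B),  # gzip magic bytes
        "method": cm,          # 8 means deflate
        "flags": flg,          # 0 means no optional fields follow
        "mtime": mtime,        # 0 means no timestamp available
        "extra_flags": xfl,
        "os": os_code,         # 0 means FAT filesystem
    }

# The first 10 bytes from the hex dump above.
header = bytes.fromhex("1f8b0800000000000000")
info = parse_gzip_header(header)
print(info)
```

Checking "is_gzip" and "method" before decompressing is a cheap way to decide which wbits value to try.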

So, on to the Python:

>>> import requests
>>> url = 'https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-18/segments/1461860106452.21/warc/CC-MAIN-20160428161506-00006-ip-10-239-7-51.ec2.internal.warc.gz'
>>> response = requests.get(url, headers={"Range": "bytes=257173173-257248267"})
>>> unknown_compressed_data = response.content

Notice anything similar?

>>> unknown_compressed_data[:10]
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00'

Now on to the decompression. Let's just try wbits values at random, based on the zlib documentation:

>>> import zlib

"zlib.error: Error -2 while preparing to decompress data: inconsistent stream state":

>>> zlib.decompress(unknown_compressed_data, -31)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
zlib.error: Error -2 while preparing to decompress data: inconsistent stream state

"Error -3 while decompressing data: incorrect header check":

>>> zlib.decompress(unknown_compressed_data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
zlib.error: Error -3 while decompressing data: incorrect header check

"zlib.error: Error -3 while decompressing data: invalid distance too far back":

>>> zlib.decompress(unknown_compressed_data, 30)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
zlib.error: Error -3 while decompressing data: invalid distance too far back

Possible solution:

>>> zlib.decompress(unknown_compressed_data, 31)
'WARC/1.0\r\nWARC-Type: response\r\nWARC-Date: 2016-04-28T20:14:16Z\r\nWARC-Record-ID: <urn:uu
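The wbits values tried above select which container format zlib expects. As far as the Python zlib documentation describes it: 8 to 15 means a zlib header, -8 to -15 means a raw deflate stream, 16 + (8 to 15) means a gzip header and trailer, and 32 + (8 to 15) auto-detects zlib or gzip. That would explain the results: -31 is outside the valid range, the default expects a zlib header, 30 (16 + 14) asks for gzip with a window smaller than the one the data was compressed with, and 31 (16 + 15) is gzip with the full 32 KiB window. A minimal Python 3 sketch, using locally compressed sample data instead of the download and hypothetical helper names:

```python
import gzip
import zlib

def gunzip_bytes(data):
    """Decompress a complete gzip member held in memory."""
    # 16 + MAX_WBITS (= 31): expect a gzip header and trailer, 32 KiB window.
    return zlib.decompress(data, 16 + zlib.MAX_WBITS)

def inflate_auto(data):
    """Decompress data that is either gzip- or zlib-wrapped."""
    # 32 + MAX_WBITS (= 47): auto-detect a gzip or zlib header.
    return zlib.decompress(data, 32 + zlib.MAX_WBITS)

# Sample payload compressed locally so the sketch runs without network access.
payload = b"WARC/1.0\r\nWARC-Type: response\r\n" * 50
gz_data = gzip.compress(payload)    # gzip container (starts with 1f 8b)
zlib_data = zlib.compress(payload)  # zlib container

print(gunzip_bytes(gz_data) == payload)
print(inflate_auto(gz_data) == payload, inflate_auto(zlib_data) == payload)

# The default wbits (15) expects a zlib header, so gzip input is rejected.
try:
    zlib.decompress(gz_data)
except zlib.error as exc:
    print("default wbits failed:", exc)
```

Writing the value as 16 + zlib.MAX_WBITS rather than a bare 31 makes the intent (gzip wrapper, maximum window) easier to read.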

Zbyněk Winkler · Accepted Answer · 2016-04-07 12:46:38Z


According to https://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html the headers and the body are separated by an empty line containing only CRLF characters. You could try

client_split = client_reply.split("\r\n\r\n",1)
print client_split[1].decode('zlib')

The split finds the empty line, and the additional parameter limits the number of splits to one, so the result is a list with two items: the headers and the body. But it is hard to recommend anything without knowing more about your code and the actual string being split.
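For Python 3, where the 'zlib' text codec is not available, the same idea could be sketched on bytes as follows. The helper names and the sample response are hypothetical and built locally for illustration. The sketch assumes the whole response has already been received, that the body is not sent with Transfer-Encoding: chunked, and that the body is a single gzip member.

```python
import gzip
import zlib

def split_http_response(raw):
    """Split raw response bytes into (header_block, body) at the first blank line."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header/body separator found")
    return head, body

def decode_body(raw):
    """Return the body, gunzipped when Content-Encoding says gzip."""
    head, body = split_http_response(raw)
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip().lower()
    if headers.get(b"content-encoding") == b"gzip":
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    return body

# Hypothetical response assembled locally; a real one would come from recv().
compressed = gzip.compress(b"hello\r\nworld")
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Encoding: gzip\r\n"
       b"Content-Length: " + str(len(compressed)).encode() + b"\r\n"
       b"\r\n") + compressed
print(decode_body(raw))
```

Splitting only once matters because the compressed bytes may themselves contain \r\n; everything after the first blank line is kept intact as the body.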


2 Comments

Ed_ Over a year ago

This produces "zlib is not a text encoding".

Zbyněk Winkler Over a year ago

Well, at the time it was Python 2 code, where such an encoding existed. For Python 3, zlib.decompress (docs.python.org/3/library/zlib.html#zlib.decompress) needs to be used.
