Python (requests) - incorrect encoding when fetching headers

Question

I am using the requests library (Python 3.9) to get a filename from a URL.^[1] For some reason the file name is incorrectly encoded: I should get "Ogłoszenie_0320.pdf", but I get "OgÅ\x82oszenie_0320.pdf" instead.

My code looks something like this:

import os
import re
from urllib.parse import urlparse

import requests

def getFilenameFromRequest(url : str, headers):
    # Parses from header information
    contentDisposition = headers.get('content-disposition')
    if contentDisposition:
        filename = re.findall('filename=(.+)', contentDisposition)
        print("oooooooooo: " + contentDisposition + " : " + str(filename))
        if len(filename) != 0:
            return filename[0]

    # Parses from url
    parsedUrl = urlparse(url)
    return os.path.basename(parsedUrl.path)

def getFilenameFromUrl(url : str):
    request = requests.head(url)
    headers = request.headers
    return getFilenameFromRequest(url, headers)

getFilenameFromUrl('https://przedszkolekw.bip.gov.pl'+
    '/fobjects/download/880287/ogloszenie-uzp-nr-613234-pdf.html')

Any idea how to fix it? I know that for a standard request I can set the encoding directly:

request.encoding = 'utf-8'

But what am I supposed to do in this case?

^[1] https://przedszkolekw.bip.gov.pl/fobjects/download/880287/ogloszenie-uzp-nr-613234-pdf.html

There's a "fix that for you" library, ftfy (ftfy.readthedocs.io/en/latest), that might help if you can't solve this the proper way. It automatically fixes problems like this so you don't have to.
– 576i, Commented May 19, 2021 at 9:39
I applied the changes needed to make it reproducible. It's still not a minimal reproducible example - this is a one-time service. Please keep in mind that you should care about the work needed by others. Thanks!
– Wolf, Commented May 19, 2021 at 12:57
"How to create a Minimal, Reproducible Example" (Help Center, Stack Overflow) is of course easier to read than the collection linked in the above comment.
– Wolf, Commented May 19, 2021 at 13:21

Lucas Scott · Accepted Answer · 2021-05-19 12:48:10Z

3

Header values should only contain characters from the ASCII-based Latin-1 character set [rfc]. Here the file name was sent as UTF-8 bytes, and those bytes were then decoded as Latin-1, which is why it looks escaped.

>>> s = "Ogłoszenie_0320.pdf"
>>> s.encode("utf8").decode("unicode-escape")
'OgÅ\x82oszenie_0320.pdf'

To reverse the process you can do

>>> sx = 'OgÅ\x82oszenie_0320.pdf'
>>> sx.encode("latin-1").decode("utf8")
'Ogłoszenie_0320.pdf'

(updated after conversation in comments)
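
The round trip above can be wrapped in a small helper. This is only a sketch: `repair_header_text` is a hypothetical name, not part of requests, and it assumes the header text reached Python as a Latin-1 decode of bytes that were really UTF-8. If the value does not fit that pattern, it is returned unchanged.

```python
def repair_header_text(value: str) -> str:
    """Undo a Latin-1 decode of bytes that were really UTF-8.

    Hypothetical helper for illustration; not part of requests.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the bytes
        # are not valid UTF-8: assume it was fine and leave it alone.
        return value


print(repair_header_text("OgÅ\x82oszenie_0320.pdf"))  # Ogłoszenie_0320.pdf
print(repair_header_text("plain.pdf"))  # plain.pdf
```

In the question's code this could be applied to `filename[0]` before returning it. The fallback matters because a header that is already correct, or that genuinely uses Latin-1 characters, would otherwise raise an exception.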

edited May 19, 2021 at 12:48

answered May 19, 2021 at 9:57

Lucas Scott


6 Comments

aneroid Over a year ago

@Lucas Also works with filename.encode('latin-1').decode('utf-8'). Btw, wrt "only ascii characters should be used as header values" - that's correct but there's clearly an Å in the header. I suspect it's because the server is incorrectly decoding the filename for the string version - and browsers know to handle this situation. +1

aneroid Over a year ago

@Wolf Yeah, I noticed that happening too. So instead, I went the long way round - with the filename copied to clipboard, did pd.read_clipboard().columns.values.tolist()[0].replace('"', '')

Wolf Over a year ago

If ASCII is a 7-bit encoding, there must be something wrong with the header.

Lucas Scott Over a year ago

@wolf I just tried this in a Python shell in iTerm2. I have my shell encoding set to en_US.UTF8

Lucas Scott Over a year ago

@aneroid thanks for the suggestion for clearing up the language. The Å is valid extended ascii theasciicode.com.ar/extended-ascii-code/…
