I am using the requests library (Python 3.9) to get a filename from a URL.[1] For some reason the file name is incorrectly encoded: I get "OgÅ\x82oszenie_0320.pdf" when I should get "Ogłoszenie_0320.pdf".

My code looks something like this:

import os
import re
from urllib.parse import urlparse

import requests

def getFilenameFromRequest(url : str, headers):
    # Parses from header information
    contentDisposition = headers.get('content-disposition')
    if contentDisposition:
        filename = re.findall('filename=(.+)', contentDisposition)
        print("oooooooooo: " + contentDisposition + " : " + str(filename))
        if len(filename) != 0:
            return filename[0]

    # Parses from url
    parsedUrl = urlparse(url)
    return os.path.basename(parsedUrl.path)

def getFilenameFromUrl(url : str):
    request = requests.head(url)
    headers = request.headers
    return getFilenameFromRequest(url, headers)

getFilenameFromUrl('https://przedszkolekw.bip.gov.pl'+
    '/fobjects/download/880287/ogloszenie-uzp-nr-613234-pdf.html')

Any idea how to fix it? I know that for a standard request I can set the encoding directly:

request.encoding = 'utf-8'

But what am I supposed to do in this case, where the garbled text comes from a header rather than the response body?


[1] https://przedszkolekw.bip.gov.pl/fobjects/download/880287/ogloszenie-uzp-nr-613234-pdf.html

  • There's a library "fix that for you" ftfy.readthedocs.io/en/latest that might help you if you can't solve this the proper way. It automatically fixes problems like this so you don't have to. Commented May 19, 2021 at 9:39
  • OS: Windows 10, language: Polish Commented May 19, 2021 at 10:37
  • I applied the changes needed to make it reproducible. It's still not a minimal-reproducible-example - this is a one-time service. Please keep in mind that you should care about the work needed by others. Thanks! Commented May 19, 2021 at 12:57
  • How to create a Minimal, Reproducible Example - Help Center - Stack Overflow is of course easier to read than the collection linked in above comment. Commented May 19, 2021 at 13:21

1 Answer

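Only characters from the ASCII-based Latin-1 (ISO-8859-1) character set should be used as header values [rfc]. Here the UTF-8 bytes of the file name have each been turned into a separate character, which is what produces the garbled name.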

>>> s = "Ogłoszenie_0320.pdf"
>>> s.encode("utf8").decode("unicode-escape")
'OgÅ\x82oszenie_0320.pdf'
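
To see where the extra character comes from, it can help to look at the bytes. The snippet below is only an illustration and assumes the server sent the name as raw UTF-8 bytes, which were then read as Latin-1 on the client side. The variable names are made up for the example.

```python
name = "Ogłoszenie_0320.pdf"

# Assumption: this is what the server put on the wire.
raw = name.encode("utf-8")

# "ł" (U+0142) takes two bytes in UTF-8.
assert raw[2:4] == b"\xc5\x82"

# Reading those bytes as Latin-1 maps every byte to one character,
# so the two bytes of "ł" become "Å" and the control character "\x82".
garbled = raw.decode("latin-1")
print(repr(garbled))
```

The garbled string is one character longer than the original because the single letter "ł" was split into two.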

To reverse the process you can do

>>> sx = 'OgÅ\x82oszenie_0320.pdf'
>>> sx.encode("latin-1").decode("utf8")
'Ogłoszenie_0320.pdf'

(updated after conversation in comments)
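
A minimal sketch of how this could be folded into the header parsing from the question. The function names and the sample header value are hypothetical, and the sketch assumes a plain `filename=` parameter; the RFC 5987 `filename*=` form is not handled here.

```python
import re
from typing import Optional


def fix_header_text(value: str) -> str:
    """Undo UTF-8 bytes that were read as Latin-1, if that is what happened."""
    try:
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The value was not garbled in this particular way; leave it alone.
        return value


def filename_from_content_disposition(header: str) -> Optional[str]:
    """Extract the filename parameter and repair its encoding."""
    match = re.search(r'filename="?([^";]+)"?', header)
    if match is None:
        return None
    return fix_header_text(match.group(1))


# Hypothetical header value, modelled on the garbled name in the question.
sample = 'attachment; filename="OgÅ\x82oszenie_0320.pdf"'
print(filename_from_content_disposition(sample))
```

The try/except is there so that names which are plain ASCII, or which really are Latin-1, pass through unchanged instead of raising.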

6 Comments

@Lucas Also works with filename.encode('latin-1').decode('utf-8'). Btw, wrt "only ascii characters should be used as header values" - that's correct but there's clearly an Å in the header. I suspect it's because the server is incorrectly decoding the filename for the string version - and browsers know to handle this situation. +1
@Wolf Yeah, I noticed that happening too. So instead, I went the long way round - with the filename copied to clipboard, did pd.read_clipboard().columns.values.tolist()[0].replace('"', '')
If ASCII is a 7-bit encoding, there must be something wrong with the header.
@wolf I just tried this in a Python shell in iTerm2. I have my shell encoding set to en_US.UTF8
@aneroid thanks for the suggestion for clearing up the language. The Å is valid extended ascii theasciicode.com.ar/extended-ascii-code/…