
I have a webhook built with Flask-RESTful that receives several parameters via POST. One of the parameters is a non-Unicode string, encoded in cp1251.

I can't find a way to parse this argument correctly using reqparse.

Here is the fragment of my code:

from flask_restful import reqparse

parser = reqparse.RequestParser()
parser.add_argument('text')
msg = parser.parse_args()

Then, I write msg to a text file, and it looks like this:

{"text": "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd !\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\n\n-- \n\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd."}

As you can see, Flask somehow replaces all Cyrillic characters with \ufffd. At the same time, non-Cyrillic characters, like ! or \n are processed correctly.

Is there any way to tell RequestParser which string encoding to use?

Here is my code for writing the text to disk:

 with open('log_msg.txt', 'w+') as f:
     f.write(json.dumps(msg))

I also tried f = open('log_msg.txt', 'w+', encoding='cp1251'), with the same result.

Then, I tried

 f = open('log_msg_ascii.txt', 'w+')
 f.write(ascii(json.dumps(msg)))

Also, no difference.

So I'm fairly sure that RequestParser() is trying to be too smart and can't handle the non-Unicode input.

Thanks!

  • U+FFFD is the replacement character. Can you show the code for writing the text to disk? Also, please show the output of ascii(msg), to see if the problem happens at parsing or during writing. Commented Feb 15, 2020 at 20:31
  • I just updated the question with additional info. Commented Feb 18, 2020 at 1:03
  • Have you seen that the reqparse module is deprecated? If this is a bug in reqparse, it might not get fixed. But I suspect that the Cyrillic characters get replaced with U+FFFD at an earlier stage. You should inspect (and maybe add here) what the value looks like directly on the flask.request object and how the payload is sent by the client. Commented Feb 18, 2020 at 9:06
  • @lenz I saved the raw request using request.get_data(), and the data there is correct. So I assume it's a reqparse bug... Yes, now I see that it's deprecated. I will try to parse the raw data or find another way. Thanks for your help! Commented Feb 18, 2020 at 21:37
  • @lenz I just made it work and posted the solution as an answer here. Commented Feb 18, 2020 at 23:26

1 Answer

Okay, I finally found a workaround. Thanks to @lenz for helping me with this issue. It seems that reqparse wrongly assumes that every string parameter arrives as UTF-8. So when it sees a non-Unicode input field (among other Unicode fields!), it tries to decode it as UTF-8 and fails. As a result, every Cyrillic character in that field becomes U+FFFD (the replacement character), while ASCII characters come through intact.
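The effect described above can be reproduced without Flask. The snippet below is a minimal sketch, not the actual reqparse code path: it assumes the field is decoded as UTF-8 with replacement of invalid bytes, and the sample string is hypothetical.

```python
# Hypothetical sample value; the real webhook payload is not shown here.
original = "Привет !"
cp1251_bytes = original.encode("cp1251")

# Decoding cp1251 bytes as UTF-8 with errors="replace": the Cyrillic bytes
# are not valid UTF-8 and turn into U+FFFD, while ASCII bytes survive.
mangled = cp1251_bytes.decode("utf-8", errors="replace")

# Decoding the same bytes with the correct codec restores the text.
restored = cp1251_bytes.decode("cp1251")

print(ascii(mangled))
print(restored == original)
```

Once the replacement has happened, the original bytes are gone, so no choice of file encoding at write time can bring them back. Note also that json.dumps escapes non-ASCII characters by default (ensure_ascii=True), which is why the log file shows literal \ufffd sequences regardless of the encoding passed to open().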

So, to access that non-Unicode field, I did the following trick.

I load the raw request body with get_data(), decode it as cp1251, and extract the field with a simple regexp.

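 import re
 from flask import request

 # Read the undecoded body and decode it ourselves ('windows-1251' is an alias of cp1251).
 raw_data = request.get_data()
 contents = raw_data.decode('windows-1251')
 match = re.search(r'(?P<delim>--\w+\r?\n)Content-Disposition: form-data; name=\"text\"\r?\n(.*?)(?P=delim)', contents, re.MULTILINE | re.DOTALL)
 text = match.group(2)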

Not the most beautiful solution, but it works.
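A possibly less brittle alternative to the regexp is to let the standard library's MIME parser split the multipart body and then decode only the one field as cp1251. This is a sketch under assumptions, not something taken from the answer above: the helper name extract_form_field, the sample boundary, and the sample body are all hypothetical, and it assumes the request is ordinary multipart/form-data.

```python
import email


def extract_form_field(raw_body, content_type, field_name, encoding):
    """Return one multipart/form-data field decoded with the given codec."""
    # The MIME parser needs the Content-Type header (it carries the boundary)
    # in front of the body, so prepend it.
    prefix = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n"
    message = email.message_from_bytes(prefix + raw_body)
    if not message.is_multipart():
        return None
    for part in message.get_payload():
        name = part.get_param("name", header="content-disposition")
        if name == field_name:
            # decode=True yields the part's raw bytes; the caller picks the codec.
            return part.get_payload(decode=True).decode(encoding)
    return None


# Hypothetical request body, for demonstration only.
sample_body = (
    b"--XBOUNDARY\r\n"
    b'Content-Disposition: form-data; name="id"\r\n\r\n'
    b"42\r\n"
    b"--XBOUNDARY\r\n"
    b'Content-Disposition: form-data; name="text"\r\n\r\n'
    + "Привет !".encode("cp1251")
    + b"\r\n--XBOUNDARY--\r\n"
)
sample_type = "multipart/form-data; boundary=XBOUNDARY"

print(extract_form_field(sample_body, sample_type, "text", "cp1251"))
```

Inside the Flask view, the call would presumably look like extract_form_field(request.get_data(), request.content_type, 'text', 'cp1251').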


1 Comment

Happy to see it works. I still think you should make sure it's not the client that is declaring the wrong encoding, and fix it on that end if you can.
