Parsing a non-Unicode string with Flask-RESTful

Question

I have a webhook developed with Flask-RESTful which gets several parameters with POST. One of the parameters is a non-Unicode string, encoded in cp1251.

I can't find a way to parse this argument correctly using reqparse.

Here is a fragment of my code:

from flask_restful import reqparse

parser = reqparse.RequestParser()
parser.add_argument('text')
msg = parser.parse_args()

Then, I write msg to a text file, and it looks like this:

{"text": "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd !\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\n\n-- \n\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd."}

As you can see, Flask somehow replaces all Cyrillic characters with \ufffd. At the same time, non-Cyrillic characters, like ! or \n, are processed correctly.

Is there any way to tell RequestParser which encoding the string uses?

Here is my code for writing the text to disk:

 import json

 # The with block closes the file even if json.dumps() raises.
 with open('log_msg.txt', 'w+') as f:
     f.write(json.dumps(msg))

I tried f = open('log_msg.txt', 'w+', encoding='cp1251') with the same result.

Then, I tried

 with open('log_msg_ascii.txt', 'w+') as f:
     f.write(ascii(json.dumps(msg)))

Also, no difference.
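
One likely reason the file encoding makes no difference, sketched here under the assumption that msg behaves like a plain dict of str values: json.dumps escapes every non-ASCII character as \uXXXX by default (ensure_ascii=True), so the file only ever contains ASCII. The \ufffd sequences in the log would then mean the replacement characters were already in the string before it was written. The sample values below are made up for illustration.

```python
import json

# Hypothetical sample values, not taken from the real webhook.
good = {"text": "Привет"}                # correctly decoded Cyrillic
broken = {"text": "\ufffd\ufffd\ufffd"}  # already damaged before writing

# With the default ensure_ascii=True, the output is pure ASCII either way,
# so the encoding passed to open() cannot change what ends up in the file.
good_json = json.dumps(good)
broken_json = json.dumps(broken)
print(good_json)
print(broken_json)

# ensure_ascii=False keeps the characters themselves; only then does the
# file encoding matter.
readable_json = json.dumps(good, ensure_ascii=False)
print(readable_json)
```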

So, I'm pretty sure that RequestParser() is trying to be too smart and can't handle the non-Unicode input.

Thanks!

U+FFFD is the replacement character. Can you show the code for writing the text to disk? Also, please show the output of ascii(msg), to see whether the problem happens at parsing or during writing.
– lenz, Commented Feb 15, 2020 at 20:31
Have you seen that the reqparse module is deprecated? If this is a bug in reqparse, it might not get fixed. But I suspect that the Cyrillic characters get replaced with U+FFFD at an earlier stage. You should inspect (and maybe add here) how the value looks directly on the flask.request object and how the payload is sent by the client.
– lenz, Commented Feb 18, 2020 at 9:06
@lenz I saved the raw request using request.get_data(), and the data there is correct. So I assume it's a reqparse bug... Yes, now I see that it's nearly deprecated. I will try to parse the raw data or find another way. Thanks for your help!
– Ildar Akhmetov, Commented Feb 18, 2020 at 21:37
@lenz I just made it work and posted the solution as an answer here.
– Ildar Akhmetov, Commented Feb 18, 2020 at 23:26

Ildar Akhmetov · Accepted Answer · 2020-02-18 23:26:18Z

Okay, I finally found a workaround. Thanks to @lenz for helping me with this issue. It seems that reqparse wrongly assumes that every string parameter arrives as UTF-8. So when it sees a cp1251-encoded field (among other UTF-8 fields!), it tries to decode it as UTF-8 and fails. As a result, every Cyrillic character becomes U+FFFD (the replacement character).
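
A minimal sketch of that effect, assuming the parser decodes the bytes as UTF-8 and replaces anything invalid instead of raising. The sample text is made up, and the exact decoding call inside reqparse is an assumption here, not something confirmed from its source.

```python
# Hypothetical sample text, encoded the way the client is assumed to send it.
original = "Привет !"
payload = original.encode("cp1251")

# Assumed parser behaviour: decode as UTF-8 and substitute U+FFFD for every
# byte sequence that is not valid UTF-8. ASCII bytes such as " " and "!"
# are valid UTF-8, so they survive.
garbled = payload.decode("utf-8", errors="replace")

# Decoding with the real encoding recovers the text.
recovered = payload.decode("cp1251")

print(ascii(garbled))
print(recovered)
```

This matches the symptom in the question: the Cyrillic letters turn into \ufffd while the punctuation and newlines come through intact.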

So, to access that non-Unicode field, I used the following trick.

I load the raw data using get_data(), decode it as cp1251, and parse it with a simple regexp.

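 import re
 from flask import request

 raw_data = request.get_data()  # raw request body as bytes
 contents = raw_data.decode('windows-1251')  # windows-1251 is an alias of cp1251
 # Capture everything between the "text" part header and the next boundary line.
 match = re.search(r'(?P<delim>--\w+\r?\n)Content-Disposition: form-data; name=\"text\"\r?\n(.*?)(?P=delim)', contents, re.MULTILINE | re.DOTALL)
 text = match.group(2)  # raises AttributeError if the field was not found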

Not the most beautiful solution, but it works.
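
To try the regexp without a running server, here is a self-contained sketch against a hand-built multipart body. The function name extract_text_field, the boundary string, and the sample fields are all hypothetical. Two caveats of this approach: the captured group includes the blank line after the part header and the trailing line break, so the sketch strips surrounding whitespace, and the pattern only matches when another part follows the "text" field, because the closing delimiter ends with an extra "--".

```python
import re

# Same pattern as in the answer, split over several lines for readability.
FIELD_RE = re.compile(
    r'(?P<delim>--\w+\r?\n)'
    r'Content-Disposition: form-data; name="text"\r?\n'
    r'(.*?)(?P=delim)',
    re.MULTILINE | re.DOTALL,
)


def extract_text_field(raw_data: bytes) -> str:
    """Decode the whole body as cp1251 and pull out the 'text' part."""
    contents = raw_data.decode("windows-1251")
    match = FIELD_RE.search(contents)
    if match is None:
        raise ValueError("no 'text' field found in the request body")
    # Drop the blank line after the part header and the trailing CRLF.
    return match.group(2).strip()


# Hypothetical request body; in the webhook this would be request.get_data().
sample_body = (
    '--boundary123\r\n'
    'Content-Disposition: form-data; name="text"\r\n'
    '\r\n'
    'Привет !\r\n'
    '--boundary123\r\n'
    'Content-Disposition: form-data; name="other"\r\n'
    '\r\n'
    'value\r\n'
    '--boundary123--\r\n'
).encode("windows-1251")

print(extract_text_field(sample_body))
```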

1 Comment

lenz Over a year ago

Happy to see it works. I still think you should make sure it's not the client that is declaring the wrong encoding, and fix it on that end if you can.
