I work with a payment API and it returns some XML. For logging I want to save the API response in my database.

One word in the API response is "manhã", but the API returns "manh�". Other characters like á or ç are returned correctly, so I guess this is a bug in the API.

But when trying to save this in my DB I get:

Postgres invalid byte sequence for encoding "UTF8": 0xc3 0x2f

How can I solve this?

I tried things like

response.encode("UTF-8") and also force_encoding, but all I get is:

Encoding::UndefinedConversionError ("\xC3" from ASCII-8BIT to UTF-8)

I need to either remove this wrong character or convert it somehow.

  • Are you sure that "a payment API" is giving you UTF-8 at all? Commented Oct 19, 2020 at 18:09
  • @AmigoJack the API returns an XML document in ISO-8859-1. My Rails table field is a normal "character varying". I have other APIs that return UTF-8 and I need to store them all in the same column. So I need to convert the API response in some way to be able to save it in the DB. Commented Oct 19, 2020 at 18:15
  • The XML starts like this: "<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>" Commented Oct 19, 2020 at 18:54
  • It should be obvious: ISO-8859-1 is a different encoding than UTF-8 - you have to convert one to the other instead of passing it thru unhandled. Commented Oct 20, 2020 at 0:20

1 Answer

You're on the right track: the encode method solves this. When the source encoding is known, pass it as the second argument:

response.encode('UTF-8', 'ISO-8859-1')
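
As a minimal sketch of what that call does, assume the response body arrives as ISO-8859-1 bytes tagged as binary (ASCII-8BIT), which matches the error in the question. The variable names and the sample string here are made up for illustration:

```ruby
# Hypothetical stand-in for the API response: "manhã" as ISO-8859-1 bytes
# (0xE3 is ã in ISO-8859-1), tagged as binary like a raw HTTP body.
raw = "manh\xE3".b

# Interpret the bytes as ISO-8859-1 and transcode them to UTF-8.
utf8 = raw.encode('UTF-8', 'ISO-8859-1')

puts utf8.encoding          # UTF-8
puts utf8.valid_encoding?   # true
puts utf8 == "manh\u00E3"   # true
```

Because the source encoding is passed explicitly, the ASCII-8BIT tag on the original string does not matter.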

Sometimes the input contains bytes that are invalid in the source encoding or have no equivalent in the target encoding. To avoid exceptions in those cases, you can tell Ruby how to handle them:

# This will transcode the string to UTF-8 and replace any invalid/undefined characters with '' (empty string)
response.encode('UTF-8', 'ISO-8859-1', invalid: :replace, undef: :replace, replace: '')
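
The replacement options only matter when a byte cannot be converted. A rough illustration, assuming you do not trust the declared encoding and treat the body as plain binary (the sample bytes mirror the 0xc3 0x2f pair from the Postgres error, and the variable names are hypothetical):

```ruby
# Hypothetical broken payload: 0xC3 followed by "/", tagged as binary.
raw = "manh\xC3/".b

# Without options this raises, just like the error in the question.
begin
  raw.encode('UTF-8')
rescue Encoding::UndefinedConversionError => e
  puts e.class
end

# With the options, every byte that has no UTF-8 mapping from binary
# (that is, every non-ASCII byte) is dropped.
cleaned = raw.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
puts cleaned   # manh/
```

Note that this discards all non-ASCII bytes, so it is a last resort; when the source really is ISO-8859-1, passing that encoding keeps the accented characters.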

This is all laid out in the Ruby docs for String - check them out!

---

Note: many people incorrectly assume that force_encoding will somehow fix encoding problems. force_encoding simply tags the string with the specified encoding; it does not transcode the bytes or replace/remove invalid characters. When you're converting between encodings, you must transcode so that characters in one character set are correctly represented in the other character set.

As pointed out in the comment section, force_encoding can still be part of the solution: first tag the string with its real encoding, then transcode it, as in response.force_encoding('ISO-8859-1').encode('UTF-8') (which is equivalent to the first example using encode above).
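
To make the difference between tagging and transcoding concrete, here is a small sketch under the same assumption as before (ISO-8859-1 bytes arriving tagged as binary; the names are illustrative only):

```ruby
raw = "manh\xE3".b   # hypothetical ISO-8859-1 bytes, tagged as binary

# force_encoding alone: the bytes stay the same, only the label changes.
tagged = raw.dup.force_encoding('UTF-8')
puts tagged.bytes == raw.bytes   # true
puts tagged.valid_encoding?      # false (a lone 0xE3 is not valid UTF-8)

# Tag with the real encoding first, then transcode: the bytes change.
converted = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts converted.valid_encoding?                       # true
puts converted == raw.encode('UTF-8', 'ISO-8859-1')  # true
```

force_encoding modifies its receiver in place, which is why the sketch calls dup first.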


3 Comments

The source encoding is known and it has no invalid sequences - there should neither be the need to "drop", nor to force something. Just convert it and let exceptions occur, as none are expected.
force_encoding will help: response.force_encoding('ISO-8859-1').encode('UTF-8') for example.
@muistooshort - yes, you can use force_encoding in that way. My point was more that people use force_encoding thinking that it will somehow transcode the string, when all it does is change what encoding the string is tagged as. I'll update the post for clarity around this point. Thanks! @AmigoJack - Good call - when both source and destination encodings are known, you shouldn't have to specify the replacement options, assuming of course that the input is valid in the source encoding. I'll update the answer with a bit more context.
