I am loading phone numbers from a text file. When I print each loaded line, the output looks fine. But when I store the lines in a list and display the list itself, the strings contain unexpected \x00 characters rather than plain text. When I write the contents of the list to a new file, I get stray characters and no line breaks, despite stripping the lines and specifying UTF-8.

original_file = open("original.txt", "r", encoding="UTF-8", errors="replace")
pl = []
for item in original_file:
    pl.append(item)
target_file = open("target.txt", "w", encoding="UTF-8")
for item in pl:
    target_file.write(item) # or .write("{}\n".format(item))
                            # neither gets me the desired new line

Original file contents:

(248) 370-0000
(706) 862-2128
(863) 763-8632
(682) 404-0051
(734) 667-2877
...

When I load the lines into the list pl and print each item:

for item in pl: print(item)

I get this:

(248) 370-0000
(706) 862-2128
(863) 763-8632
(682) 404-0051
(734) 667-2877

But when I simply evaluate the list name pl, I get this:

['\x00(\x006\x001\x000\x00)\x00 \x003\x009\x002\x00-\x003\x001\x001\x005\x00\n', '\x00(\x002\x001\x004\x00)\x00 \x009\x004\x001\x00-\x003\x008\x004\x001\x00\n', '\x00(\x003\x000\x004\x00)\x00 \x002\x001\x006\x00-\x002\x000\x009\x006\x00\n', '\x00(\x007\x002\x004\x00)\x00 \x003\x003\x007\x00-\x003\x005\x000\x004\x00\n', '\x00(\x002\x004\x008\x00)\x00 \x003\x007\x000\x00-\x000\x000\x000\x000\x00\n', '\x00(\x007\x000\x006\x00)\x00 \x008\x006\x002\x00-\x002\x001\x002\x008\x00\n', '\x00(\x008\x006\x003\x00)\x00 \x007\x006\x003\x00-\x008\x006\x003\x002\x00\n', '\x00(\x006\x008\x002\x00)\x00 \x004\x000\x004\x00-\x000\x000\x005\x001\x00\n', '\x00(\x007\x003\x004\x00)\x00 \x006\x006\x007\x00-\x002\x008\x007\x007\x00']

I bring this up because when I then take the items from pl and write them to the target file, instead of a list of phone numbers in the new text file I get this:

�3�9�2�-�3�1�1�5��(�2�1�4�)� �9�4�1�-�3�8�4�1��(�3�0�4�)� �2�1�6�-�2�0�9�6��(�7�2�4�)� �3�3�7�-�3�5�0�4��(�2�4�8�)� �3�7�0�-�0�0�0�0��(�7�0�6�)� �8�6�2�-�2�1�2�8��(�8�6�3�)� �7�6�3�-�8�6�3�2��(�6�8�2�)� �4�0�4�-�0�0�5�1��(�7�3�4�)� �6�6�7�-�2�8�7�7�

No newlines. Spaces between items instead.

1 Answer

Your original file is encoded as UTF-16, big-endian. Decoding one of your lines both ways shows the difference:

>>> bs = b'\x00(\x006\x001\x000\x00)\x00 \x003\x009\x002\x00-\x003\x001\x001\x005\x00\n'
>>> bs.decode('utf-8')
'\x00(\x006\x001\x000\x00)\x00 \x003\x009\x002\x00-\x003\x001\x001\x005\x00\n'
>>> bs.decode('utf-16-be')
'(610) 392-3115\n'

(The presence of a null byte b'\x00' before each ASCII character is a strong hint that the encoding is UTF-16, big-endian.)
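That hint can be turned into a rough check. The following is only a heuristic sketch, not a reliable detector: guess_utf16_endianness is a hypothetical helper name, and the "at least a quarter of the bytes are nulls" threshold is an arbitrary assumption that suits mostly-ASCII text such as phone numbers.

```python
def guess_utf16_endianness(raw: bytes):
    """Hypothetical helper: guess UTF-16 byte order from where null bytes fall.

    Returns "utf-16-be", "utf-16-le", or None when there is no clear pattern.
    """
    # A byte order mark, when present, settles the question directly.
    if raw.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if raw.startswith(b"\xff\xfe"):
        return "utf-16-le"

    # For ASCII text, UTF-16-BE puts the null byte first in each pair,
    # UTF-16-LE puts it second.
    even_nulls = raw[0::2].count(0)
    odd_nulls = raw[1::2].count(0)
    threshold = len(raw) // 4  # arbitrary cut-off, an assumption

    if even_nulls > odd_nulls and even_nulls >= threshold:
        return "utf-16-be"
    if odd_nulls > even_nulls and odd_nulls >= threshold:
        return "utf-16-le"
    return None


sample = "(610) 392-3115\n".encode("utf-16-be")
guess = guess_utf16_endianness(sample)
print(guess)
```

To apply it to a real file, read the first few hundred bytes in binary mode (open(path, "rb")) and pass them to the helper.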

Opening the file like this ought to work:

original_file = open("original.txt", "r", encoding="utf-16-be")
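Putting it together, here is a minimal sketch of the whole round trip: read the lines as UTF-16 big-endian, strip the trailing newline, and write them back out as UTF-8 with one number per line. The temporary directory and the sample file created at the top are stand-ins for your real original.txt, added only so the example runs on its own.

```python
import os
import tempfile

# Stand-in files in a temporary directory (an assumption for the example).
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "original.txt")
target_path = os.path.join(workdir, "target.txt")

# Create a small UTF-16-BE sample resembling the original file.
with open(source_path, "w", encoding="utf-16-be", newline="\n") as f:
    f.write("(248) 370-0000\n(706) 862-2128\n(863) 763-8632")

# Read with the encoding the file actually uses, dropping trailing newlines.
with open(source_path, "r", encoding="utf-16-be") as original_file:
    pl = [line.rstrip("\n") for line in original_file]

# Write as UTF-8, adding exactly one newline per item.
with open(target_path, "w", encoding="utf-8", newline="\n") as target_file:
    for item in pl:
        target_file.write(item + "\n")

print(pl)
```

Using with blocks also closes both files, so the target is flushed to disk before anything else reads it.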