How to replace a string in a file without changing encoding

Question

I'm trying to replace a single character '°' with '?' in an EDF file with binary encoding. I need to change all occurrences of it in the first line.

I cannot open it without specifying read binary. (The following fails with UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 3008: invalid start byte)

with open('heartbeat-baseline-700001.edf') as fin:
    lines = fin.readlines()

I ended up trying to replace it via this code

with open("heartbeat-baseline-700001.edf", "rb") as text_file:
    lines = text_file.readlines()
lines[1] = lines[1].replace(str.encode('°'), str.encode('?'))
for i,line in enumerate(lines):
    with open('heartbeat-baseline-700001_python.edf', 'wb') as fout:
        fout.write(line)

What I end up with is a file that is far smaller (7 KB vs 79 MB) and does not work.

What seems to be the issue with this code? Is there a simpler way to replace the character?

In your for loop, you are overwriting the file with a single line on each iteration, not (I assume) appending. Try ab instead of wb? Or swap the for and the with open so that every fout.write happens while the file is open.
– Chris, Commented Jun 16, 2022 at 3:49
Put the for loop inside the second with statement, not the other way around.
– mozway, Commented Jun 16, 2022 at 3:50
Providing an actual EDF file with the problem would enable better guidance.
– Mark Tolonen, Commented Jun 16, 2022 at 5:19
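
The fix suggested in these comments can be sketched roughly as follows: open the output file once and write every line inside that one `with` block. This is only a minimal sketch. The function name `copy_with_replacement` and the demo file names are hypothetical, and it assumes the bytes to replace sit in the first chunk returned by `readlines()`.

```python
import os
import tempfile


def copy_with_replacement(src, dst, old, new):
    """Copy src to dst, replacing `old` with `new` in the first line only."""
    with open(src, "rb") as fin:
        lines = fin.readlines()  # splits after each b"\n"; bytes are otherwise untouched
    if lines:
        lines[0] = lines[0].replace(old, new)
    # Open the output once, then write every line while it is still open.
    with open(dst, "wb") as fout:
        fout.writelines(lines)


# Demo on a throwaway file (hypothetical names, not the real EDF file).
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "in.bin")
dst_path = os.path.join(tmp_dir, "out.bin")
with open(src_path, "wb") as f:
    f.write(b"temp \xb0C\nraw \xb0 data\n")

copy_with_replacement(src_path, dst_path, b"\xb0", b"?")
with open(dst_path, "rb") as f:
    result = f.read()
print(result)
```

Because the replacement is one byte for one byte, the output file should have the same size as the input, which is a quick sanity check against the 7 KB symptom described in the question.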

zNaCly · Accepted Answer · 2022-06-16 04:56:01Z

When you're opening the file using 'wb' and writing, you are overwriting the entire file each time through. What you want is to control the write/read head, move it where you need it, and overwrite where needed.

A few changes are needed. First, you need a reference to the open file. I didn't test with open() here, though I'm sure it works. You can use plain file = open('filepath') or with open(), but either way you need the reference, and if you don't use with open(), you must call file.close() explicitly when you're done. Second, open the file using rb+, not wb, so that you're not truncating the file each time. Finally, control the read/write head so that you can position it correctly for your reads, writes, and overwrites.

def main():
    file = open("test.txt", "rb")
    filePos = None

    while True:
        # Read the file one byte at a time
        char = file.read(1)
        # An empty read means end of file. Break the loop with nothing found.
        if not char:
            break
        # When we find the byte we want, save its position and break the loop.
        # Compare bytes to bytes: decoding as ASCII would raise on non-ASCII bytes such as 0xb0.
        if char == b"*":
            # tell() points just past the byte we read, so step back by one.
            filePos = file.tell() - 1
            break

    # Resolve open/unneeded file pointers.
    file.close()

    # Nothing to replace if the byte was never found.
    if filePos is None:
        return

    # Open the file in rb+ mode for writing without truncating.
    fileWrite = open("test.txt", 'rb+')

    # Move the read/write head to the location we found our byte at.
    fileWrite.seek(filePos)

    # Overwrite our byte.
    fileWrite.write(b"?")

    # Close the file
    fileWrite.close()

if __name__ == "__main__":
    main()

**Starting file contents:** This is old data with this * weird symbol that needs replacing.

**Completed file contents:** This is old data with this ? weird symbol that needs replacing.
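
A more compact variant of the same in-place overwrite lets `bytes.find` locate the offset instead of reading one byte at a time. This is a sketch under the assumption that the file fits in memory. The helper name `replace_first_byte` and the demo file are hypothetical.

```python
import os
import tempfile


def replace_first_byte(path, old, new):
    """Overwrite the first occurrence of the single byte `old` with `new`, in place.

    Returns the offset that was patched, or -1 if `old` was not found.
    """
    if len(old) != 1 or len(new) != 1:
        raise ValueError("old and new must each be exactly one byte")
    with open(path, "rb+") as f:
        offset = f.read().find(old)  # -1 when the byte is absent
        if offset != -1:
            f.seek(offset)  # move the read/write head back to the match
            f.write(new)
    return offset


# Demo on a throwaway file (hypothetical name).
demo_path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(demo_path, "wb") as f:
    f.write(b"old data with this * weird symbol")

patched_at = replace_first_byte(demo_path, b"*", b"?")
with open(demo_path, "rb") as f:
    contents = f.read()
print(patched_at, contents)
```

Requiring `old` and `new` to be the same length keeps every other byte in the file at its original offset.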

Mark Tolonen · Accepted Answer · 2022-06-16 05:24:18Z

The file is not 100% encoded text, which is why it must be opened as binary. You can't use readlines() because the spec indicates that the ASCII header fields are space-padded, i.e., there are no newlines. Use 'r+b' to open the binary file for read/update, seek() to the offset of the byte you want to replace, and write(b'?') to write the question mark byte.

Example:

# Create a small example file
with open("heartbeat-baseline-700001.edf", "wb") as f:
    f.write(b'ABCDEFG')

# Change a byte
with open("heartbeat-baseline-700001.edf", "r+b") as f:
    f.seek(3)     # offset of "D"
    f.write(b'?') # change to "?"

# read and display
with open("heartbeat-baseline-700001.edf", "rb") as f:
    print(f.read())

Output:

b'ABC?EFG'

Another issue is the '°' character. It's not an ASCII character, which violates the spec (likely why you are replacing it), but what encoding is it using? It's probably ISO-8859-1 or Windows-1252, both of which encode it as the byte b'\xb0', but it could be UTF-8, which encodes it as the two bytes b'\xc2\xb0'. Assuming the former, that byte could occur in any of the data record integer fields as well, so be careful to replace the correct byte. The UTF-8 pair could occur in the data fields as well, but that is less likely.
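
The difference between these encodings can be checked directly in Python. The snippet below is only an illustration; the variable names are arbitrary.

```python
degree = "\u00b0"  # the degree sign

latin1_bytes = degree.encode("latin-1")  # ISO-8859-1
cp1252_bytes = degree.encode("cp1252")   # Windows-1252
utf8_bytes = degree.encode("utf-8")      # also what str.encode() uses by default

print(latin1_bytes, cp1252_bytes, utf8_bytes)
# b'\xb0' b'\xb0' b'\xc2\xb0'
```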

If you know the bad character is byte b'\xb0' and it occurs in the header before any data record entries, you could read the whole file, make a single replacement, and write the whole file back:

# read the whole file as binary
with open('heartbeat-baseline-700001.edf', 'rb') as f:
    data = f.read()

# replace the 1st 0xb0 byte found
data = data.replace(b'\xb0', b'?', 1)

# write the whole file back
with open('heartbeat-baseline-700001.edf', 'wb') as f:
    f.write(data)
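
To reduce the risk of touching a 0xb0 byte inside the data records, the replacement could be limited to the header. The sketch below assumes the standard EDF layout, in which bytes 184-191 of the file hold the header length as space-padded ASCII digits. The function name `replace_in_header` is hypothetical, and the demo uses a synthetic buffer rather than a real EDF file.

```python
def replace_in_header(data, old, new):
    """Replace `old` with `new` only inside the header portion of `data`.

    Assumes the standard EDF layout, where bytes 184-191 hold the header
    length in bytes as space-padded ASCII digits.
    """
    if len(old) != len(new):
        raise ValueError("replacement must keep the header length unchanged")
    header_len = int(data[184:192].decode("ascii"))
    return data[:header_len].replace(old, new) + data[header_len:]


# Synthetic demo: a 512-byte fake header followed by three "data" bytes.
header = bytearray(b" " * 512)
header[184:192] = b"512     "   # header length field
header[300:301] = b"\xb0"       # stray degree byte inside the header
data = bytes(header) + b"\xb0\x01\x02"

patched = replace_in_header(data, b"\xb0", b"?")
print(patched[300:301], patched[512:])
```

The 0xb0 byte after the header is left alone, while the one inside the header becomes a question mark.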

edited Jun 16, 2022 at 5:24

answered Jun 16, 2022 at 4:41


4 Comments

Warren Manuel Over a year ago

Thank you for your response, @Mark! I'll be quite honest, I'm a little overwhelmed with the encodings. I had to get permission to deidentify the records, but you can find the deidentified file [here](file.io/LcdyHq5NJ4cJ). What I do know is that the issue is with the header containing invalid characters at positions 3009/3017/3025, which is why I tried to update only the first line.

Mark Tolonen Over a year ago

@WarrenManuel The bad bytes are value B0 like in the last example. They are the first 3 B0 in the file, so you could use my last example but use 3 instead of 1 for the number to replace. You could even use data = data.replace(b'\xb0\x20\x20\x20', b'degC', 3) to replace the degree symbol and the space padding (B0 20 20 20) with degC if you wanted a more accurate unit than a question mark 🙂

Warren Manuel Over a year ago

Yes! This actually worked. But if you could indulge me once more: when I used str.encode('°') the output was b'\xc2\xb0' and not b'\xb0' as in your solution. Any idea what I'm doing wrong here?

Mark Tolonen Over a year ago

str.encode() defaults to UTF-8 encoding. Use '°'.encode('latin1') instead.
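
When the offsets of the bad bytes are known, as with the positions mentioned in this thread, the in-place 'r+b' approach from the answer could also be applied per offset, checking each byte before overwriting it. This is a hedged sketch: `patch_bytes_at` is a hypothetical helper, the demo file is synthetic, and the offsets 3008/3016/3024 assume the reported positions 3009/3017/3025 are 1-based.

```python
import os
import tempfile


def patch_bytes_at(path, offsets, expected, new):
    """Overwrite the byte at each offset with `new`, but only if it equals `expected`.

    Returns the list of offsets that were actually patched.
    """
    patched = []
    with open(path, "r+b") as f:
        for offset in offsets:
            f.seek(offset)
            if f.read(1) == expected:
                f.seek(offset)  # read(1) advanced the position; go back
                f.write(new)
                patched.append(offset)
    return patched


# Synthetic demo file with 0xb0 at three header-like offsets and one later offset.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.edf")
buf = bytearray(b" " * 4000)
for o in (3008, 3016, 3024, 3500):
    buf[o] = 0xB0
with open(demo_path, "wb") as f:
    f.write(buf)

done = patch_bytes_at(demo_path, [3008, 3016, 3024], b"\xb0", b"?")
with open(demo_path, "rb") as f:
    after = f.read()
print(done)
```

Checking the byte first means a wrong offset leaves the file unchanged instead of silently corrupting a data field.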
