
The following is my code. It works fine, and I get an output file with a pipe as the delimiter. However, I do not want a new file to be generated; rather, I would like the existing file to be rewritten with a pipe delimiter instead of a comma. I appreciate your inputs. I am new to Python and learning it on the go.

import csv

with open(dst1, encoding='utf-8', errors='ignore') as input_file:
    with open(dst2, 'w', encoding='utf-8', errors='ignore', newline='') as output_file:
        reader = csv.DictReader(input_file, delimiter=',')
        writer = csv.DictWriter(output_file, reader.fieldnames, delimiter='|')
        writer.writeheader()
        writer.writerows(reader)
2 Comments

  • Well, if everything fits in memory, just keep the data and rewrite it afterwards. If not, just use a temp file. Commented Sep 4, 2019 at 17:36
  • @snakecharmerb: Usually you'd do it the other way around; write a new file, then atomically replace the original file with the new file only when the new file has been completely written. Commented Sep 4, 2019 at 17:47

3 Answers

2

The only truly safe way to do this is to write to a new file, then atomically replace the old file with the new file. Any other solution risks data loss/corruption on power loss. The simple approach is to use the tempfile module to make a temporary file in the same directory (so atomic replace will work):

import csv
import os
import os.path
import tempfile

with open(dst1, encoding='utf-8', errors='ignore', newline='') as input_file, \
     tempfile.NamedTemporaryFile(mode='w', encoding='utf-8', newline='',
                                 dir=os.path.dirname(dst1), delete=False) as tf:
    try:
        reader = csv.DictReader(input_file)
        writer = csv.DictWriter(tf, reader.fieldnames, delimiter='|')
        writer.writeheader()
        writer.writerows(reader)
    except:
        # On error, remove temporary before reraising exception
        os.remove(tf.name)
        raise
    else:
        # The else block is optional: flush + fsync makes extra sure the
        # data is synced to disk, reducing the risk that the rename is
        # recorded before the data itself reaches the disk:
        tf.flush()
        os.fsync(tf.fileno())

# Atomically replace the original file with the temporary now that the
# with block has exited and the data is fully written
try:
    os.replace(tf.name, dst1)
except:
    # On error, remove temporary before reraising exception
    os.remove(tf.name)
    raise

0

Since you are simply replacing one single-character delimiter with another, the file size and the positions of all characters not being replaced do not change. That makes this a perfect scenario for opening the file in r+ mode, which lets you write the processed content back to the very same file you are reading, so no temporary file is ever needed:

import csv

with open(dst, encoding='utf-8', errors='ignore') as input_file, \
     open(dst, 'r+', encoding='utf-8', errors='ignore', newline='') as output_file:
    reader = csv.DictReader(input_file, delimiter=',')
    writer = csv.DictWriter(output_file, reader.fieldnames, delimiter='|')
    writer.writeheader()
    writer.writerows(reader)

EDIT: Please read @ShadowRanger's comment for limitations of this approach.
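One mitigation suggested in the comments below is to truncate the file after writing, so that if the new data ends up shorter than the original (see the quoting discussion there), no stale bytes are left at the end. A minimal sketch of that variant; it still does not protect against a crash partway through:

import csv

with open(dst, encoding='utf-8', errors='ignore') as input_file, \
     open(dst, 'r+', encoding='utf-8', errors='ignore', newline='') as output_file:
    reader = csv.DictReader(input_file, delimiter=',')
    writer = csv.DictWriter(output_file, reader.fieldnames, delimiter='|')
    writer.writeheader()
    writer.writerows(reader)
    # Cut the file off at the current write position, discarding any
    # leftover old content that the (possibly shorter) new data did
    # not overwrite.
    output_file.truncate()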

14 Comments

Actually, there's no guarantee the file size won't change. The default quoting rule for the csv module is csv.QUOTE_MINIMAL, which only quotes fields if they contain the delimiter, quote character or line terminator; if you change the delimiter from , to |, fields that previously required quoting due to embedded commas won't be quoted if they don't contain | (see the sketch at the end of these comments). And if the script is killed partway through (for whatever reason: power loss, program crash, user hits Ctrl-C), you'll end up with a mix of new and old data.
Good point. I'll leave my answer here still just in case the OP's actual CSV files don't involve any quoted fields and just want something minimal. But I agree that this is not a robust solution in general.
Note: You could probably fix the file size issue (though not the problems with power loss/crash/Ctrl-C) by adding output_file.truncate() after the writerows call. Leaves the (relatively unlikely) possibility that the new file data is so much larger that it overwrites part of the file before you get around to buffering the data from the file, but at least it doesn't risk trailing garbage.
Both the solutions worked. However, I am inclined towards using ShadowRanger's solution. Thank you for the help; I appreciate your time.
Hi again, on the same note: one of the columns in my CSV file has data with | embedded in it. How shall I remove it? Thank you!!
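To make the quoting behavior discussed in these comments concrete, here is a minimal, self-contained sketch (the sample data is hypothetical):

import csv
import io

# One field with an embedded comma, one with an embedded pipe.
rows = [{'name': 'Smith, John', 'note': 'a|b'}]

# With the default comma delimiter, QUOTE_MINIMAL quotes only the
# field containing a comma.
buf = io.StringIO()
csv.DictWriter(buf, ['name', 'note']).writerows(rows)
print(buf.getvalue())  # "Smith, John",a|b

# With a pipe delimiter, the comma field is no longer quoted (so the
# row length changes), while the field containing '|' is quoted instead;
# embedded pipes are escaped automatically rather than needing removal.
buf = io.StringIO()
csv.DictWriter(buf, ['name', 'note'], delimiter='|').writerows(rows)
print(buf.getvalue())  # Smith, John|"a|b"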
0

I'm not totally sure, but if the file is not too big, you can load the file into pandas using read_csv and then save it with whatever delimiter you like using the to_csv function. For example:

import pandas as pd

# Note: index=False keeps pandas from writing its row index as an extra column.
data = pd.read_csv(input_file, encoding='utf-8')
data.to_csv(output_file, sep='|', encoding='utf-8', index=False)

Hope this helps!!

2 Comments

This doesn't replace the original file... And even if you change it to do so by passing input_file to to_csv as well, it does risk data corruption (since it will be rewriting the file in place by truncating it, then writing out the new data, and a crash partway through will lose data). Beyond that, if the OP isn't already using pandas, adding it as a dependency is a pretty heavyweight solution.
Yeah, I do agree with you. But I think it is a neat solution. Thanks for bringing it to my notice
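For completeness, if one did want the pandas route without the in-place overwrite risk described above, it could be combined with the write-then-os.replace pattern from the accepted answer. A hedged sketch, assuming dst1 holds the path of the CSV being converted:

import os
import tempfile

import pandas as pd

data = pd.read_csv(dst1, encoding='utf-8')

# Write to a temporary file in the same directory, then atomically
# swap it in, as in the accepted answer.
fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dst1), suffix='.csv')
os.close(fd)  # pandas reopens the path itself
try:
    data.to_csv(tmp_path, sep='|', encoding='utf-8', index=False)
    os.replace(tmp_path, dst1)
except:
    # On error, remove the temporary before reraising
    os.remove(tmp_path)
    raise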
