Issue with Find and Replace using Python

Question

I have a set of .csv files with ; delimiter. There are certain junk values in the data that I need to replace with blanks. A sample problem row is:

103273;CAN D MAT;B.C.;B.C.;B.C.;03-Apr-2006

Desired row after find and replace is:

103273;CAN D MAT;;;;03-Apr-2006

In the above example I'm replacing ;B.C.; with ;;

I cannot replace just B.C. as I need to match the entire cell value for this particular error case. The code that I am using is:

import os, fnmatch

def findReplace(directory, filePattern):
    for path, dirs, files in os.walk(os.path.abspath(directory)):
        for filename in fnmatch.filter(files, filePattern):
            filepath = os.path.join(path, filename)
            with open(filepath) as f:
                s = f.read()
            for find, replace in zip([';#DIV/0!;', ';B.C.;'], [';;', ';;']):
                s = s.replace(find, replace)
            with open(filepath, "w") as f:
                f.write(s)

findReplace(*Path*, "*.csv")

The output that I'm instead getting is:

103273;CAN D MAT;;B.C.;;03-Apr-2006

Can someone please help with this issue?

Thanks in advance!

So basically you want to replace #DIV/0! and B.C. with '' (the empty string). Why not just do that, with a straightforward approach?
– bergerg, Commented Oct 7, 2017 at 12:42
The posted program will give '103273;CAN D MAT;;;;;;;03-Apr-2006' for the example input, which is different from what you wrote.
– janos, Commented Oct 7, 2017 at 12:43
@nutmeg: I also have the phrase B.C. coming in elsewhere (as part of a string in a cell). I only want to replace it where the entire cell value matches. Also, these two values are just representative. I have about 20 other values to replace, such as "January," with "January". Also, I am new to Python, so I'm not really sure what you mean by a straightforward approach.
– Sagar Joshi, Commented Oct 7, 2017 at 13:56
@janos: Thanks for pointing that out. I had missed a semicolon in the find string.
– Sagar Joshi, Commented Oct 7, 2017 at 13:58
@SagarJoshi I probably misunderstood you. In that case, trentcl's answer is OK. IMO using the re module is overkill in this case. Splitting the string on ; and replacing the entire cell value with the empty string is better.
– bergerg, Commented Oct 7, 2017 at 14:01

janos · Accepted Answer · 2017-10-07 14:08:30Z

2

The [find, replacement] pairs are not well-suited for your purpose. Replacing ; + value + ; with ;; is really just a complicated way of saying that you want to blank out every cell whose entire content is value.

So instead of using the [find, replacement] pairs, it will be more natural and straightforward to split the line on ; into fields, replace the values that are considered junk with the empty string, and then join the values again:

JUNK = frozenset(['#DIV/0!', 'B.C.'])

def clean(s):
    return ';'.join(map(lambda x: '' if x in JUNK else x, s.split(';')))

You could use this function in your implementation (or copy it inline):

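import os, fnmatch

def findReplace(directory, filePattern):
    for path, dirs, files in os.walk(os.path.abspath(directory)):
        for filename in fnmatch.filter(files, filePattern):
            filepath = os.path.join(path, filename)

            cleaned_lines = []
            with open(filepath) as f:
                # Split into lines first: iterating over f.read() directly
                # would yield single characters, not lines.
                for line in f.read().split('\n'):
                    cleaned_lines.append(clean(line))

            with open(filepath, "w") as f:
                # Joining on '\n' mirrors the split above, so a trailing
                # newline in the original file is preserved.
                f.write('\n'.join(cleaned_lines))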
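
The asker mentions in the comments that some of the roughly 20 values need a non-empty replacement (for example "January," becoming "January"). A minimal sketch of how the same split-and-join idea might be extended with a dictionary is below. The REPLACEMENTS table and the name clean_with_mapping are hypothetical, chosen only for illustration:

```python
# Hypothetical table: whole-cell value -> replacement (illustrative only).
REPLACEMENTS = {
    '#DIV/0!': '',
    'B.C.': '',
    'January,': 'January',
}

def clean_with_mapping(line, replacements=REPLACEMENTS):
    """Replace each field that exactly matches a key; leave other fields as-is."""
    fields = line.split(';')
    # dict.get falls back to the original field when there is no exact match,
    # so 'B.C. Ferries' is untouched while a bare 'B.C.' is blanked.
    return ';'.join(replacements.get(field, field) for field in fields)

print(clean_with_mapping('103273;CAN D MAT;B.C.;B.C.;B.C.;03-Apr-2006'))
print(clean_with_mapping('1;January,;B.C. Ferries;#DIV/0!'))
```

Because the lookup is on the whole field, a value that merely contains B.C. is not affected.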

edited Oct 7, 2017 at 14:08

answered Oct 7, 2017 at 12:44

janos


3 Comments

Sagar Joshi Over a year ago

Corrected the typo. I don't know much about how it exactly works but from what I understand, the program is picking up the first and the last ;B.C.; as strings to be replaced and ignoring the one in the middle.

janos Over a year ago

@SagarJoshi oh I see. I rewrote my answer.

janos Over a year ago

@SagarJoshi do you need more help with this?

trent · Accepted Answer · 2017-10-07 13:17:32Z

1

str.replace only replaces non-overlapping occurrences: once it has made one replacement, it continues scanning from the next character after the text it just matched. Adjacent ;B.C.; cells share a semicolon, so the matches overlap and the one in the middle is skipped.
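
This behaviour can be reproduced with the row from the question. The sketch below also shows one possible workaround that keeps plain str.replace, by repeating it until nothing matches. The helper name replace_until_stable is made up for this example, and it assumes the replacement text never contains the search text (otherwise the loop would not terminate):

```python
row = "103273;CAN D MAT;B.C.;B.C.;B.C.;03-Apr-2006"

# Single pass: the middle cell survives because the ';' in front of it
# was already consumed by the first match.
once = row.replace(";B.C.;", ";;")
print(once)

def replace_until_stable(s, find, replacement):
    """Hypothetical helper: repeat str.replace until no occurrence is left.

    Assumes `replacement` does not contain `find`.
    """
    while find in s:
        s = s.replace(find, replacement)
    return s

print(replace_until_stable(row, ";B.C.;", ";;"))
```

The first print shows the output reported in the question; the second shows the desired row.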

You can use the re module to replace B.C. only when it occurs between two ;, using lookahead and lookbehind assertions:

>>> import re
>>> s = "103273;CAN D MAT;B.C.;B.C.;B.C.;03-Apr-2006"
>>> re.sub(r'(?<=;)B[.]C[.](?=;)', "", s)
'103273;CAN D MAT;;;;03-Apr-2006'

... But in this case it may be better to split the line into fields on ;, replace the fields that match the strings you want to erase, and join the strings together again.

>>> fields = s.split(';')
>>> for i, f in enumerate(fields):
...     if f in ('B.C.', '#DIV/0!'):
...         fields[i] = ''
... 
>>> ';'.join(fields)
'103273;CAN D MAT;;;;03-Apr-2006'

This has two main advantages: you don't have to write a fairly complex regular expression for each replaced string; and it will still work if one of the fields is at the beginning or end of the line.
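
To illustrate the second point, here is a small comparison on a made-up line where the junk value also sits in the first and last field. The function names are hypothetical. The regular expression is the one shown above, and the split version follows the loop above:

```python
import re

JUNK = ('B.C.', '#DIV/0!')

def blank_with_regex(line):
    # Only matches B.C. when there is a ';' on both sides.
    return re.sub(r'(?<=;)B[.]C[.](?=;)', '', line)

def blank_with_split(line):
    # Compares whole fields, so the position in the line does not matter.
    return ';'.join('' if field in JUNK else field for field in line.split(';'))

line = 'B.C.;CAN D MAT;B.C.;B.C.'
print(blank_with_regex(line))
print(blank_with_split(line))
```

The regex version leaves the first and last B.C. in place because neither has a semicolon on both sides, while the split version blanks all three.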

For any CSV parsing more complicated than this (for example, if any fields can contain quoted ; characters, or if the file has a header that should be skipped), look into the csv module.
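
As a rough sketch of what that might look like, the example below uses csv.reader and csv.writer with delimiter=';'. It assumes that fields containing a literal ; are wrapped in double quotes, which may not hold for the asker's files. The function name clean_csv_text and the sample data are made up, and in-memory strings stand in for real files:

```python
import csv
import io

JUNK = {'#DIV/0!', 'B.C.'}

def clean_csv_text(text, delimiter=';'):
    """Blank out cells whose whole value is in JUNK, respecting CSV quoting."""
    reader = csv.reader(io.StringIO(text, newline=''), delimiter=delimiter)
    out = io.StringIO(newline='')
    writer = csv.writer(out, delimiter=delimiter, lineterminator='\n')
    for row in reader:
        writer.writerow(['' if cell in JUNK else cell for cell in row])
    return out.getvalue()

sample = (
    '103273;CAN D MAT;B.C.;B.C.;B.C.;03-Apr-2006\n'
    '1;"x;B.C.;y";B.C.;ok;#DIV/0!;d\n'
)
print(clean_csv_text(sample))
```

The quoted field "x;B.C.;y" is read as a single cell, so the B.C. inside it is left alone, and the writer re-quotes it on output because it contains the delimiter.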

edited Oct 7, 2017 at 13:17

answered Oct 7, 2017 at 13:10

trent


3 Comments

Sagar Joshi Over a year ago

I'll try this out. I'm not very sure about the part that joins the strings. The data is a bit of a mess, and it contains both commas and semicolons as part of cell values as well. (The file is a CSV with a semicolon delimiter, but the strings also contain these characters.)

trent Over a year ago

@SagarJoshi If the literal semicolons are quoted, e.g. appear in the data as a\;b or "a;b", then you should use the csv module to parse it. If that won't work, regular expressions may be the best way to go (although not necessarily exactly as I've done here).

trent Over a year ago

(It may be worth stating explicitly that csv supports many dialects, such as ;-delimited files and various forms of quoting; see "Dialects and Formatting Parameters" in the csv module documentation.)
