I have a set of .csv files with ; delimiter. There are certain junk values in the data that I need to replace with blanks. A sample problem row is:

103273;CAN D MAT;B.C.;B.C.;B.C.;03-Apr-2006

Desired row after find and replace is:

103273;CAN D MAT;;;;03-Apr-2006

In the above example I'm replacing ;B.C.; with ;;

I cannot replace just B.C. as I need to match the entire cell value for this particular error case. The code that I am using is:

import os, fnmatch

def findReplace(directory, filePattern):
        for path, dirs, files in os.walk(os.path.abspath(directory)):
            for filename in fnmatch.filter(files, filePattern):
                filepath = os.path.join(path, filename)
                with open(filepath) as f:
                    s = f.read()
                for find, replace in zip([';#DIV/0!;', ';B.C.;'], [';;', ';;']):
                    s = s.replace(find, replace)
                with open(filepath, "w") as f:
                    f.write(s)

findReplace(*Path*, "*.csv")

The output that I'm instead getting is:

103273;CAN D MAT;;B.C.;;03-Apr-2006

Can someone please help with this issue?

Thanks in advance!

  • so basically you want to replace #DIV/0! and B.C. with `` (empty string). Why not just do that? With straight forward approach. Commented Oct 7, 2017 at 12:42
  • The posted program will give '103273;CAN D MAT;;;;;;;03-Apr-2006' for the example input, which is different from what you wrote. Commented Oct 7, 2017 at 12:43
  • @nutmeg: I also have the phrase B.C. coming in elsewhere (as a part of a string in a cell). I just wish to replace where the entire cell value matched this. Also, these two values are just representative. I have about 20 other values to replace such as "January," with "January". Also, I am new to python so not really sure what you mean by straight forward approach. Commented Oct 7, 2017 at 13:56
  • @janos: Thanks for pointing out. Had missed out a semi-colon in the Find string. Commented Oct 7, 2017 at 13:58
  • @SagarJoshi I probably misunderstood you. In that case, trentcl's answer is OK. IMO using the re module is an overkill in this case. Splitting the string with ; and replacing the entire cell value with the empty string is better. Commented Oct 7, 2017 at 14:01

2 Answers

The [find, replacement] pairs are not well suited to your purpose. Replacing ; + value + ; with ;; is really just a complicated way of saying that you want to blank out every cell whose entire content is value.

So instead of using the [find, replacement] pairs, it is more natural and straightforward to split each line on ; into fields, replace the values that are considered junk with an empty string, and then join the fields again:

JUNK = frozenset(['#DIV/0!', 'B.C.'])

def clean(s):
    return ';'.join(map(lambda x: '' if x in JUNK else x, s.split(';')))

You could use this function in your implementation (or copy it inline):

def findReplace(directory, filePattern):
    for path, dirs, files in os.walk(os.path.abspath(directory)):
        for filename in fnmatch.filter(files, filePattern):
            filepath = os.path.join(path, filename)

            cleaned_lines = []
            with open(filepath) as f:
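                # splitlines() yields whole lines without the trailing newline;
                # iterating over f.read() directly would yield single characters
                for line in f.read().splitlines():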
                    cleaned_lines.append(clean(line))

            with open(filepath, "w") as f:
                f.write('\n'.join(cleaned_lines))
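
The comments on the question mention about 20 values to fix, some of which map to something other than a blank (for example "January," to "January"). A minimal sketch of how the same split-and-join idea could handle that with a dict lookup follows; the REPLACEMENTS table and the second sample row are hypothetical, made up for illustration only:

```python
# Hypothetical table: whole-cell value -> replacement value.
REPLACEMENTS = {
    "#DIV/0!": "",
    "B.C.": "",
    "January,": "January",
}

def clean(line, replacements=REPLACEMENTS):
    # Only cells whose entire content is a key in the table are changed;
    # every other cell is passed through untouched.
    fields = line.split(";")
    return ";".join(replacements.get(field, field) for field in fields)

print(clean("103273;CAN D MAT;B.C.;B.C.;B.C.;03-Apr-2006"))
print(clean("1;January,;ABC B.C. Ltd;#DIV/0!"))
```

Because the lookup is on the whole cell, a value such as "ABC B.C. Ltd" that merely contains B.C. is left alone.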

3 Comments

Corrected the typo. I don't know much about how it exactly works but from what I understand, the program is picking up the first and the last ;B.C.; as strings to be replaced and ignoring the one in the middle.
@SagarJoshi oh I see. I rewrote my answer.
@SagarJoshi do you need more help with this?
1

str.replace scans left to right and, once it has made a replacement, continues from the character after the text it just replaced. The trailing ; of one ;B.C.; match is also the leading ; of the next, so when two ;B.C.; occurrences overlap like this, it will not replace both.
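
A small reproduction of this behaviour, using the sample row from the question (the variable names are only for illustration):

```python
row = "103273;CAN D MAT;B.C.;B.C.;B.C.;03-Apr-2006"

# The first match consumes the ';' that the second ';B.C.;' would need,
# so the middle field survives; the third one is matched again.
result = row.replace(";B.C.;", ";;")

print(result)  # 103273;CAN D MAT;;B.C.;;03-Apr-2006
```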

You can use the re module to replace B.C. only when it occurs between two ;, using lookahead and lookbehind assertions:

>>> import re
>>> s = "103273;CAN D MAT;B.C.;B.C.;B.C.;03-Apr-2006"
>>> re.sub(r'(?<=;)B[.]C[.](?=;)', "", s)
'103273;CAN D MAT;;;;03-Apr-2006'

... But in this case it may be better to split the line into fields on ;, replace the fields that match the strings you want to erase, and join the strings together again.

>>> fields = s.split(';')
>>> for i, f in enumerate(fields):
...     if f in ('B.C.', '#DIV/0!'):
...         fields[i] = ''
... 
>>> ';'.join(fields)
'103273;CAN D MAT;;;;03-Apr-2006'

This has two main advantages: you don't have to write a fairly complex regular expression for each replaced string; and it will still work if one of the fields is at the beginning or end of the line.
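
To illustrate the second point, here is a rough comparison of the two approaches on a made-up row where the junk value also sits in the first and last field. The function names and the sample row are assumptions for the sake of the example:

```python
import re

JUNK = ("B.C.", "#DIV/0!")

def clean_with_regex(line):
    # Lookaround version: requires a ';' on both sides of the field.
    return re.sub(r"(?<=;)B[.]C[.](?=;)", "", line)

def clean_with_split(line):
    # Split version: compares each whole field, wherever it sits in the line.
    return ";".join("" if field in JUNK else field for field in line.split(";"))

row = "B.C.;CAN D MAT;B.C.;B.C."
print(clean_with_regex(row))  # B.C.;CAN D MAT;;B.C.
print(clean_with_split(row))  # ;CAN D MAT;;
```

The regular expression could be extended to cover the line edges as well, but the split version needs no such special-casing.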

For any CSV parsing more complicated than this (for example, if any fields can contain quoted ; characters, or if the file has a header that should be skipped), look into the csv module.
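
As a sketch of what that might look like, assuming the files quote any field that contains a literal ; (which may or may not hold for your data), the csv module can do the splitting and joining. The helper name clean_csv_text and the second sample row are hypothetical:

```python
import csv
import io

JUNK = {"B.C.", "#DIV/0!"}

def clean_csv_text(text, junk=JUNK):
    # Parse ';'-delimited text, blank out junk cells, and write it back.
    # Quoted fields such as "a;B.C.;b" are read as a single cell.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=";")
    out = io.StringIO(newline="")
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    for row in reader:
        writer.writerow(["" if cell in junk else cell for cell in row])
    return out.getvalue()

sample = (
    "103273;CAN D MAT;B.C.;B.C.;B.C.;03-Apr-2006\n"
    '1;"a;B.C.;b";B.C.;x\n'
)
print(clean_csv_text(sample), end="")
```

For real files, the same reader and writer can be pointed at file objects opened with newline="" instead of the in-memory strings used here.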

3 Comments

I'll try this out. I'm not very sure of the joining the strings part. The data is a bit of a mess and it contains commas and semi colons both as part of cell values as well. (The file is csv and has semicolon delimiter but the string also has these characters)
@SagarJoshi If the literal semicolons are quoted, e.g. appear in the data as a\;b or "a;b", then you should use the csv module to parse it. If that won't work, regular expressions may be the best way to go (although not necessarily exactly as I've done here).
(It may be worth stating explicitly that csv supports many dialects, such as ;-delimited files and various forms of quoting.)
